Kill descendant processes in core.direct schedulers plugin #6572
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #6572 +/- ##
==========================================
+ Coverage 77.51% 77.85% +0.35%
==========================================
Files 560 566 +6
Lines 41444 42044 +600
==========================================
+ Hits 32120 32730 +610
+ Misses 9324 9314 -10
Thanks @agoscinski, really fast in hunting bugs :)
I've added a minor comment.
In any case, it would be nice to add some regression tests.
process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'
Just a side note:
I've encountered cases where kill PID silently returns without actually killing the job.
I would suggest handling this scenario: if the PID still exists after sending kill PID, report it with a log message.
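One way to act on this suggestion is sketched below. It is not code from this PR: the helper name find_survivors and the timeout value are hypothetical, and it assumes the PIDs that were passed to kill are still at hand. It relies on psutil.wait_procs, which splits a list of processes into those that exited and those still alive after a timeout.

```python
import logging

import psutil

logger = logging.getLogger(__name__)


def find_survivors(pids, timeout=3.0):
    """Return the PIDs from ``pids`` that are still alive after ``timeout`` seconds.

    Hypothetical helper: meant to be called after the ``kill`` command was sent.
    """
    procs = []
    for pid in pids:
        try:
            procs.append(psutil.Process(int(pid)))
        except psutil.NoSuchProcess:
            # The process is already gone, so there is nothing to wait for.
            continue

    # wait_procs returns (gone, alive) once the timeout has elapsed.
    _gone, alive = psutil.wait_procs(procs, timeout=timeout)
    survivors = [proc.pid for proc in alive]
    for pid in survivors:
        logger.warning('process %s is still running after `kill` was sent', pid)
    return survivors
```

Whether a surviving process should only be logged or also escalated (for example with kill -9) is a design decision left open here.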
Thanks @agoscinski. The tests seem to be hanging, so those need to be fixed. I also have a few comments.
def _get_kill_command(self, jobid):
    """Return the command to kill the job with specified jobid."""
    submit_command = f'kill {jobid}'
def _get_kill_command(self, process_id):
By changing jobid to process_id you broke the log line on line 370. Either keep it as jobid or adapt the other lines that reference it accordingly. This would be a breaking change, but since it is an internal method it is OK to change.
# get a list of the process id of all descendants
process = Process(int(process_id))
children = process.children(recursive=True)
process_ids = [process_id]
I think you should cast to str here explicitly to be safe. Before, it was used in an f-string, which automatically casts, but now you are using it as arguments to ' '.join(), which will fail if the elements are not all strings.
- process_ids = [process_id]
+ process_ids = [str(process_id)]
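To illustrate the failure mode, here is a small standalone sketch. The PID values are made up for the example, and it assumes the PID may arrive as an int.

```python
process_id = 4242          # hypothetical PID, supplied as an int
child_pids = [4243, 4244]  # hypothetical descendant PIDs

# str.join() only accepts strings, so mixing in an int raises TypeError.
try:
    ' '.join([process_id] + child_pids)
except TypeError as exc:
    error = str(exc)

# Casting every element explicitly makes the join safe for int and str input alike.
process_ids = [str(process_id)]
process_ids.extend(str(pid) for pid in child_pids)
kill_command = f"kill {' '.join(process_ids)}"
```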
process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'
Might as well take the opportunity to fix the variable name
- submit_command = f'kill {process_ids_str}'
+ kill_command = f'kill {process_ids_str}'
Proposal to solve #6571
In the direct scheduler we use psutil to obtain a list of descendant processes so that we can kill all of them. This issue does not occur with the other schedulers, because the job scheduler takes care of it. Here we have to manage the killing of the descendants ourselves.
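The approach can be sketched as a standalone function. This is an illustration assembled from the snippets under review, not the merged implementation: the name build_kill_command is hypothetical, and the fallback for a root process that has already exited is an assumption added for the example.

```python
import psutil


def build_kill_command(process_id):
    """Return a ``kill`` command targeting ``process_id`` and all of its descendants."""
    process_ids = [str(process_id)]
    try:
        # children(recursive=True) walks the whole tree below the root process.
        children = psutil.Process(int(process_id)).children(recursive=True)
    except psutil.NoSuchProcess:
        # Assumption: if the root process is already gone, fall back to its bare PID.
        children = []
    process_ids.extend(str(child.pid) for child in children)
    return f"kill {' '.join(process_ids)}"
```

Calling build_kill_command with the PID of a job that spawned two workers would return a string of the form kill ROOT CHILD1 CHILD2, with the root PID always listed first.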