core.direct scheduler: kill command doesn't stop the underlying jobs #6571
Can reproduce this with:

```python
from aiida import load_profile, engine, orm

load_profile()

builder = orm.load_code("bash@localhost").get_builder()
builder.x = orm.Int(2)
builder.y = orm.Int(3)
builder.metadata.options.sleep = 100000

engine.run(builder)
```
Contrary to what I initially thought, the `sleep` is kept alive even if we kill the parent process. One solution would be to change `aiida-core/src/aiida/schedulers/plugins/bash.py`, lines 62 to 74 (at 72a6b18), so that we do:

```python
def kill_job(self, jobid: str) -> bool:
    """Kill a remote job and parse the return value of the scheduler to check if the command succeeded.

    .. note::

        On some schedulers, even if the command is accepted, it may take some seconds for the job to actually
        disappear from the queue.

    :param jobid: the job ID to be killed
    :returns: True if everything seems ok, False otherwise.
    """
    import psutil

    process = psutil.Process(int(jobid))
    children = process.children(recursive=True)
    jobids = [str(child.pid) for child in children]
    jobids.append(jobid)
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_kill_command(" ".join(jobids)))
    return self._parse_kill_output(retval, stdout, stderr)
```

EDIT: this should be moved to the direct scheduler.
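The underlying problem can be demonstrated outside AiiDA: signalling only the parent PID leaves its children running, while signalling the whole process tree takes them down together. Here is a minimal standard-library sketch of that tree-kill idea, using POSIX process groups instead of psutil (Linux/macOS only; the `bash -c "sleep 1000 & wait"` command just stands in for a submit script wrapping a long-running job):

```python
import os
import signal
import subprocess
import time

# Spawn a shell that forks a long-running child, mimicking how the direct
# scheduler's submit script wraps the actual job (sleep, pw.x, ...).
# start_new_session=True makes the shell the leader of a fresh process
# group, so the whole tree can be addressed with a single killpg call.
parent = subprocess.Popen(
    ["bash", "-c", "sleep 1000 & wait"],
    start_new_session=True,
)
time.sleep(0.5)  # give bash time to fork the sleep child

# Sending SIGTERM to just parent.pid (what a plain `kill <pid>` does) would
# orphan the sleep, which keeps running -- the bug reported here.
# Signalling the process group terminates the shell and its children:
os.killpg(os.getpgid(parent.pid), signal.SIGTERM)
parent.wait(timeout=5)
print("parent return code:", parent.returncode)
```

A negative return code from `Popen.wait` indicates the process died from that signal, confirming the whole group was terminated.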
Describe the bug

I encountered the problem that my jobs are not properly killed when running on `localhost` using the `core.direct` scheduler. For example, when I kill a `PwCalculation`, the corresponding `CalcJobNode` disappears from `verdi process list` and is marked as killed. However, the underlying `pw.x` jobs are still running according to my CPU consumption (also visible if I run the `top` command).

Your environment
Additional context

Initially, I thought that this was related to the `verdi presto` command, as I had only observed this behavior with my `presto` profiles. However, after manually creating a new computer and specifying `core.slurm` as the scheduler, the problem disappeared. Therefore, it really seems to be related to the `core.direct` scheduler.
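For checking by hand whether a killed calculation left orphaned processes behind, the child enumeration that the psutil-based fix relies on can be sketched with the standard library alone. This is a Linux-only illustration (it walks `/proc`), not AiiDA code; `descendant_pids` is a hypothetical helper mirroring `psutil.Process(pid).children(recursive=True)`:

```python
import os
import signal
import subprocess
import time

def descendant_pids(root_pid: int) -> list[int]:
    """Collect all descendant PIDs of root_pid by walking /proc (Linux only)."""
    # Build a parent -> children map from /proc/<pid>/stat, whose fourth
    # field is the parent PID.
    children: dict[int, list[int]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                stat = handle.read()
        except OSError:
            continue  # process vanished while we were scanning
        # The command name sits in parentheses and may itself contain
        # spaces or ')', so split on the *last* ')' before parsing fields.
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        children.setdefault(ppid, []).append(int(entry))

    # Walk the tree from the root, collecting every descendant.
    found, queue = [], [root_pid]
    while queue:
        pid = queue.pop()
        for child in children.get(pid, []):
            found.append(child)
            queue.append(child)
    return found

# Demo: a bash parent forking one sleep child yields at least one descendant.
proc = subprocess.Popen(["bash", "-c", "sleep 30 & wait"])
time.sleep(0.5)  # give bash time to fork the sleep child
pids = descendant_pids(proc.pid)
print("descendants:", pids)

# Killing only proc.pid would orphan the sleep; kill the descendants too,
# which is exactly what the proposed kill_job change does via psutil.
proc.kill()
for pid in pids:
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # already gone
proc.wait()
```

Running this against a live `CalcJobNode`'s scheduler PID (instead of the demo `bash`) would show which processes survive a `verdi process kill`.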