core.direct scheduler: kill command doesn't stop the underlying jobs #6571
Can reproduce this with:

```python
from aiida import load_profile, engine, orm

load_profile()

builder = orm.load_code("bash@localhost").get_builder()
builder.x = orm.Int(2)
builder.y = orm.Int(3)
builder.metadata.options.sleep = 100000

engine.run(builder)
```
Contrary to what I initially thought, the `sleep` is kept alive even if we kill the parent process. One solution would be to change `aiida-core/src/aiida/schedulers/plugins/bash.py`, lines 62 to 74 (at 72a6b18), so that we do:

```python
def kill_job(self, jobid: str) -> bool:
    """Kill a remote job and parse the return value of the scheduler to check if the command succeeded.

    .. note::

        On some schedulers, even if the command is accepted, it may take some seconds for the job to actually
        disappear from the queue.

    :param jobid: the job ID to be killed
    :returns: True if everything seems ok, False otherwise.
    """
    import psutil

    process = psutil.Process(int(jobid))
    children = process.children(recursive=True)
    jobids = [str(child.pid) for child in children]
    jobids.append(jobid)
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_kill_command(" ".join(jobids)))
    return self._parse_kill_output(retval, stdout, stderr)
```

EDIT: this should be moved to the direct scheduler.
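The underlying problem can be demonstrated outside AiiDA: signalling only the parent PID leaves its children running, while signalling the whole process tree takes them down together. Here is a minimal standard-library sketch of that tree-kill idea, using POSIX process groups instead of psutil (Linux/macOS only; the `bash -c "sleep 1000 & wait"` command just stands in for a submit script wrapping a long-running job):

```python
import os
import signal
import subprocess
import time

# Spawn a shell that forks a long-running child, mimicking how the direct
# scheduler's submit script wraps the actual job (sleep, pw.x, ...).
# start_new_session=True makes the shell the leader of a fresh process
# group, so the whole tree can be addressed with a single killpg call.
parent = subprocess.Popen(
    ["bash", "-c", "sleep 1000 & wait"],
    start_new_session=True,
)
time.sleep(0.5)  # give bash time to fork the sleep child

# Sending SIGTERM to just parent.pid (what a plain `kill <pid>` does) would
# orphan the sleep, which keeps running -- the bug reported here.
# Signalling the process group terminates the shell and its children:
os.killpg(os.getpgid(parent.pid), signal.SIGTERM)
parent.wait(timeout=5)
print("parent return code:", parent.returncode)
```

A negative return code from `Popen.wait` indicates the process died from that signal, confirming the whole group was terminated.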
Describe the bug

I encountered the problem that my jobs are not properly killed when running on `localhost` using the `core.direct` scheduler. For example, when I kill a `PwCalculation`, the corresponding `CalcJobNode` disappears from `verdi process list` and is marked as killed. However, the underlying `pw.x` jobs are still running according to my CPU consumption (also visible if I run the `top` command).

Your environment
Additional context

Initially, I thought that this was related to the `verdi presto` command, as I had only observed this behavior with my `presto` profiles. However, after manually creating a new computer and specifying `core.slurm` as the scheduler, the problem disappeared. Therefore, it really seems to be related to the `core.direct` scheduler.
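For checking by hand whether a killed calculation left orphaned processes behind, the child enumeration that the psutil-based fix relies on can be sketched with the standard library alone. This is a Linux-only illustration (it walks `/proc`), not AiiDA code; `descendant_pids` is a hypothetical helper mirroring `psutil.Process(pid).children(recursive=True)`:

```python
import os
import signal
import subprocess
import time

def descendant_pids(root_pid: int) -> list[int]:
    """Collect all descendant PIDs of root_pid by walking /proc (Linux only)."""
    # Build a parent -> children map from /proc/<pid>/stat, whose fourth
    # field is the parent PID.
    children: dict[int, list[int]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                stat = handle.read()
        except OSError:
            continue  # process vanished while we were scanning
        # The command name sits in parentheses and may itself contain
        # spaces or ')', so split on the *last* ')' before parsing fields.
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        children.setdefault(ppid, []).append(int(entry))

    # Walk the tree from the root, collecting every descendant.
    found, queue = [], [root_pid]
    while queue:
        pid = queue.pop()
        for child in children.get(pid, []):
            found.append(child)
            queue.append(child)
    return found

# Demo: a bash parent forking one sleep child yields at least one descendant.
proc = subprocess.Popen(["bash", "-c", "sleep 30 & wait"])
time.sleep(0.5)  # give bash time to fork the sleep child
pids = descendant_pids(proc.pid)
print("descendants:", pids)

# Killing only proc.pid would orphan the sleep; kill the descendants too,
# which is exactly what the proposed kill_job change does via psutil.
proc.kill()
for pid in pids:
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # already gone
proc.wait()
```

Running this against a live `CalcJobNode`'s scheduler PID (instead of the demo `bash`) would show which processes survive a `verdi process kill`.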