Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request Collection for WorkGraph #6561

Open
3 tasks
superstar54 opened this issue Aug 30, 2024 · 4 comments
Open
3 tasks

Feature Request Collection for WorkGraph #6561

superstar54 opened this issue Aug 30, 2024 · 4 comments
Labels
type/feature request status undecided

Comments

@superstar54
Copy link
Member

superstar54 commented Aug 30, 2024

To gain an overall view of feature requests related to WorkGraph, I'm opening this issue to collect and track them in one place. While creating individual issues for each request is also beneficial, this list will help maintain an overview, especially since some of these issues are interrelated. I want to ensure developers can easily locate relevant feature requests without having to sift through hundreds of other issues.

Feature Requests

Note: this list will be continuously updated.

  • Allow Passing Parent PID to the Process During instantiate_process:
    In aiida-core, this is handled automatically. However, in WorkGraph, we need more control over this process to ensure proper task scheduling.

  • Enable Submission of calcfunction and workfunction to the Daemon:
    This feature would prevent these functions from blocking the scheduling and response of processes.

  • Support multiple queues
    We may want a process (Scheculer) to listen to a task queue that is different from the default one. kiwipy supports it, but in aiida-core, we always use the default one, and does not expose the API for selecting queues, e.g., in manager.create_daemon_runner., RemoteProcessThreadController.task_send

Discussion and Implementation

We can discuss whether to implement these features directly in aiida-core or elsewhere. If anyone is interested in taking on any of the above features, please feel free to proceed.

@superstar54 superstar54 added the type/feature request status undecided label Aug 30, 2024
@sphuber
Copy link
Contributor

sphuber commented Aug 30, 2024

Thanks @superstar54 . Could you provide a bit more technical context for "Allow Passing Parent PID to the Process During instantiate_process". Please link to the relevant code in aiida-workgraph explain what you are trying to accomplish and what you currently cannot do.

As for allowing to submit process functions to the daemon: the tricky part here is just to make sure the daemon can actually import the function code. If the function is importable, then there should be no problem. However, process functions currently do not always have to be importable, they could be defined inline in a shell, notebook, or script, in which case they cannot be imported by the daemon. One could try to resort to using pickling off the function and have the daemon unpickle it instead of importing. But we have to be careful that when we enable this, that the user won't be confused/frustrated when sometimes submitting to the daemon works, and other times it doesn't.

Also, could you please explain exactly why process functions not being submittable is causing a problem at the moment for scheduling in aiida-workgraph?

@superstar54
Copy link
Member Author

superstar54 commented Aug 30, 2024

Hi @sphuber, thanks for your comments!

Using parent_pid

When running a workflow (such as a WorkChain or WorkGraph), each workflow is associated with a corresponding process. This process launches and waits for the child processes (e.g., CalcJob processes). In nested workflows like the PwBandsWorkChain, you may encounter multiple Workflow processes in a waiting state, with only one CalcJob process actively running. These waiting Workflow processes can be seen as inefficient resource usage.

In a WorkChain, the workflow logic is encapsulated within the new WorkChain class, making it challenging to eliminate these waiting processes at the moment. However, in a WorkGraph, the logic is more explicitly defined, and it has strict rules on who can execute this logic.

To address this, I proposed a Scheduler for the WorkGraph in this draft PR. The Scheduler handles the following:

  • It creates the WorkGraph process only in the database without actually running the process by a daemon worker.
  • It analyzes task dependencies and, if the task is a CalcJob, it launches it to the daemon worker as usual. The key difference here is that the Scheduler uses the WorkGraph's PK as the parent PID, thereby maintaining correct provenance.

Let's compare the process count for the PwBands case. Suppose we launch 100 PwBands WorkGraphs:

  • Old Approach: 300 Workflow processes (Bands, Relax, Base), 100 CalcJob processes.
  • New Approach: 1 Scheduler process, 100 CalcJob processes. 300 Workflow processes but not active.

The benefit is clear: the new approach significantly reduces the number of active processes. Moreover, the Scheduler runs in a separate daemon that does not listen to process launching tasks, thereby eliminating the possibility of deadlocks that could occur with the old approach.

Regarding performance concerns, will the Scheduler struggle to handle 300 WorkGraphs? Based on my current tests, performance remains comparable. However, it's worth noting that I did not run calcfunctions, which could will block the Scheduler. Therefore, I’m requesting support for submitting calcfunctions.

One could try to resort to using pickling off the function and have the daemon unpickle it instead of importing. But we have to be careful that when we enable this, that the user won't be confused/frustrated when sometimes submitting to the daemon works, and other times it doesn't.

Yes, using pickle is indeed a viable option. Since both pickling and unpickling occur within the same environment, it should function correctly.

I @ you in the PR where I modified the instantiate_process function. The PR is in draft mode, and the code is a little messy, so you don't need to dive into it for the moment 😆

@sphuber
Copy link
Contributor

sphuber commented Sep 4, 2024

Thanks @superstar54 . I think I understand a bit better what you are trying to accomplish. Did you ever measure the "cost" of having the idling parent processes that are sitting in a daemon worker? Do we actually know that they use a lot of resources? Is it mostly CPU cost that you are worried about or memory?

Your idea may however be interesting in solving the problem where daemon workers have all their process slots with waiting parent processes. This has the potential for causing a deadlock. Although that has never really produced itself in practice, it is still not ideal and we work around it by allowing to increase the number of slots per worker. If we could elegantly solve this, that would be great.

I am not so sure about the implementation though. I might not fully understand it yet, but if I understand correctly, you implement a new Process called the WorkGraphScheduler which functions sort of as the entry point to run other WorkGraphs? And then there is a separate daemon with its own workers (and RMQ queue?!) to handle these. How would the user work with these in practice? Do they know have to manage two separate daemons? Do they have to relaunch the WorkGraphScheduler each time or is this associated with a single WorkGraph. In short, could you provide an example of how you imagine this being used in real use-cases.

And I think I now understand why you need the parent_id to be passed to the instantiate_process. Normally the parent_id is set in the Process.on_create which will fetch it simply from the current active process (by taking it from the top of the call stack) which should always be the parent process. But in your setup, the parent process is not actually running when its child process is created, correct? I guess it makes sense to allow instantiate_process to specify it because the Process constructor also already allows for the parent_pid to be set.

@superstar54
Copy link
Member Author

Yes, your understanding is correct. Here is an example use case. Users need to start a separate daemon for the Scheduler Process.

workgraph scheduler start

This will create a runner, and launch a WorkGraphScheduler process (restart if a process exist)

To submit a WorkGraph to the scheduler, set the to_scheduler flag to True:

wg.submit(to_scheduler=True)

This will send a msg to the workgraph_queue in RMQ. The WorkGraphScheduler process will pick up the WorkGraph from the queue. The maximum number of WorkGraph, a WorkGraphScheduler process can pick up from the queue is limited by the prefecth_count (e.g., 200). Inside the Scheduler process, it analyzes the tasks' dependencies:

  • if the task is a calcjob submits the task to the normal daemon (let's call it task daemon, compared to the scheduler daemon), and wait for it to finish and run the callback, similar to WorkChain.
  • if the task is a workgraph, there is no need to submit it; the WorkGraphScheduler process will pick up it directly. If a WorkGraph submit many sub-WorkGraph, there could be a performance issue here. However, if we submit the WorkGraph to the queue and wait for a Scheduler to pick it up, there could be a deadlock issue again.

Of couse, users can still submit the WorkGraph directly without the Scheduler.

wg.submit()

Do we actually know that they use a lot of resources? Is it mostly CPU cost that you are worried about or memory?

I haven't measured the CPU and memory usage yet, but I believe that idling processes don’t add much to either. Initially, my concern was that idle processes would occupy worker slots (e.g., with a default limit of 200), requiring users to increase the number of workers. Since the number of workers is typically limited by the number of CPUs, I saw this as an issue. However, as you pointed out, users can increase the number of slots per worker, meaning idle processes won’t necessarily waste slots.

That said, increasing the number of slots could overwhelm a worker and prevent the WorkGraph from scheduling new tasks efficiently. Tasks like calcfunction or calcjob (e.g., its parser) may take a long time, potentially blocking the worker. This is why I want to implement a dedicated Scheduler that only handles WorkGraph-related tasks and doesn't accept others. A Scheduler focused on WorkGraph tasks would only be responsible for analyzing task dependencies and submitting tasks, which shouldn’t overload it.

Other upsides of the scheduler: 1) we may control the total number of running tasks (e.g., limit the number of calcjob on a computer); 2) we may control the priority of the workgraph.

The downsides of the scheduler: the user needs to maintain a separate daemon and a special Scheduler process.

This is still an idea in development, and it's currently implemented in a draft PR. Any comments and suggestions are welcomed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature request status undecided
Projects
None yet
Development

No branches or pull requests

2 participants