Job processing (introduction)

The job processing pipeline

BOINC is designed to process large numbers of jobs, possibly millions per project per day. Jobs move through a software pipeline:

[Figure: the job-processing pipeline. Application-specific components are shown in heavy boxes.]


The pipeline stages:

  • A work generator creates jobs and moves their input files into a 'staging area' from which worker nodes can fetch them via HTTP. Work generators are application-specific. They may create batches of jobs, or 'stream' jobs into the pipeline at a rate that keeps it full without overflowing (see the sketch after this list).
  • The scheduler (part of BOINC) dispatches jobs to clients.
  • The client downloads input files as needed, runs the application program, and uploads output files when the job is completed.
  • A validator examines output files and tries to verify that they are correct. Validators are generally application-specific. They may syntax-check individual output files, or compare pairs of output files for replication-based validation.
  • An assimilator handles a completed and validated job. Assimilators are application-specific: one might move output files from BOINC's staging area to a final destination; another might parse output files and insert their contents into a database.
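
A streaming work generator can be sketched as a loop that tops up the supply of unsent jobs. The Python sketch below is illustrative only: it assumes the standard create_work command-line tool, and the flag names and the placeholder job count are assumptions to verify against your server's create_work --help.

    # Sketch of a 'streaming' work generator. Not the official sample;
    # the create_work flag names are assumptions.
    import subprocess
    import time
    import uuid

    APP = "my_app"        # hypothetical app name registered with the project
    CUSHION = 100         # keep at least this many jobs waiting to be sent

    def count_unsent_jobs():
        # Placeholder: a real generator would query the BOINC database
        # for this app's results in the UNSENT state.
        return 0

    def make_job():
        # Create one job whose input file is already staged for download.
        name = "job_%s" % uuid.uuid4().hex
        subprocess.check_call([
            "bin/create_work",
            "--appname", APP,
            "--wu_name", name,
            "staged_input_file",   # hypothetical input file name
        ])

    while True:
        for _ in range(max(0, CUSHION - count_unsent_jobs())):
            make_job()
        time.sleep(60)   # sleep, then top the queue up again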

The application-specific parts of the pipeline can be written in C++ (for maximum performance) or in scripting languages such as Python and PHP.

Staging input and output files

BOINC stores input and output files in staging areas on the server, which clients access via HTTP. These areas are 2-level directory hierarchies, one for uploads and one for downloads. Each hierarchy has 1024 subdirectories (named '0' through '3ff' in hex). A file is placed in a subdirectory based on the MD5 hash of its filename.

This layout is used because projects can have large numbers (millions) of files at a given time, and some filesystems become extremely slow when a single directory contains that many files.
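
The mapping from filename to subdirectory can be sketched as follows. This mirrors the scheme just described, but the exact way BOINC reduces the MD5 digest to a directory index is an assumption; check the server source (e.g. dir_hier_path) for the authoritative version.

    # Map a filename to its staging subdirectory: MD5-hash the name and
    # reduce it modulo the fanout (1024), formatted in hex ('0'-'3ff').
    import hashlib

    FANOUT = 1024

    def staging_subdir(filename):
        digest = hashlib.md5(filename.encode()).hexdigest()
        return "%x" % (int(digest, 16) % FANOUT)

    # e.g. staging_subdir("job_42_input") yields a name such as '2d7'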

Before a job is submitted, its input files are moved or copied to the download staging area. When a client completes a job, it uploads the output files to the appropriate place in the upload staging area.
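
Projects usually script this staging step. The sketch below assumes the stage_file helper in the project's bin directory; its options vary between BOINC versions, so check its --help before relying on it.

    # Copy an input file into the download staging hierarchy before
    # submitting jobs that reference it. Assumes bin/stage_file exists.
    import subprocess

    def stage_input(path):
        subprocess.check_call(["bin/stage_file", path])

    stage_input("inputs/job_42_input")   # hypothetical local file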

Identity of job submitters

In projects with multiple job submitters, each submitter is identified by their user account on the project (the same kind of account volunteers have; job submitters can also act as resource providers).

The administrator of a BOINC project can grant (using the admin web interface) job-submission privileges to users, either for particular apps or for all apps. Job submitters can be given different 'resource shares' (e.g. one user gets 50% of the project's computing, another 20%, another 30%) and the scheduler enforces these shares.

Batches

For convenience, jobs may be collected into batches. Batches may be monitored and controlled as a unit. A batch is associated with the user ID of its submitter.

Interfaces for job submission and file management

There are three general ways to submit jobs and manage the associated files:

Server-level

In this approach, job submitters (scientists) log in to the server and submit jobs by running command-line programs.

Remote via RPC

BOINC provides a set of XML/HTTP RPCs for managing files and submitting jobs. These RPCs have Python bindings, which make it possible to create systems that let scientists process jobs without logging in to the BOINC server.
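
A minimal use of the Python bindings might look like the sketch below. The class and field names follow tools/submit_api.py in the BOINC source tree, but treat them as assumptions and check the module; the project URL, account key, and app name are hypothetical.

    # Sketch: submit a batch of ten jobs via the remote submission RPCs.
    from submit_api import BATCH_DESC, JOB_DESC, FILE_DESC, submit_batch

    batch = BATCH_DESC()
    batch.project = "https://example.edu/myproject/"  # hypothetical URL
    batch.authenticator = "ACCOUNT_KEY"               # submitter's account key
    batch.app_name = "my_app"
    batch.batch_name = "batch_1"
    batch.jobs = []

    for i in range(10):
        f = FILE_DESC()
        f.mode = "local_staged"        # file is already in the download area
        f.source = "job_%d_input" % i
        job = JOB_DESC()
        job.files = [f]
        batch.jobs.append(job)

    reply = submit_batch(batch)        # parsed XML reply from the server
    print(reply)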

Remote via web interface

You can develop application-specific web interfaces that run on the BOINC project server, and that allow scientists to manage files and submit jobs remotely via a web browser.

Failures and retries

A job can fail on a BOINC worker node for a variety of reasons:

  • The application crashes.
  • The user on that node aborts the job.
  • The job exceeds its memory or disk space limits.
  • The job times out.

In some cases the job would succeed if it were run on a different node.

By default, jobs are not retried. You can, however, configure an app so that jobs are retried up to some limit: if a job fails on a node, a second copy (or 'instance') of the job is sent to a different node. This repeats until an instance succeeds or the limit is reached; in the latter case the job is marked as failed and no further instances are created.
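
In the standard database schema the retry limit corresponds to the workunit's max_error_results field, typically set when the job is created (create_work has a matching option; treat the exact name as an assumption). The decision the server makes for each finished instance can be sketched as:

    # Simplified sketch of the per-instance retry decision; the real
    # logic lives in the transitioner and also handles replication
    # and timeouts.
    def handle_instance_outcome(job, outcome, max_error_results=3):
        if outcome == "success":
            job["state"] = "awaiting validation"
            return job
        job["errors"] = job.get("errors", 0) + 1
        if job["errors"] >= max_error_results:
            job["state"] = "failed"      # limit reached: give up
        else:
            job["state"] = "retry"       # send another instance elsewhere
        return job

    job = {"name": "job_42"}
    print(handle_instance_outcome(job, "crash"))   # -> state 'retry'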
