What It Does (10 minute version)
================================

Details on what PyTaskFarmer does behind the scenes. This might help in case you
see unexpected behavior or want to know what the script is capable of.

PyTaskFarmer uses a series of files to track the progress of processing a
tasklist. The usage of a shared file system for communication between multiple
instances of PyTaskFarmer removes the need for an omniresent scheduler process.

All files are stored inside the working directory (`workdir`). The access to
them is protected using file locking mechanisms to prevent race conditions among
workers or multiple instances of PyTaskFarmer. Locks are written to
:code:`SCRATCH` as other file systems at NERSC do not support file locking. In
case the system does not define `SCRATCH`, a `lock` file inside `workdir` is
used instead.

The PyTaskFamer program does the following:

1. The tasklist is parsed by a tasklist handler to create the list of commands
   to execute (`tasks`). If the `toprocess` file does not exist, then the tasks
   are stored there. If the file already exists, the assumption is that you are
   re-starting the task farmer and it should continue from where the last farmer
   left off. This also allows for multiple farmers (on multiple nodes)
   processing the same same task list. Such parallel farmers do not compete with
   each other, but share the tasks.

2. The requested number of workers (:code:`multiprocess.Pool`) is intitiated.

3. The workers are assigned a list of jobs that matches the total number of
   input tasks to process in parallel. Each job fetches the next available task.
   If no more tasks are available, then the job finishes quickly. This
   guarantees that enough jobs are spawened to process all tasks, but does not
   waste resources on completed tasks.

4. Each task is passed to a runner that then executes it. The runner can perform
   additional steps like setup the environment or execute the task inside a
   container (ie: shifter).

5. The output/error stream of each task is stored in a logfile at
   :code:`logs/log_N`, where `N` is the task ID. The task ID corresponds to
   the order that the command is written in the original `toprocess` file
   starting at 0. Note that the tasks might not finish in this order.

6. The task's exit code is used to put it into the `finished` or `failed` file
   upon completion. Exit code 0 indicates success.

7. If the farmer catches either a timeout or a :code:`SIGUSR1`, then the
   worker pool is immediately killed in a clean fasion. Any tasks that are being
   executed are added back to the `toprocess` list.

**Note**

- The workers don't know (or care) what command they run. That means if your
  single-line commands use 4 threads at a time, then you can ask PyTaskFarmer to
  run 64/4=16 processes and it will have 16 four-thread processes running at a
  time.

- If your program can fully utilize a node (64 threads on Cori Haswell), then
  you can ask the farmer to run one process at a time. This is equivalent to
  running the commands in your text file in order, but with support for
  checkpointing the per-file progress.