What It Does (10 minute version)
================================

This section details what PyTaskFarmer does behind the scenes. It may help if
you see unexpected behavior or want to know what the script is capable of.

PyTaskFarmer uses a series of files to track the progress of processing a
tasklist. Using a shared file system for communication between multiple
instances of PyTaskFarmer removes the need for an omnipresent scheduler
process.

All files are stored inside the working directory (`workdir`). Access to them
is protected by file locking to prevent race conditions among workers or
multiple instances of PyTaskFarmer. Locks are written to :code:`SCRATCH`
because other file systems at NERSC do not support file locking. If the system
does not define `SCRATCH`, a `lock` file inside `workdir` is used instead.

The PyTaskFarmer program does the following:

1. The tasklist is parsed by a tasklist handler to create the list of commands
   to execute (`tasks`). If the `toprocess` file does not exist, the tasks are
   stored there. If the file already exists, the assumption is that you are
   restarting the task farmer and it should continue from where the last
   farmer left off. This also allows multiple farmers (on multiple nodes) to
   process the same task list. Such parallel farmers do not compete with each
   other, but share the tasks.

2. The requested number of workers (:code:`multiprocess.Pool`) is initiated.

3. The workers are assigned a list of jobs whose length matches the total
   number of input tasks. Each job fetches the next available task. If no more
   tasks are available, the job finishes quickly. This guarantees that enough
   jobs are spawned to process all tasks, but does not waste resources on
   completed tasks.

4. Each task is passed to a runner that then executes it. The runner can
   perform additional steps, such as setting up the environment or executing
   the task inside a container (e.g. Shifter).

5. The output/error stream of each task is stored in a logfile at
   :code:`logs/log_N`, where `N` is the task ID. The task ID corresponds to
   the position of the command in the original `toprocess` file, counting
   from 0. Note that the tasks might not finish in this order.

6. The task's exit code is used to put it into the `finished` or `failed`
   file upon completion. Exit code 0 indicates success.

7. If the farmer catches either a timeout or a :code:`SIGUSR1`, the worker
   pool is immediately killed in a clean fashion. Any tasks that are being
   executed are added back to the `toprocess` list.

**Note**

- The workers don't know (or care) what command they run. That means if your
  single-line commands use 4 threads at a time, you can ask PyTaskFarmer to
  run 64/4=16 processes and it will have 16 four-thread processes running at
  a time.

- If your program can fully utilize a node (64 threads on Cori Haswell), you
  can ask the farmer to run one process at a time. This is equivalent to
  running the commands in your text file in order, but with support for
  checkpointing the per-file progress.
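
The claim-and-record cycle described in steps 3 and 6 can be illustrated with
a minimal sketch. This is not PyTaskFarmer's actual code: the helper names
:code:`locked`, :code:`fetch_next_task` and :code:`run_job` are hypothetical,
the lock is assumed to be a POSIX advisory lock on the `lock` file inside
`workdir`, and task IDs, log files and signal handling are left out.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def locked(workdir):
    """Hold an exclusive advisory lock on a lock file inside workdir (assumed layout)."""
    with open(os.path.join(workdir, "lock"), "w") as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)


def fetch_next_task(workdir):
    """Remove and return the first line of toprocess, or None if no tasks remain."""
    path = os.path.join(workdir, "toprocess")
    with locked(workdir):
        with open(path) as f:
            tasks = f.read().splitlines()
        if not tasks:
            return None
        # Write back the remaining tasks so no other worker can claim this one.
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in tasks[1:])
        return tasks[0]


def run_job(workdir):
    """Claim one task, run it, and record it in finished or failed by exit code."""
    task = fetch_next_task(workdir)
    if task is None:
        return None  # nothing left to do: the job finishes quickly
    returncode = subprocess.run(task, shell=True).returncode
    outcome = "finished" if returncode == 0 else "failed"
    with locked(workdir):
        with open(os.path.join(workdir, outcome), "a") as f:
            f.write(task + "\n")
    return outcome
```

Calling :code:`run_job` once per input task, for example from a pool of
worker processes, would mirror step 3: surplus jobs return immediately once
`toprocess` is empty, and several farmers pointed at the same `workdir` would
share the tasks rather than duplicate them.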