Running at NERSC
================

Tips and tricks for using PyTaskFarmer on NERSC machines (i.e. Cori).

You can use PyTaskFarmer as part of your top-level batch script for submissions
into the NERSC Slurm batch system. There are a variety of examples for running
multi-core or multi-node jobs `available here`_.

.. _`available here`: https://docs.nersc.gov/jobs/examples/

Equalize Task Running Time
--------------------------

The farmer likes to have more work than workers, in order to keep those workers
busy at all times. That means if you have tasks that might be different lengths
(e.g. MC and data, or different size datasets, etc.), it is very important to

1. put the longer tasks earlier in the list,
2. have a total run time that is longer than the longest job (preferably by a
   factor of 2 or more), and
3. request a number of cores that will be kept busy by your jobs.

For example, if you expect to have one 1-hour job and ten 5-minute jobs, you can
request two threads; one thread will process the 1-hour job and the other
thread will process all the 5-minute jobs. This relies on your ordering the task
list well: if you put the 1-hour job last, then the two threads will work
through all your 5-minute jobs in about 25 minutes, and then one will process
the 1-hour job while the other sits idle (and wastes CPU). This requires some
thought and care, but can save us significant numbers of hours, so please do
think carefully about what you're running!

Clean-up In Batch Jobs
----------------------

The farmer can be used in any queue at NERSC. One of the better options if some
work needs doing but is not urgent is to use the flex queue on KNL. When
submitting into that queue, one must add
:code:`--time-min=01:30:00 --time=10:00:00`, where the first is the minimum time
that the farmer should be run (it cannot be longer than 2 hours) and should be
longer than a typical command you need to execute.
The second is the total wall time for the job, which must be less than 12 hours.
Jobs in this queue will be started, killed, and restarted from checkpoints. Add
to your job script

.. code-block:: sh

    # requeueing the job if remaining time >0 (do not change the following 3 lines)
    . /usr/common/software/variable-time-job/setup.sh
    requeue_job func_trap USR1

in order to have the job automatically re-queued so that it will continue to
run. You should also add to your run script

.. code-block:: sh

    #SBATCH --signal=B:USR1@10

to give the job 10 seconds to handle the USR1 signal (it should not need that
long, but it helps in case there are multiple workers fighting for the same
lock). For the check-pointing, please also add these to your job script:

.. code-block:: sh

    # use the following three variables to specify the time limit per job (max_timelimit),
    # the amount of time (in seconds) needed for checkpointing,
    # and the command to use to do the checkpointing if any (leave blank if none)
    max_timelimit=12:00:00   # can match the #SBATCH --time option but don't have to
    ckpt_overhead=60         # should match the time in the #SBATCH --signal option
    ckpt_command=

Note that these are in addition to the usual sbatch specifications, and it is
quite important that they match.

Extra Memory
------------

If you have serious memory issues, then it is possible to enable swap space when
running in a full node queue (e.g. regular; this is not possible in the shared
queue). To do so, make a burst-buffer config file like:

.. code-block:: sh

    $ cat bb_swap.conf
    #DW jobdw capacity=160GB access_mode=striped type=scratch
    #DW swap 150GB

This uses the Cray `DataWarp configuration format`_. The second line is the
important one here; it provides 150 GB of swap space within the burst buffer.
The first line describes the scratch space reservation that your job needs, and
may be unnecessary or even problematic depending on where you write your inputs
and outputs for the job (think about what it's doing before sending the command
off to the queue). You can then add it to your job submission like:

.. code-block:: sh

    salloc ... --bbf=bb_swap.conf

This allocates space on the burst buffer (generally pretty fast) to be used as
swap space. Note that swap is quite a bit slower than main memory, and so this
option should be used with care. It is not, in principle, clever enough to
guarantee each job space in the main memory, so as long as swap is being used on
a node, all jobs on that node may be slowed down, depending on the memory
profile and usage of the offending job.

.. _`DataWarp configuration format`: https://pubs.cray.com/content/S-2558/CLE%206.0.UP05/xctm-series-datawarptm-user-guide/datawarp-job-script-command-examples
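
Returning to the ordering advice in "Equalize Task Running Time": the effect of
task order on total wall time can be checked with a small stand-alone
simulation. The sketch below is not part of PyTaskFarmer. It assumes a
simplified model in which each worker takes the next task from the list as soon
as it becomes free, and the names ``makespan`` and ``idle_minutes`` are
hypothetical helpers introduced only for this illustration.

```python
import heapq


def makespan(durations, workers):
    """Wall time (in minutes) until the last task finishes.

    Simplified model (an assumption, not PyTaskFarmer code): tasks are
    handed out in list order to whichever worker becomes free first.
    """
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def idle_minutes(durations, workers):
    """Worker-minutes reserved but not spent running tasks."""
    return workers * makespan(durations, workers) - sum(durations)


# One 1-hour task and ten 5-minute tasks on two workers.
long_first = [60] + [5] * 10
long_last = [5] * 10 + [60]

print(makespan(long_first, 2), idle_minutes(long_first, 2))  # 60 10
print(makespan(long_last, 2), idle_minutes(long_last, 2))    # 85 60
```

Under this model, putting the 1-hour task first finishes in 60 minutes, while
putting it last takes 85 minutes (25 minutes for the short tasks, then the long
one) and leaves one worker idle for the final hour.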