Running at NERSC
================
Tips and tricks for using PyTaskFarmer on NERSC machines (e.g. Cori).

You can use PyTaskFarmer as part of your top-level batch script for submissions
into the NERSC Slurm batch system. Examples of multi-core and multi-node jobs
are `available here`_.

.. _`available here`: https://docs.nersc.gov/jobs/examples/

Equalize Task Running Time
--------------------------

The farmer works best when there is more work than workers, so that every worker
stays busy at all times. If your tasks vary in length (e.g. MC and data, or
datasets of different sizes), it is very important to:

1. put the longer tasks earlier in the list,
2. have a total run time that is longer than the longest job (preferably by a
   factor of 2 or more) and
3. request a number of cores that will be kept busy by your jobs.
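
The first point can be automated when a rough run-time estimate is available for
each task. The following is a minimal sketch, not part of PyTaskFarmer: it
assumes a hypothetical two-column input (estimated minutes, a tab, then the
command) and prints only the commands, longest estimate first. The function name
and the :code:`./run.sh` commands are made up for illustration.

.. code-block:: sh

    # Hypothetical helper: read "<estimated minutes><TAB><command>" lines on
    # stdin and print the commands only, longest estimate first.
    order_longest_first() {
        sort -t "$(printf '\t')" -k1,1nr -s | cut -f2-
    }

    # Example with made-up tasks; the 60-minute task is printed first.
    printf '5\t./run.sh data_A\n60\t./run.sh mc_big\n5\t./run.sh data_B\n' \
        | order_longest_first

Redirecting the output to a file gives a task list that is already in a
sensible order.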

For example, if you expect to have one 1-hour job and ten 5-minute jobs, you can
request two threads: one thread will process the 1-hour job and the other will
process all the 5-minute jobs. This relies on ordering the task list well. If
you put the 1-hour job last, the two threads will work through all the 5-minute
jobs in about 25 minutes, and then one will process the 1-hour job while the
other sits idle (and wastes CPU), for a total of about 85 minutes instead of 60.
This requires some thought and care, but can save a significant number of CPU
hours, so please do think carefully about what you're running!
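
The effect of the ordering can be checked before submitting. The sketch below
uses a simplified model that is an assumption, not a description of the farmer's
internals: tasks are handed out in list order to whichever worker becomes free
first. The :code:`makespan` helper is hypothetical; it reads one duration per
line and prints the resulting wall time.

.. code-block:: sh

    # Estimate the wall time when tasks are handed out in list order to the
    # worker that frees up first. Usage: makespan <workers> < durations
    makespan() {
        awk -v n="$1" '
            BEGIN { for (i = 1; i <= n; i++) busy[i] = 0 }
            {
                # pick the worker that becomes free first
                w = 1
                for (i = 2; i <= n; i++) if (busy[i] < busy[w]) w = i
                busy[w] += $1
            }
            END {
                m = 0
                for (i = 1; i <= n; i++) if (busy[i] > m) m = busy[i]
                print m
            }'
    }

    # 60-minute task first, then ten 5-minute tasks, two workers: prints 60
    { echo 60; for i in 1 2 3 4 5 6 7 8 9 10; do echo 5; done; } | makespan 2
    # same tasks with the 60-minute task last: prints 85
    { for i in 1 2 3 4 5 6 7 8 9 10; do echo 5; done; echo 60; } | makespan 2

Trying a few worker counts with the same durations also helps with the third
point, since it shows where adding workers stops shortening the wall time.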

Clean-up In Batch Jobs
----------------------
The farmer can be used in any queue at NERSC. If some work needs doing but is
not urgent, one of the better options is the flex queue on KNL. When submitting
into that queue, one must add
:code:`--time-min=01:30:00 --time=10:00:00`. The first option is the minimum
time for which the farmer should run; it cannot be longer than 2 hours and
should be longer than a typical command you need to execute. The second is the
total wall time for the job, which must be less than 12 hours. Jobs in this
queue will be started, killed, and restarted from checkpoints.

Add to your job script

.. code-block:: sh

    # requeueing the job if remaining time >0 (do not change the following two lines)
    . /usr/common/software/variable-time-job/setup.sh
    requeue_job func_trap USR1

in order to have the job automatically re-queued so that it will continue to
run. You should also add to your run script

.. code-block:: sh

    #SBATCH --signal=B:USR1@10

This gives the job 10 seconds to handle the USR1 signal (it should not need that
long, but there may be multiple workers fighting for the same lock). For the
checkpointing, please also add these to your job script:

.. code-block:: sh

    # use the following three variables to specify the time limit per job (max_timelimit), 
    # the amount of time (in seconds) needed for checkpointing, 
    # and the command to use to do the checkpointing if any (leave blank if none)
    max_timelimit=12:00:00   # can match the #SBATCH --time option but does not have to
    ckpt_overhead=10         # should match the time in the #SBATCH --signal option
    ckpt_command=

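Note that these are in addition to the usual sbatch specifications, and it is
quite important that the two agree; in particular, :code:`ckpt_overhead` should
equal the number of seconds given in the :code:`--signal` option.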
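
Put together, the pieces above might be arranged as in the following sketch. The
ordering is an assumption, as are the :code:`--qos`, :code:`--constraint` and
:code:`--nodes` lines, which do not appear in this section and should be checked
against the current NERSC documentation. The checkpoint overhead is set to agree
with the signal lead time, as the comment in the variables block asks.

.. code-block:: sh

    #!/bin/bash
    #SBATCH --qos=flex              # assumption: QOS name of the flex queue
    #SBATCH --constraint=knl        # assumption: constraint selecting KNL nodes
    #SBATCH --nodes=1               # assumption: a single-node job
    #SBATCH --time-min=01:30:00     # minimum time the farmer should run
    #SBATCH --time=10:00:00         # total wall time for the job
    #SBATCH --signal=B:USR1@10      # USR1 to the batch shell 10 s before the limit

    max_timelimit=12:00:00
    ckpt_overhead=10                # agrees with the @10 in --signal
    ckpt_command=

    # requeue the job if there is time remaining
    . /usr/common/software/variable-time-job/setup.sh
    requeue_job func_trap USR1

    # Launch the farmer here; its command line is not covered in this section.

All :code:`#SBATCH` lines come before the first executable command, since Slurm
stops reading directives once the script body starts.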

Extra Memory
------------

If you have serious memory issues, then it is possible to enable swap space when
running in a full node queue (e.g. regular; this is not possible in the shared
queue). To do so, make a burst-buffer config file like:

.. code-block:: sh

    $ cat bb_swap.conf
    #DW jobdw capacity=160GB access_mode=striped type=scratch
    #DW swap 150GB

This uses the Cray `DataWarp configuration format`_. The second line is the
important one here; it provides 150 GB of swap space within the burst buffer.
The first line describes the scratch space reservation that your job needs, and
may be unnecessary or even problematic depending on where you write your inputs
and outputs for the job (think about what it's doing before sending the command
off to the queue). You can then add it to your job submission like:

.. code-block:: sh

    salloc ... --bbf=bb_swap.conf

This allocates space on the burst buffer (generally pretty fast) to be used as
swap space. Note that swap is quite a bit slower than main memory, so this
option should be used with care. The system does not guarantee each job space in
main memory, so as long as swap is being used on a node, all jobs on that node
may be slowed down, depending on the memory profile and usage of the offending
job.
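
To see whether a job is actually dipping into swap, the standard Linux
:code:`/proc/meminfo` counters can be read from inside the job. The helper below
is a hypothetical sketch, not something provided by NERSC or PyTaskFarmer; it
prints the swap in use in kB.

.. code-block:: sh

    # Print swap currently in use, in kB, from a /proc/meminfo-style file
    # (defaults to /proc/meminfo on the node where it runs).
    swap_used_kb() {
        awk '/^SwapTotal:/ { total = $2 }
             /^SwapFree:/  { free = $2 }
             END { print total - free }' "${1:-/proc/meminfo}"
    }

    if [ -r /proc/meminfo ]; then
        swap_used_kb
    fi

A value that keeps growing while tasks run suggests the node is relying on swap,
and that fewer concurrent workers may be the better fix.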


.. _`DataWarp configuration format`: https://pubs.cray.com/content/S-2558/CLE%206.0.UP05/xctm-series-datawarptm-user-guide/datawarp-job-script-command-examples