Condor

All of the PCs within the ARC run software from the Condor project, which allows spare CPU cycles to be used for other tasks, in a manner similar to the well-known SETI@Home project. Full documentation on the Condor software can be found here. You should at least read the User Guide chapter before going any further!

Condor lets you split a massive job up into smaller chunks that can be run on multiple machines with ease. If you have a large number of models to run, you simply queue them all up and Condor decides which of the available machines is the best one to run each model on, starts the jobs, and notifies you when they are done. All you need to do is tell Condor what needs to be run, and it works out the 'where' and 'when', which greatly reduces the administrative overhead. The software can run any task that will operate from a simple batch script, so tasks such as data reduction can be handled too.

There are, of course, a few catches:

  • First, running Condor efficiently requires you to be organised...
  • Jobs get run when there is spare CPU time so you can't force something to run right away. However, in practice we have sufficient CPU power available to get most jobs done in a reasonable time frame. The software tries to ensure that all users have equitable use of resources averaged over time.
  • We're somewhat limited by network bandwidth, so it's not a good idea to try to access lots of files over the LAN - you'll only saturate your own machine. The best way to do that sort of job is to write a script to copy the files to, say, /tmp on the remote PC and operate on them there, and copy the results back when finished. As long as you keep to a few hundred MB of files you should be fine.
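
The copy-in, run, copy-out pattern from the last point can be sketched in plain sh as below. This is only an illustration under stated assumptions: run_in_scratch and its arguments are made-up names rather than anything Condor provides, and it assumes the job writes a single, known result file.

```shell
#!/bin/sh
# Sketch only: stage inputs into a scratch directory on the local disk,
# run the job there, then copy the result back.
# run_in_scratch and its arguments are hypothetical names, not part of Condor.

run_in_scratch() {
    src_dir=$1      # directory holding the input files
    out_name=$2     # name of the result file the command writes
    shift 2         # the remaining arguments are the command to run

    # mktemp -d makes a uniquely named directory, so two jobs on one PC
    # cannot collide
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/condor_job.XXXXXX") || return 1

    (
        # copy the inputs once, then do all further I/O on the local disk
        cp -R "$src_dir"/. "$scratch"/ || exit 1
        cd "$scratch" || exit 1
        "$@" || exit 1
        # copy only the result back over the network
        cp "$out_name" "$src_dir"/ || exit 1
    )
    status=$?

    rm -rf "$scratch"   # tidy up whether or not the run succeeded
    return $status
}
```

It might be called as run_in_scratch /home/rsir/models/run_1234 output01.dat ./code01.exe, and it returns a non-zero status if the copy or the run fails.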
The following notes explain the specifics of the ARC setup.

The Machines

There are four groups of machines on the ARC LAN, and these divisions are distinguished by Condor via a custom ClassAd (see below). These groups are:
  • NLTE cluster - a group of Pentium 4 systems dedicated to Condor work. Nominally reserved for the non-LTE modelling effort.
  • WASP cluster - a set of mainly dual Xeon workstations, used mostly for analysing SuperWASP data. It also includes two single-CPU machines.
  • Atomic cluster - a set of 12 Pentium 4 machines for use by the Atomic Physics team. These are firewalled from the rest of the LAN to allow for extended run times without reboots - these systems are not available to general users.
  • General cluster - all the rest of the machines on the LAN (excepting the servers). This is a mix of 1.2GHz Athlons and 2-3GHz Pentium 4s. Since these are people's regular workstations, jobs do not run continuously on them - they are only available outside of regular office hours (defined as 09:00-19:00, Mon-Fri), and jobs will automatically be suspended if the user begins to type at the keyboard. Jobs also run at low priority so as not to obstruct the workstation owners' own batch jobs.

Local ClassAds

Two additional Condor ClassAds are available on the ARC machines (not on the Atomic cluster though). The first of these will let you specify which cluster of machines the job is to be submitted to. The second allows you to specify if a job needs a Pentium 4 processor or not (if you have compiled your code with specific Pentium options then it won't run on the Athlons).
To specify the cluster you want, add a line like this to your Condor submission file:

requirements = (Cluster == "GENERAL")

In this case the submission would be to the General cluster. The other options are NLTE and WASP, which should be self-explanatory. If you don't specify a cluster then the Condor software will route the job automatically, but please don't rely on this: always specify one! Note that the cluster name is case-sensitive, and all cluster names are in CAPS.
If using the General cluster then you'll likely want to specify the processor type to use:

requirements = (CpuIsP4 == TRUE)

This will only let the jobs execute on a Pentium 4 system. If the CPU type does not matter then omit this requirement entirely.
The requirements can be combined like this:

requirements = ((CpuIsP4 == TRUE) && (Cluster == "GENERAL"))
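
Since the cluster names are case-sensitive, it is easy to mistype one. As a rough sketch, a small sh helper could build the line for you and refuse unknown names. The function name cluster_requirements is invented for illustration; only the Cluster and CpuIsP4 attributes come from the ARC setup described above.

```shell
#!/bin/sh
# Sketch only: cluster_requirements is a hypothetical helper, not a Condor tool.
# It prints a requirements line for an ARC cluster, optionally restricted to
# Pentium 4 machines.

cluster_requirements() {
    cluster=$1          # must be GENERAL, NLTE or WASP (upper case)
    need_p4=${2:-no}    # "yes" adds the CpuIsP4 test

    case $cluster in
        GENERAL|NLTE|WASP) ;;
        *) echo "unknown cluster: $cluster (names are upper case)" >&2
           return 1 ;;
    esac

    if [ "$need_p4" = yes ]; then
        printf 'requirements = ((CpuIsP4 == TRUE) && (Cluster == "%s"))\n' "$cluster"
    else
        printf 'requirements = (Cluster == "%s")\n' "$cluster"
    fi
}
```

For example, cluster_requirements GENERAL yes prints the combined line shown above, which can then be pasted into a submission file.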

Installation

The Condor software runs automatically on all machines. All you need to do is to add the directory /home/condor/bin/ to your PATH environment variable.
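
A minimal sketch of the PATH change follows, with a quick check that the main Condor commands can then be found. It is written for sh; the tcsh equivalent is given in a comment. The helper name missing_condor_tools is made up for this example.

```shell
#!/bin/sh
# Sketch only: missing_condor_tools is a hypothetical helper name.
# It prints each Condor command that cannot be found on the current PATH.

missing_condor_tools() {
    for tool in condor_submit condor_status condor_q condor_rm; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Put the Condor directory at the front of PATH.
# In tcsh the same thing is done with:
#   setenv PATH /home/condor/bin:${PATH}
PATH=/home/condor/bin:$PATH
export PATH
```

Running missing_condor_tools afterwards should print nothing if the directory was added correctly. To make the change permanent, the PATH line would normally go in your shell start-up file.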

Example Use of Condor

The standard use for Condor will be to explore a parameter space by running a model code for a grid of input values. In ARC, we use Condor in its Vanilla mode, which lets you submit shell scripts and (statically compiled) binaries for execution. Submitting jobs takes a little thought as to directory structures, etc, but this is not difficult and indeed enforces a useful discipline. The general pattern of a job submission will be like this:
  1. Prepare binaries, input files, and job shell script
  2. Prepare a Condor job submission script
  3. Submit the condor job using condor_submit
Considering these steps in more detail:
  1. Preparation of binaries, etc

    Compile your code as normal, remembering to choose a static compilation via the appropriate linker options. Generate your input files, and then write a simple shell script to run the code(s). When writing the latter you need to remember that while your file system structures are available all over the clusters, if you are doing a lot of disk accesses then it's best to have the script make a directory under /tmp, copy what files it needs to that directory, run the job(s), then copy the files out before deleting the temporary directory. Below is an outline of such a script:


    #!/bin/tcsh
    # script to run jobs on local condor machines.
    
    # print name of system we're on - useful to track problems
    
    /bin/hostname -s
    
    # make temp working dir
    # makes a directory called rsir_tmp.xx where xx is the process ID (PID)
    # number, which is sufficiently random for this purpose
    
    set THEDIR = rsir_tmp.$$
    cd /tmp
    mkdir $THEDIR
    cd $THEDIR
    
    # make symlinks, copy files, etc, as needed.
    # define a ROOT directory for shorthand, assume most/all input files are
    # located in this directory
    
    set ROOT = /home/rsir/models/run_1234
    
    ln -s $ROOT/qubdata .
    ln -s $ROOT/moredata .
    
    cp $ROOT/code01.exe .
    cp $ROOT/code02.exe .
    cp $ROOT/input0[12].inp .
    
    # now run the codes
    # note no need to redirect output to file using > since condor will do this
    
    date # keep eye on how long this takes
    ./code01.exe
    
    # move output around for second part of run
    
    cp output01.dat fort.18
    
    date
    ./code02.exe
    
    date # print end date
    
    # copy files back to the root directory
    
    cp output01.dat output02.dat $ROOT
    cd ..
    rm -rf $THEDIR         # all gone!
    

    This is a very simple script, and in principle any shell script can be submitted to Condor. My own scripts typically have parameters passed into them to allow me to specify models that way, rather than having to code each script by hand. Here is the beginning of one of my own scripts which uses this approach.


    #!/bin/tcsh
    # script to run TLUSTY jobs on local condor machines.
    # pass three params
    # root directory, model root names, nonstandard param filename
    
    # print name of system we're on
    
    /bin/hostname -s
    
    # get names of model root and non-standard params file
    
    set ROOT = $1
    set MODS = $2
    set NSTD = $3
    

  2. The condor submission script

    The Condor submission scripts are quite straightforward: they tell Condor what job to run, where to put any output, and, optionally, what constraints to place on the machines that run the job (e.g. only machines with 2GB RAM). A review of the Condor Manual is recommended! A simple Condor submission script looks like this:


    ####################
    ##
    ## Simple TLUSTY condor file
    ##
    ####################
    
    universe        = vanilla
    executable      = /home/rsir/models/run_1234/runjob.tcsh
    output          = 000.output
    error           = 000.errors
    log             = 000.log
    initialdir      = /home/rsir/models/run_1234/
    notification    = Error
    
    queue
    

    What this all means:

    • The first line tells Condor that this is a regular executable, not one compiled using the special Condor libraries. Just leave this in your run files unchanged.
    • The second line points to the shell script that runs your job. It's best to specify the full path, though a relative path should also work.
    • The next three lines tell Condor where to put 'screen' output from the runs. This means there is no need to redirect output within the script itself, just let Condor deal with it, and then you can monitor the progress of your job by reading these files. If you don't specify a full path here they are placed in the same directory as the run file - this is a sensible default. The output and error files trap output normally sent to STDOUT and STDERR. The log file contains output from Condor itself.
    • The initialdir argument tells Condor where to 'start' the job from. It's sensible to make this the same as the directory all your other files are in. This parameter is not mandatory, but is probably best declared explicitly for safety.
    • Finally, the notification line tells Condor when to mail you status updates. By default this happens every time a job finishes; setting it like this only emails you if there was a problem. Note that email does not go to the QUB email servers: it stays in the local .mail folder, so you need to make arrangements to check that as needed - KMail will handle this quite easily.
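
For a grid of models run through a parameterised job script, one submission file can hold several queue statements, each preceded by its own arguments line. The sh sketch below generates such a file. It assumes the job script takes three parameters (root directory, model name, non-standard parameter file), and the helper name write_grid_submit and the model and file names in the usage line are invented for illustration.

```shell
#!/bin/sh
# Sketch only: write_grid_submit is a hypothetical helper, not a Condor tool.
# It prints a submission file with one "queue" statement per model name,
# passing "root model nonstd" to the job script as its three arguments.

write_grid_submit() {
    root=$1       # directory holding the job script and input files
    script=$2     # job script name inside $root
    nonstd=$3     # non-standard parameter file name
    shift 3       # the remaining arguments are model names

    # settings shared by every job in the file
    cat <<EOF
universe        = vanilla
executable      = $root/$script
initialdir      = $root
notification    = Error
EOF

    # per-model settings; each "queue" submits one job with the values
    # that are current at that point in the file
    for model in "$@"; do
        cat <<EOF

arguments       = $root $model $nonstd
output          = $model.output
error           = $model.errors
log             = $model.log
queue
EOF
    done
}
```

For example, write_grid_submit /home/rsir/models/run_1234 runjob.tcsh nst.dat m01 m02 m03 > grid.cmd writes a file describing three jobs, which can then be passed to condor_submit. The sketch does not handle names containing spaces.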

  3. Submitting the job

    When all is ready, you simply submit the job to Condor using the command condor_submit myrun.cmd where myrun.cmd is the name of the Condor command file (which can be called anything you like, but .cmd is a conventional suffix). A full discussion of use of Condor is beyond this guide, and since the full manual is installed on the website you are directed there for more information. Chapter 2 - the User Guide - is most helpful. However, a quick list of commands that may be of use:

    • condor_submit - submit a job
    • condor_status - show the status of the entire cluster
    • condor_q - show the list of all your jobs
    • condor_rm - delete a job

Last updated Tuesday January 17, 2006