Configuration File

uap operates on YAML files which define data analysis. These files are called configuration files.

A configuration file describes a analysis completely. Configurations consist of four sections (let’s just call them sections, although technically, they are keys):

  • destination_path – points to the directory where the result files, annotations and temporary files are written to
  • email – when submitting jobs on a cluster, messages will be sent to this email address by the cluster engine (nobody@example.com by default)
  • constants – defines constants for later use (define repeatedly used values as constants to increase readability of the following sections)
  • steps – defines the source and processing steps and their order
  • tools – defines all tools used in the analysis and how to determine their versions (for later reference)

If you want to know more about the notation that is used in this file, have a closer look at the YAML definition.

Sections of a Configuration File

Destination_path Section

The value of destination_path is the directory where uap is going to store the created files.

destination_path: "/path/to/uap/output"

Email Section

The value of email is needed if the analysis is executed on a cluster, which can use it to inform the person who started uap about status changes of submitted jobs.

email: "your.name@mail.de"

Steps Section

The steps section is the core of the analysis file, because it defines when steps are executed and how they depend on each other. All available steps are described in detail in the steps documentation: Available steps. This section contains a key for every step, therefore each step must have a unique name [1]. There are two ways to name a step to allow multiple steps of the same type and still ensure unique naming:

steps:
    # here, the step name is unchanged, it's a cutadapt step which is also
    # called 'cutadapt'
    cutadapt:
        ... # options following

    # here, we also insert a cutadapt step, but we give it a different name:
    # 'clip_adapters'
    clip_adapters (cutadapt):
        ... # options following

There are two different types of steps:

Source Steps

They provide input files for the analysis. They might start processes such as downloading files or demultiplexing sequence reads. But, they do not have dependencies, they can introduce files from outside the destination path (see Destination_path Section), and they are usually the first steps of an analysis.

For example if you want to work with fastq files, the first step is to import the required files. For this task the source step fastq_source is the right solution.

A possible step definition could look like this:

steps:
    input_step (fastq_source):
    pattern: /Path/to/fastq-files/*.gz
    group: ([SL]\w+)_R[12]-00[12].fastq.gz
    sample_id_prefix: MyPrefix
    first_read: '_R1'
    second_read: '_R2'
    paired_end: True

The single keys will be described at Available steps. For defining the group key a regular expression is used. If you are not familiar with this you can read about it and test your regular expression at pythex.org.

Processing Steps

They depend upon one or more predecessor steps and work with their output files. Output files of processing steps are automatically named and placed by uap. Processing steps are usually configurable. For a complete list of available options please visit Available steps or use the subcommand steps.

Reserved Keywords for Steps

_depends:
Dependencies are defined via the _depends key which may either be null, a step name, or a list of step names.
steps:
    # the source step which depends on nothing
    fastq_source:
        # ...

    run_folder_source:
        # ...

    # the first processing step, which depends on the source step
    cutadapt:
        _depends: [fastq_source, run_folder_source]

    # the second processing step, which depends on the cutadapt step
    fix_cutadapt:
        _depends: cutadapt
_connect:
Normally steps connected with _depends do pass data along by defining so called connections. If the name of an output connection matches the name of an input connection of its succeeding step data gets passed on automatically. But, sometimes the user wants to force the connection of differently named connections. This can be done with the _connect keyword. A common usage is to connect downloaded data with a Processing Steps.
  steps:
      # Source step to download i.e. sequence of chr1 of some species
      chr1 (raw_url_source):
          ...

      # Download chr2 sequence
      chr2 (raw_url_source):
          ...

      merge_fasta_files:
          _depends:
              - chr1
              - chr2
          # Equivalent to:
          # _depends: [chr1, chr2]
          _connect:
              in/sequence:
                  - chr1/raw
                  - chr2/raw
          # Equivalent to:
          # _connect:
          #     in/sequence: [chr1/raw, chr2/raw]

The examples shows how the ``raw_url_source`` output connection ``raw`` is
connected to the input connection ``sequence`` of the ``merge_fasta_files``
step.
_BREAK:
If you want to cut off entire branches of the step graph, set the _BREAK flag in a step definition, which will force the step to produce no runs (which will in turn give all following steps nothing to do, thereby effectively disabling these steps):
steps:
    fastq_source:
        # ...

    cutadapt:
        _depends: fastq_source

    # this step and all following steps will not be executed
    fix_cutadapt:
        _depends: cutadapt
        _BREAK: true
_volatile:
Steps can be marked with _volatile: yes. This flag tells uap that the output files of the marked step are only intermediate results.
steps:
    # the source step which depends on nothing
    fastq_source:
        # ...

    # this steps output can be deleted if all depending steps are finished
    cutadapt:
        _depends: fastq_source
        _volatile: yes
        # same as:
        # _volatile: True

    # if fix_cutadapt is finished the output files of cutadapt can be
    # volatilized
    fix_cutadapt:
        _depends: cutadapt

If all steps depending on the intermediate step are finished uap tells the user that he can free disk space. The message is output if the status is checked and looks like this:

Hint: You could save 156.9 GB of disk space by volatilizing 104 output files.
Call 'uap <project-config>.yaml volatilize --srsly' to purge the files.

If the user executes the volatilize command the output files are replaced by placeholder files.

Tools Section

The tools section must list all programs required for the execution of a particular analysis. uap uses the information given here to check if a tool is available given the current environment. This is particularly useful on cluster systems were software might not always be loaded. Also, uap logs the version of each tool used by a step.

By default, version determination is simply attempted by calling the program without command-line arguments.

If a certain argument is required, specify it in get_version.

If a tool does not exit with code 0, you can find out which code is it. Execute the required command and after this type echo $? in the same shell. The output is the exit code of the last executed command. You can use it to specify the exit code in exit_code.

tools:
    # you don't have to specify a path if the tool can be found in $PATH
    cat:
        path: cat
        get_version: --version
    # you have to specify a path if the tool can not be found in $PATH
    some-tool:
        path: /path/to/some-tool
        get_version: --version

If you are working on a cluster running UGE or SLURM you can also use their module system. You need to know what actually happens when you load or unload a module:

$ module load <module-name>
$ module unload <module-name>

As far as I know is module neither a command nor an alias. It is a BASH function. So use declare -f to find out what it is actually doing:

$ declare -f module

The output should look like this:

module ()
    {
        eval `/usr/local/modules/3.2.10-1/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
    }

An other possible output is:

module ()
    {
        eval $($LMOD_CMD bash "$@");
        [ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
    }

In this case you have to look in $LMOD_CMD for the required path:

$ echo $LMOD_CMD

Now you can use this newly gathered information to load a module before use and unload it afterwards. You only need to replace $MODULE_VERSION with the current version of the module system you are using and bash with python. A potential bedtools entry in the tools section, might look like this.

tools:
    ....
    bedtools:
        module_load: /usr/local/modules/3.2.10-1/Modules/3.2.10/bin/modulecmd python load bedtools/2.24.0-1
        module_unload: /usr/local/modules/3.2.10-1/Modules/3.2.10/bin/modulecmd python unload bedtools/2.24.0-1
        path: bedtools
        get_version: --version
        exit_code: 0

Note

Use python instead of bash for loading modules via uap. Because the module is loaded from within a python environment and not within a BASH shell.

Example Configurations

Please check out the example configurations provided inside the example-configurations folder of uap‘s installation directory.

[1]PyYAML does not complain about duplicate keys