Software Design

uap is designed as a plugin architecture. The plugins are internally called steps. Two types of steps exist: source steps and processing steps. Source steps bring data from outside the destination path (see the Destination_path section) into the analysis. Processing steps are blueprints: each one describes a single kind of data processing, while uap itself controls the order in which the plugged-in steps are executed. Steps are organized in a dependency graph (a directed acyclic graph): every step may have one or more parent steps, which may in turn have parent steps of their own, and so on. Steps without parents are usually sources, which provide source files such as FASTQ files with the raw sequences obtained from the sequencer, genome sequence databases, or annotation tracks.
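The execution order implied by such a dependency graph can be sketched in a few lines of Python. This is a minimal illustration and not uap's actual scheduler: the step names are borrowed from the status example later in this section, and the parent relationships between them are assumed purely for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each step maps to the set of its parent steps.
# The parent relationships are assumptions, not taken from a real uap config.
parents = {
    "M_genitalium_genome": set(),  # source step: no parents
    "fasta_index": {"M_genitalium_genome"},
    "bowtie2_index": {"M_genitalium_genome"},
    "bwa_index": {"M_genitalium_genome"},
    "segemehl_index": {"M_genitalium_genome"},
}

# static_order() yields each step only after all of its parents and raises
# graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(parents).static_order())

# Steps without parents are the candidates for source steps.
sources = sorted(step for step, p in parents.items() if not p)

print(order)
print(sources)
```

In this sketch the source step always comes first, and the order among the four index steps is not significant because none of them depends on another.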

Each step defines a number of runs, and each run represents a piece of the entire data analysis, typically at the level of a single sample. A particular run of a particular step is called a task. While the steps describe what needs to be done only on a very abstract level, it is through the individual runs of each step that a uap-wide list of actual tasks becomes available. Each run may provide a number of output files, which depend on output files of one or several runs from parent steps.
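The relationship "task = one run of one step" can be made concrete with a small Python sketch. The Task class, the list_tasks helper, and the step and run names below are all hypothetical and are not part of uap; they only illustrate how a global task list falls out of the per-step runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One run of one step (hypothetical model, not uap's own class)."""
    step: str
    run: str

    def __str__(self):
        return f"{self.step}/{self.run}"


def list_tasks(runs_per_step):
    """Flatten a {step: [run, ...]} mapping into a list of tasks."""
    return [Task(step, run)
            for step, runs in runs_per_step.items()
            for run in runs]


# Hypothetical pipeline: two steps, each with one run per sample.
runs_per_step = {
    "fastq_source": ["sample_A", "sample_B"],
    "align": ["sample_A", "sample_B"],
}

tasks = list_tasks(runs_per_step)
for task in tasks:
    print(task)
```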

To make the relationship between tasks, steps, and runs clearer, consider an example configuration file. The output of the status request

uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status

is

Waiting tasks
-------------
[w] bowtie2_index/Mycoplasma_genitalium_index-download
[w] bwa_index/Mycoplasma_genitalium_index-download
[w] fasta_index/download
[w] segemehl_index/Mycoplasma_genitalium_genome-download

Ready tasks
-----------
[r] M_genitalium_genome/download

 tasks: 5 total, 4 waiting, 1 ready

Five tasks are listed here. The first one is "bowtie2_index/Mycoplasma_genitalium_index-download". The part before the slash, "bowtie2_index", is the step, which is defined in the configuration file. The part after the slash, "Mycoplasma_genitalium_index-download", is the specific run.
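The decomposition of a task ID can be sketched in Python as follows. Both helper functions are hypothetical and not part of uap. The sketch assumes that the step name never contains a slash, so the ID is split at the first slash, and that a status line has the form "[marker] step/run" as in the output above.

```python
def split_task_id(task_id):
    """Split 'step/run' at the first slash into (step, run)."""
    step, sep, run = task_id.partition("/")
    if not (sep and step and run):
        raise ValueError(f"not a task ID: {task_id!r}")
    return step, run


def parse_status_line(line):
    """Parse a line such as '[w] fasta_index/download' into (state, step, run)."""
    marker, task_id = line.split(maxsplit=1)
    return (marker.strip("[]"), *split_task_id(task_id.strip()))


step, run = split_task_id("bowtie2_index/Mycoplasma_genitalium_index-download")
print(step)  # bowtie2_index
print(run)   # Mycoplasma_genitalium_index-download
```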

Source steps define a run for every input sample, and a subsequent step may:

  • define the same number of runs,
  • define more runs (for example when R1 and R2 reads in a paired-end RNA-Seq experiment should be treated separately),
  • define fewer runs (usually towards the end of a pipeline, where results are summarized).
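
The three cases above can be illustrated with a short Python sketch. The sample names and the run naming scheme (such as the "-R1" suffix and the "summary" run) are assumptions made for illustration and do not reflect how uap actually names runs.

```python
# Hypothetical runs defined by a source step: one per input sample.
source_runs = ["sample_A", "sample_B", "sample_C"]

# Same number of runs: the step processes each sample independently.
same_runs = list(source_runs)

# More runs: R1 and R2 reads of each sample are treated separately.
more_runs = [f"{run}-{read}" for run in source_runs for read in ("R1", "R2")]

# Fewer runs: a summarizing step collects all samples into a single run.
fewer_runs = ["summary"]

print(len(source_runs), len(same_runs), len(more_runs), len(fewer_runs))
```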