Set up a Mini UNLOCK

The repositories and scripts described here create a local ‘mini’ UNLOCK environment, which makes it easier to execute the Common Workflows developed by UNLOCK on your own server. A local iRODS instance is not required, but the environment is limited to workflow execution, so no sophisticated data management is available. If you need that, please contact us to discuss the options.

Requirements

  • Docker

  • (Fast) internet connection

  • Sufficient disk space (approximately 1 terabyte)
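The requirements above can be checked before anything is downloaded. The following is a minimal sketch and not part of the UNLOCK scripts: the helper name has_free_space, the variable CHECK_DIR and the threshold of roughly 1 TB are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the UNLOCK tooling.

# Succeeds when the filesystem holding directory $1 has at least $2 KiB available.
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$2" ]
}

# Set CHECK_DIR to an existing directory on the disk you plan to use for storage.
CHECK_DIR="${CHECK_DIR:-$HOME}"
REQUIRED_KB=1000000000    # roughly 1 TB, expressed in 1024-byte blocks

if command -v docker >/dev/null 2>&1; then
    echo "docker: found"
else
    echo "docker: not found, install Docker first"
fi

if has_free_space "$CHECK_DIR" "$REQUIRED_KB"; then
    echo "disk space: OK"
else
    echo "disk space: less than about 1 TB available for $CHECK_DIR"
fi
```

The script only reports what it finds and changes nothing, so it is safe to run repeatedly.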

Preparing your environment

To prepare the environment we have developed a Docker container that contains the major dependencies needed for workflow execution. These include a workflow execution program, dependency managers such as conda, and more generic dependencies such as python, java, curl and wget, which might not always be available on your server.

  • Access the server you want to execute the workflows on.

  • To prepare your environment, start the UNLOCK Docker container using the command below.

    • Make sure you change STORAGE_DIR to a location of your choice (e.g. ~/unlock/).

docker run -v /STORAGE_DIR:/unlock docker-registry.wur.nl/unlock/docker:cwl /scripts/setup_workflow.sh

For example:

docker run -v ~/unlock:/unlock docker-registry.wur.nl/unlock/docker:cwl /scripts/setup_workflow.sh
Cloning into '/cwl/setup.tmp'...
#############################################
 setup file not found!
The following setup files are available:
all
workflow_illumina_quality
workflow_metagenomics_binning
workflow_nanopore_assembly
workflow_nanopore_quality
#############################################

As the output shows, dependencies and databases can be prepared for several workflows. To prepare your environment for a specific workflow (e.g. workflow_nanopore_assembly), pass its name as an argument:

docker run -v ~/unlock:/unlock docker-registry.wur.nl/unlock/docker:cwl /scripts/setup_workflow.sh workflow_nanopore_assembly

Cloning into '/cwl/setup.tmp'...
Transferring file `a_sample_mt.sh'
Transferring file `addadapters.sh'
Transferring file ...

As you can see, the script starts transferring files from the UNLOCK facility to your server. Go get some coffee, as this might take a while…

After the binaries have been transferred, the script prepares virtual environments for the different applications. This might take another while…

Once the installation has finished, your locally mounted folder should contain all the files needed to execute the workflow you selected.
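The setup call can be wrapped in a small helper that rejects a mistyped setup name before Docker is started. This is a sketch under assumptions: the function name setup_command and the variables STORAGE_DIR, IMAGE and KNOWN_SETUPS are illustrative, and the list of names is copied from the output shown earlier, so it may differ for other versions of the image. The helper prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper; prints the setup command so it can be reviewed first.

STORAGE_DIR="${STORAGE_DIR:-$HOME/unlock}"
IMAGE="docker-registry.wur.nl/unlock/docker:cwl"

# Names listed by setup_workflow.sh when it is called without an argument.
KNOWN_SETUPS="all workflow_illumina_quality workflow_metagenomics_binning workflow_nanopore_assembly workflow_nanopore_quality"

# Print the docker command for setup name $1, or fail for an unknown name.
setup_command() {
    for name in $KNOWN_SETUPS; do
        if [ "$name" = "$1" ]; then
            echo "docker run -v $STORAGE_DIR:/unlock $IMAGE /scripts/setup_workflow.sh $1"
            return 0
        fi
    done
    echo "unknown setup file: $1" >&2
    return 1
}

setup_command workflow_nanopore_assembly
```

Once the printed command looks right it can be executed, for example with setup_command workflow_nanopore_assembly | sh. The sketch does not quote paths, so it assumes STORAGE_DIR contains no spaces.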

Workflow execution

For the following example we will be working with a demo dataset from CAMISIM hybrid time series data (https://zenodo.org/record/5155395#.Ysabyy8Rr0o). We will only be using timepoint 0 in this tutorial.

Please download the following files while you are working on the next section.

  • sample: G0_T0

  • group: 0

  • short_reads_1: https://zenodo.org/record/5155395/files/CAMISIM_Illumina_R1.anonymous.G0_T0.fq.gz

  • short_reads_2: https://zenodo.org/record/5155395/files/CAMISIM_Illumina_R2.anonymous.G0_T0.fq.gz

  • long_reads: https://zenodo.org/record/5155395/files/CAMISIM_Nanopore.G0_T0.fq.gz
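The three files can be fetched with curl. The sketch below assumes that the data directory is ~/my_data, as used in the docker commands of this tutorial; the names DATA_DIR, demo_urls and download_demo_data are hypothetical. Running the script only prints the URLs, and the download starts when download_demo_data is called.

```shell
#!/bin/sh
# Sketch: fetch the three demo files into the data directory.

DATA_DIR="${DATA_DIR:-$HOME/my_data}"    # assumed location of the input data
BASE_URL="https://zenodo.org/record/5155395/files"

# Print one download URL per line.
demo_urls() {
    for f in CAMISIM_Illumina_R1.anonymous.G0_T0.fq.gz \
             CAMISIM_Illumina_R2.anonymous.G0_T0.fq.gz \
             CAMISIM_Nanopore.G0_T0.fq.gz; do
        echo "$BASE_URL/$f"
    done
}

# Download every file; -L follows redirects and -C - resumes a partial download.
download_demo_data() {
    mkdir -p "$DATA_DIR"
    demo_urls | while read -r url; do
        curl -L -C - --fail -o "$DATA_DIR/$(basename "$url")" "$url"
    done
}

demo_urls    # review the list, then call download_demo_data
```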

Understanding the execution process

To execute a workflow directly we have created an execution script. The script takes the path to the CWL file, the path to a YAML file with the input parameters, and a flag that indicates whether the workflow provenance should be preserved.

To start the execution process the following command is advised:

docker run -v ~/unlock:/unlock -v ~/my_data:/data -v ~/my_output:/output docker-registry.wur.nl/unlock/docker:cwl execute.sh

As you can see, there are now three volume mounts instead of one. All three can point to the same folder on the host, but separate mounts are useful when your input files are in one location and you want to store the output files somewhere else.

Upon execution you should get the following message:

Usage: /usr/bin/execute.sh [-c <cwl file path>] [-y <yaml file path>] [-p <true|false>]

In the previous step we prepared the environment for the nanopore workflow, so let's use that for this example.

On the host the workflows are located in STORAGE_DIR/infrastructure/cwl/workflows/, but because that directory is mounted at /unlock in the container, the path becomes /unlock/infrastructure/cwl/workflows/. There you can find the workflow_nanopore_assembly.cwl file.

-c /unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl
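The translation from a host path to the path seen inside the container is a plain prefix replacement. The sketch below illustrates it; the function name to_container_path and the example directory /srv/unlock are hypothetical, and /unlock is the mount point used in the commands of this tutorial.

```shell
#!/bin/sh
# Sketch: map a host path below the mounted directory to its path in the container.

# $1 = host path, $2 = host directory that is mounted at /unlock
to_container_path() {
    host_root=${2%/}                 # drop one trailing slash, if present
    case "$1" in
        "$host_root"/*)
            rest=${1#"$host_root"}   # remove the host prefix
            echo "/unlock$rest"
            ;;
        *)
            echo "not below $host_root: $1" >&2
            return 1
            ;;
    esac
}

to_container_path /srv/unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl /srv/unlock
```

The same idea applies to the /data and /output mounts: only the part of the path below the mounted directory is preserved.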

The workflow requires input parameters, which are best provided through a YAML file. Create an empty file (e.g. CAMISIM_G0_T0.yaml) in your data directory so the Docker container can access it upon execution.

We always like to have provenance of the workflow, so we will also set -p to true. Together, the three arguments are:

-c /unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl
-y /data/CAMISIM_G0_T0.yaml
-p true

Combined into a single command:

docker run -v ~/unlock:/unlock -v ~/my_data:/data -v ~/my_output:/output docker-registry.wur.nl/unlock/docker:cwl execute.sh -c /unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl -y /data/CAMISIM_G0_T0.yaml -p true

An error will be thrown immediately…

c = /unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl
y = /data/CAMISIM_G0_T0.yaml
p = true

Please specify a 'destination' in your yaml file
For example: destination: /output/my_workflow_run

No destination is defined in your YAML file, so cwltool does not know where to place the output files. Add the line destination: /output/my_workflow_run to your YAML file and run the command again; the workflow should now start.

Loading might take a while, as cwltool checks that the workflow is intact and that the connections between all steps are valid. The warnings shown can be ignored for now.

A second error is then thrown because the required basecall_model parameter is missing.
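Whether the job file defines a destination can be checked before Docker is started. The sketch below uses a simple line-based match rather than a real YAML parser, so it only recognises a top-level destination key written on a single line. The names has_destination and JOB_FILE are hypothetical, and the default file location assumes the ~/my_data directory used above.

```shell
#!/bin/sh
# Sketch: line-based check for a non-empty top-level 'destination' key.

# Succeeds when job file $1 contains "destination: <value>" at the start of a line.
has_destination() {
    grep -Eq '^destination:[[:space:]]*[^[:space:]#]' "$1"
}

JOB_FILE="${JOB_FILE:-$HOME/my_data/CAMISIM_G0_T0.yaml}"

if [ ! -f "$JOB_FILE" ]; then
    echo "job file not found: $JOB_FILE"
elif has_destination "$JOB_FILE"; then
    echo "destination is set"
else
    echo "add a line such as: destination: /output/my_workflow_run"
fi
```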

ERROR Workflow error, try again with --debug for more information:
Invalid job input record:
unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl:122:3: Missing required input
                                                                          parameter 'basecall_model'

Other parameters are needed as well, such as the input files, an identifier and the references. A complete job file using the tutorial files listed at the beginning of this section looks like this:

destination: /output/my_workflow_run
basecall_model: r941_min_high
filter_references: 
 - /unlock/references/genomes/GCA_000/GCA_000001/GCA_000001405.28/GCA_000001405.28.fasta.gz
identifier: CAMISIM_G0_T0
threads: 25
memory: 25000
nanopore_reads:
 - /data/CAMISIM_Nanopore.G0_T0.fq.gz
illumina_forward_reads:
 - /data/CAMISIM_Illumina_R1.anonymous.G0_T0.fq.gz
illumina_reverse_reads:
 - /data/CAMISIM_Illumina_R2.anonymous.G0_T0.fq.gz

Here we have defined the destination of the workflow output, the basecall model needed for the nanopore sequencing data, the reference(s) to use for contamination filtering, an identifier that makes the output easier to recognise (useful when performing many different runs), the CPU and memory requirements and, last but not least, the input file(s). In this case we used a combination of nanopore and Illumina reads (the latter are used for binning).
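Because the job file refers to the inputs by their path inside the container, a mistyped file name only surfaces once the workflow is running. The sketch below lists the /data entries of a job file that are missing from the host data directory. It is a line-based approximation that only looks at list items starting with /data/, so entries under /unlock are ignored. The function names and the default locations are assumptions.

```shell
#!/bin/sh
# Sketch: report /data inputs from the job file that are missing on the host.

# Print every /data/... path that appears as a list item in job file $1.
data_paths() {
    sed -n 's|^[[:space:]]*-[[:space:]]*\(/data/[^[:space:]]*\).*$|\1|p' "$1"
}

# Print the host path of each input from job file $1 that is absent from directory $2.
missing_inputs() {
    data_paths "$1" | while read -r p; do
        host_path="$2/${p#/data/}"
        [ -e "$host_path" ] || echo "$host_path"
    done
}

DATA_DIR="${DATA_DIR:-$HOME/my_data}"          # host directory mounted at /data
JOB_FILE="${JOB_FILE:-$DATA_DIR/CAMISIM_G0_T0.yaml}"

if [ -f "$JOB_FILE" ]; then
    missing_inputs "$JOB_FILE" "$DATA_DIR"
fi
```

No output means that every /data input named in the job file was found.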

Execute the workflow with the command below; the log output that follows it shows the workflow starting up.

docker run -v ~/unlock:/unlock -v ~/my_data:/data -v ~/my_output:/output docker-registry.wur.nl/unlock/docker:cwl execute.sh -c /unlock/infrastructure/cwl/workflows/workflow_nanopore_assembly.cwl -y /data/CAMISIM_G0_T0.yaml -p true

INFO [workflow ] start
INFO [workflow ] starting step workflow_quality_nanopore
INFO [step workflow_quality_nanopore] start
INFO [workflow workflow_quality_nanopore] start
...

This might take a while to finish…
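Since a run can take a long time, it helps to assemble the command from a few variables and keep a log of the output. The sketch below only prints the command. The names UNLOCK_DIR, DATA_DIR, OUTPUT_DIR and execute_command are hypothetical, and the default directories are the ones used in the first execute.sh example of this tutorial.

```shell
#!/bin/sh
# Sketch: build the execute.sh command from variables and print it for review.

UNLOCK_DIR="${UNLOCK_DIR:-$HOME/unlock}"       # mounted at /unlock
DATA_DIR="${DATA_DIR:-$HOME/my_data}"          # mounted at /data
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/my_output}"    # mounted at /output
IMAGE="docker-registry.wur.nl/unlock/docker:cwl"

# $1 = workflow name without the .cwl extension, $2 = job file name inside /data
execute_command() {
    echo "docker run -v $UNLOCK_DIR:/unlock -v $DATA_DIR:/data -v $OUTPUT_DIR:/output $IMAGE execute.sh -c /unlock/infrastructure/cwl/workflows/$1.cwl -y /data/$2 -p true"
}

execute_command workflow_nanopore_assembly CAMISIM_G0_T0.yaml
```

To keep the run alive after logging out and to capture its output, the printed command can be started with nohup, for example nohup sh -c "$(execute_command workflow_nanopore_assembly CAMISIM_G0_T0.yaml)" > run.log 2>&1 &, where run.log is an arbitrary file name. The sketch assumes that none of the directories contain spaces.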