How to set up and use the UNLOCK computational workflows.
A CWL runner
Depending on your workflow and inputs, a (powerful) server
For our workflows we use the reference runner cwltool. Other CWL runners exist, but we have not yet tested any of them.
See the common-workflow-language project for cwltool installation instructions.
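Before continuing, it can help to confirm that the runner is actually available. The following is a minimal sketch, assuming cwltool was installed into the currently active environment (pip is one common route); the variable name runner_status is only illustrative.

```shell
# Check whether cwltool is on the PATH.
# Assumption: cwltool was installed into the active environment, e.g. with pip.
if command -v cwltool >/dev/null 2>&1; then
    runner_status="found"
    cwltool --version
else
    runner_status="missing"
    echo "cwltool not found; one common way to install it is: pip install cwltool"
fi
echo "cwltool: ${runner_status}"
```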
UNLOCK CWL git repository
Clone our full CWL workflow repository:
git clone https://gitlab.com/m-unlock/cwl.git
Understanding the CWL file structure
Inside the repository there is a folder called “workflows”. It contains multi-step workflows that combine the tools (in the “tools” folder); these are the workflows we publish on WorkflowHub. For example, the one we are going to test here: https://workflowhub.eu/workflows/367
Note: the structure of the folders should NOT be changed, because the workflows internally call the CWL files by relative location.
Try to start one of the workflows:
cwltool cwl/workflows/workflow_metagenomics_assembly.cwl --help
(You may see a number of warnings; these can be ignored.)
This lists all possible input options of the workflow. A more detailed explanation of the workflows and their use cases will be available soon.
While you can start a workflow run directly with its input arguments, it is usually more reusable and readable to define your inputs in a YAML file. More on this in the next section.
Executing a CWL workflow
For a test run you can execute the workflow cwl/workflows/workflow_metagenomics_assembly.cwl with a test YAML file. We will use a YAML file that is present in the test folder: tests/assembly/hybrid_small.yaml. Its content:
identifier: hybrid_TEST
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: true
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/long_reads_high_depth.fastq.gz
illumina_forward_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_1.fastq.gz
illumina_reverse_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_2.fastq.gz
filter_references:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/human_small.fa.gz
identifier: Used for naming the output files and folders
memory: Used in tools that have a specific memory option. (So not a general limit)
threads: Number of CPU threads to use in tools that have a multithreading option.
run_spades: Run SPAdes assembler (hybrid)
run_flye: Run Flye assembler
binning: When true, this will start the binning workflow
metagenome: Whether the sample is a metagenome; this influences assembler behaviour
run_medaka: Run medaka ONT assembly polishing
ont_basecall_model: Needed for medaka
nanopore and illumina: Local path to read files or accessible via http (like this example). Can be multiple.
The order in which you write any of the parameters does not matter.
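As an illustration of the parameters above, the following sketch writes a job file that points to local read files instead of URLs, with two nanopore files to show that a list can hold several entries. The file name my_sample.yaml, the identifier, and all /data/... paths are hypothetical placeholders; the remaining values are copied from the test file and may need adjusting for real data.

```shell
# Write a job file for local data. Every path below is a placeholder.
cat > my_sample.yaml <<'EOF'
identifier: my_sample
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: true
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
  - class: File
    path: /data/my_sample/long_reads_run1.fastq.gz
  - class: File
    path: /data/my_sample/long_reads_run2.fastq.gz
illumina_forward_reads:
  - class: File
    path: /data/my_sample/short_reads_1.fastq.gz
illumina_reverse_reads:
  - class: File
    path: /data/my_sample/short_reads_2.fastq.gz
filter_references:
  - class: File
    path: /data/references/human.fa.gz
EOF
echo "Wrote my_sample.yaml"
```

Such a file can then be passed to cwltool in place of the test YAML file.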
cwltool --outdir assembly_test cwl/workflows/workflow_metagenomics_assembly.cwl cwl/tests/assembly/hybrid_small.yaml
This is a very tiny test dataset without any meaningful output. It will create all the assembly files, read quality plots, etc. in the directory given with --outdir (here assembly_test).
(The test data comes from the Unicycler assembler repository: rrwick/Unicycler)
--cachedir: Hopefully it will not happen, but your workflow run can stop prematurely. With this option cwltool keeps all intermediate output in the given directory and, when you start it again, tries to continue from where it stopped.
--tmpdir-prefix: Changes the location where cwltool stores its temporary files. These can be quite large.
cwltool can capture extensive provenance of a workflow run with the --provenance option. We use this by default in our UNLOCK infrastructure. We will go into this in more detail later.
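The options described above can be combined in a single call. The sketch below repeats the test run with a cache directory, a custom temporary location, and a provenance folder. The directory names (cwl_cache, cwl_tmp, assembly_test_provenance) are placeholders chosen for illustration, and the guard around the call is only there so that nothing is started when cwltool or the cloned repository is missing.

```shell
# Placeholder locations; adjust them to your system.
cache_dir="cwl_cache"                 # intermediate results kept for restarts
tmp_prefix="${PWD}/cwl_tmp/"          # prefix for cwltool's temporary directories
prov_dir="assembly_test_provenance"   # folder that receives the provenance

# Only start the run when both cwltool and the workflow file are present.
if command -v cwltool >/dev/null 2>&1 \
    && [ -f cwl/workflows/workflow_metagenomics_assembly.cwl ]; then
    cwltool \
        --cachedir "${cache_dir}" \
        --tmpdir-prefix "${tmp_prefix}" \
        --provenance "${prov_dir}" \
        --outdir assembly_test \
        cwl/workflows/workflow_metagenomics_assembly.cwl \
        cwl/tests/assembly/hybrid_small.yaml
else
    echo "cwltool or the cloned cwl repository was not found; nothing was run."
fi
```

If a run stops halfway, repeating the same command with the same cache directory should let cwltool reuse the steps that already finished.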