This guide explains how to run MetaCompass.
Before running MetaCompass, ensure that you have the following:
- Nextflow (setting it up using our conda environment is recommended)
- Reference database (ref_db_path) (see build instructions here)
- Input forward and reverse read files in FASTQ format
- Other requirements
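The prerequisites above can be checked before launching a run. The following is a minimal sketch, not part of MetaCompass: check_inputs is a hypothetical helper, and it assumes the reference database is a path on the local filesystem and that the read files are non-empty.

```shell
#!/bin/sh
# Pre-flight check sketch; check_inputs is a hypothetical helper, not part of MetaCompass.
# Usage: check_inputs REF_DB FORWARD_READS REVERSE_READS
check_inputs() {
    ci_status=0

    # Nextflow must be callable (for example, after activating the conda environment)
    if ! command -v nextflow >/dev/null 2>&1; then
        echo "missing: nextflow is not on PATH" >&2
        ci_status=1
    fi

    # The reference database path must exist
    if [ ! -e "$1" ]; then
        echo "missing: reference database $1" >&2
        ci_status=1
    fi

    # Both read files must exist and be non-empty
    for ci_file in "$2" "$3"; do
        if [ ! -s "$ci_file" ]; then
            echo "missing or empty: read file $ci_file" >&2
            ci_status=1
        fi
    done

    return "$ci_status"
}
```

The function returns a non-zero status if anything is missing, so a run script could call it with the database path and the two read files and stop early when it fails.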
To run MetaCompass effectively, we recommend creating a script that incorporates the following key elements:
- Input Data Paths: Specify the paths to the input forward and reverse read files.
- Parameter Configuration: Define various parameters, including:
  - The reference database
  - The output directory
  - The number of threads to use
Creating such a script will simplify the execution of MetaCompass for your specific analysis.
# Set the paths to your input data and reference database by modifying
# the following variables in your shell script:
forward_read="forward_read.fastq.gz" # a plain .fastq file also works
reverse_read="reverse_read.fastq.gz" # a plain .fastq file also works
ref_db_path="/path/to/your/reference/database"
output_folder="/path/to/your/output/directory"
# Run MetaCompass with these variables using the following command.
# Required: --reference_db, --forward, --reverse, -output-dir
# Optional: --threads (default: 16), --trace_file_name, -with-timeline, -with-dag
# ${read} is not set above; define it before running, otherwise it
# expands to an empty string and the DAG file is named _dag.png.
nextflow run metacompass2.nf \
    --reference_db "${ref_db_path}" \
    --forward "$forward_read" \
    --reverse "$reverse_read" \
    -output-dir "$output_folder" \
    --threads 8 \
    --trace_file_name "$output_folder/trace.txt" \
    -with-timeline "$output_folder/timeline.html" \
    -with-dag "$output_folder/${read}_dag.png"
# --trace_file_name: Path to a Nextflow trace file. [optional]
# -with-timeline: Generates a timeline HTML report. [optional]
# -with-dag: Generates a Directed Acyclic Graph (DAG) visualization. [optional]
Run the script:
./example.sh
Monitor the progress and view the results in the specified output directory.
By default, MetaCompass also generates a de novo assembly using the reads that were not used by the reference-guided step. To turn this feature off, pass --de_novo 0; to turn it on explicitly, pass --de_novo 1.
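One way to keep this switch visible in a run script is to hold it in a variable and check it before use. This is a minimal sketch: de_novo and validate_de_novo are hypothetical names, and it assumes 0 and 1 are the only values --de_novo accepts, since those are the two described above.

```shell
#!/bin/sh
# validate_de_novo is a hypothetical helper; --de_novo is the MetaCompass parameter named above.
# Assumption: only 0 (off) and 1 (on) are meaningful values.
validate_de_novo() {
    case "$1" in
        0|1) return 0 ;;
        *)   echo "--de_novo expects 0 or 1, got: $1" >&2; return 1 ;;
    esac
}

de_novo=0   # 0 = skip the de novo assembly, 1 = run it

if validate_de_novo "$de_novo"; then
    # Dry run: show the option that would be appended to the nextflow command
    echo "would pass: --de_novo $de_novo"
fi
```

In a real run script, the echo would be replaced by appending --de_novo "$de_novo" to the nextflow run command.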
There will be several sub-folders within the main output folder:
- reference_selection/: contains the outputs from the reference selection step. Notably, the file ref_genome_marker_gene_coverage.tsv lists all the genomes that have hits to at least one marker gene and provides information about the depth and breadth of coverage for the full marker gene set. This file can be used to determine why a genome you expect to be found in a sample was not actually selected by MetaCompass.
- reference_culling/: contains the outputs from the step that clusters together the genomes selected in the reference selection step. This folder should contain a separate .fna file for each cluster, as well as several files named clusters.* that include information about the membership of each genome cluster.
- reference_assembly/: contains the assemblies generated for each genome in each cluster. The folder should contain several sub-folders, one per genome, each containing the assembled sequence for that reference genome sequence.
- denovo_assembly/: contains the output of the de novo assembler (if run). Currently it should only contain one folder, megahit_out, corresponding to the output of MEGAHIT.
- output/: contains the key outputs from MetaCompass: the separate assemblies of each genome, the de novo assembly, and a file called all_contigs.fna containing all the contigs generated by both the reference-guided and de novo assemblies.
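Two quick checks on these outputs can be sketched in shell. The helpers count_contigs and find_genome are hypothetical and not part of MetaCompass. The sketch assumes all_contigs.fna is an uncompressed FASTA file and makes no assumption about the column layout of the coverage TSV, so it only searches for matching lines.

```shell
#!/bin/sh
# count_contigs and find_genome are hypothetical helpers, not part of MetaCompass.

# Count FASTA records: each contig starts with a '>' header line.
# Usage: count_contigs path/to/all_contigs.fna
count_contigs() {
    grep -c '^>' "$1"
}

# Print every line of the coverage TSV that mentions a genome identifier.
# The identifier is matched as a fixed string, not as a regular expression.
# Usage: find_genome path/to/ref_genome_marker_gene_coverage.tsv GENOME_ID
find_genome() {
    grep -F -- "$2" "$1"
}
```

For the layout described above, the first helper would be pointed at output/all_contigs.fna and the second at reference_selection/ref_genome_marker_gene_coverage.tsv, both inside the main output folder. If find_genome prints nothing, the genome had no hits recorded in that file.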