Module Descriptions

The functionality of dRep is broken up into modules. The user can run the modules separately, or together in workflows. For example, you could run:

$ dRep filter example_workD -g path/to/genomes*.fasta

$ dRep cluster example_workD

$ dRep analyze example_workD -pl a

OR:

$ dRep compare example_workD -g path/to/genomes*.fasta

These are two ways of doing the same thing. To see a list of available modules, check the help:

$ dRep -h

               ...::: dRep v2.0.0 :::...

 Choose one of the operations below for more detailed help.
 Example: dRep dereplicate -h

 Workflows:
   dereplicate  -> Combine several of the operations below to de-replicate a genome list
   compare      -> Simply compare a list of genomes

 Single operations:
   filter          -> Filter a genome list based on size, completeness, and/or contamination
   cluster         -> Compare and cluster a genome list based on MASH and ANIn/gANI
   choose          -> Choose the best genome from each genome cluster
   evaluate        -> Evaluate genome de-replication
   bonus           -> Other random operations (currently just determine taxonomy)
   analyze         -> Make figures related to the above operations; test alternative clustering

Work Directory

The work directory is where all of the program’s internal workings, log files, cached data, and output are stored. When running dRep modules multiple times on the same dataset, it is essential that you use the same work directory so the program can find the results of previous runs.

See also

Example Output
    for help finding where the output from your run is located in the work directory
Advanced Use
    for access to the raw internal data (which can be very useful)

Compare and Dereplicate

These are higher-level operations that call the modules below in succession.

Compare runs the modules:

  • cluster
  • bonus
  • evaluate
  • analyze

Dereplicate runs the modules:

  • filter
  • cluster
  • choose
  • bonus
  • evaluate
  • analyze
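The two workflows can be pictured as ordered lists of module calls. The following is a minimal sketch, not dRep's own code: the helper name `workflow_commands` is hypothetical, it only builds command lines rather than running them, and it assumes every module takes the work directory as its first positional argument (the help for bonus is not shown on this page, so that call in particular is an assumption).

```python
# Module order for each workflow, as listed above.
WORKFLOWS = {
    "compare": ["cluster", "bonus", "evaluate", "analyze"],
    "dereplicate": ["filter", "cluster", "choose", "bonus", "evaluate", "analyze"],
}

# Flags taken from the help text: 'a' means "all" for both modules.
EXTRA_ARGS = {"evaluate": ["-e", "a"], "analyze": ["-pl", "a"]}


def workflow_commands(workflow, work_directory, genomes):
    """Build the per-module command lines for one workflow (hypothetical helper)."""
    commands = []
    for position, module in enumerate(WORKFLOWS[workflow]):
        command = ["dRep", module, work_directory] + EXTRA_ARGS.get(module, [])
        if position == 0:
            # Only the first module is given the genomes; later modules
            # find earlier results in the shared work directory.
            command += ["-g", *genomes]
        commands.append(command)
    return commands


if __name__ == "__main__":
    for command in workflow_commands("compare", "example_workD", ["a.fasta", "b.fasta"]):
        print(" ".join(command))
```

Because every command reuses the same work directory, the sketch also shows why that directory must stay constant across calls.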

Filter

Filter is used to filter the genome set (for why this is necessary, see Choosing parameters). This is done using checkM. Genomes that don’t pass the length threshold are filtered out first, to avoid running checkM on them unnecessarily. Genomes that don’t pass the checkM thresholds are then filtered out before any comparisons are run, to avoid unnecessary comparisons.

Warning

All genomes must have at least one ORF called, or else checkM will stall, so a minimum length of at least 10,000 bp is recommended.

To see the command-line options, check the help:

$ dRep filter -h
usage: dRep filter [-p PROCESSORS] [-d] [-o] [-h] [-l LENGTH]
                   [-comp COMPLETENESS] [-con CONTAMINATION] [-str STRAIN_HTR]
                   [--skipCheckM] [-g [GENOMES [GENOMES ...]]] [--Chdb CHDB]
                   [--checkM_method {lineage_wf,taxonomy_wf}]
                   work_directory

positional arguments:
  work_directory        Directory where data and output
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --dry             dry run- dont do anything (default: False)
  -o, --overwrite       overwrite existing data in work folder (default:
                        False)
  -h, --help            show this help message and exit

FILTERING OPTIONS:
  -l LENGTH, --length LENGTH
                        Minimum genome length (default: 500000)
  -comp COMPLETENESS, --completeness COMPLETENESS
                        Minumum genome completeness (default: 75)
  -con CONTAMINATION, --contamination CONTAMINATION
                        Maximum genome contamination (default: 25)
  -str STRAIN_HTR, --strain_htr STRAIN_HTR
                        Maximum strain heterogeneity (default: 25)
  --skipCheckM          Don't run checkM- will ignore con and comp settings
                        (default: False)

I/O PARAMETERS:
  -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]]
                        genomes to filter in .fasta format. Not necessary if
                        Bdb or Wdb already exist (default: None)
  --Chdb CHDB           checkM run already completed. Must be in --tab_table
                        format. (default: None)
  --checkM_method {lineage_wf,taxonomy_wf}
                        Either lineage_wf (more accurate) or taxonomy_wf
                        (faster) (default: lineage_wf)
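The ordering described for Filter (cheap length check first, expensive quality check second) can be sketched as follows. This is an illustration under assumptions, not dRep's implementation: `filter_genomes` and the `run_checkm` callable are hypothetical names, the thresholds are treated as inclusive, and strain heterogeneity is left out for brevity. The default values are the ones shown in the help above.

```python
def filter_genomes(lengths, run_checkm, min_length=500000,
                   min_completeness=75, max_contamination=25):
    """Two-stage filter: length first, then checkM-style quality thresholds.

    lengths:    dict mapping genome name -> length in bp
    run_checkm: hypothetical callable; takes a list of genome names and
                returns {name: (completeness, contamination)}
    """
    # Stage 1: drop short genomes before the expensive quality step.
    long_enough = [name for name, length in lengths.items() if length >= min_length]

    # Stage 2: quality is only computed for genomes that survived stage 1.
    quality = run_checkm(long_enough)
    return [
        name for name in long_enough
        if quality[name][0] >= min_completeness
        and quality[name][1] <= max_contamination
    ]


if __name__ == "__main__":
    # Made-up quality values standing in for a real checkM run.
    fake_quality = {"g1": (98.0, 1.0), "g2": (60.0, 2.0), "g3": (90.0, 30.0)}

    def fake_checkm(names):
        return {name: fake_quality[name] for name in names}

    lengths = {"g1": 3_000_000, "g2": 2_000_000, "g3": 1_500_000, "tiny": 8_000}
    # "tiny" is removed by the length check and never reaches fake_checkm.
    print(filter_genomes(lengths, fake_checkm))
```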

Cluster

Cluster is the module that does the actual primary and secondary comparisons. Choosing parameters here can get a bit complicated; see Choosing parameters for information.

To see the command-line options, check the help:

$ dRep cluster -h
usage: dRep cluster [-p PROCESSORS] [-d] [-o] [-h] [-ms MASH_SKETCH]
                    [-pa P_ANI] [--S_algorithm {ANIn,gANI}] [-sa S_ANI]
                    [-nc COV_THRESH] [-n_PRESET {normal,tight}]
                    [--clusterAlg CLUSTERALG] [--SkipMash] [--SkipSecondary]
                    [-g [GENOMES [GENOMES ...]]]
                    work_directory

positional arguments:
  work_directory        Directory where data and output
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --dry             dry run- dont do anything (default: False)
  -o, --overwrite       overwrite existing data in work folder (default:
                        False)
  -h, --help            show this help message and exit

CLUSTERING PARAMETERS:
  -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH
                        MASH sketch size (default: 1000)
  -pa P_ANI, --P_ani P_ANI
                        ANI threshold to form primary (MASH) clusters
                        (default: 0.9)
  --S_algorithm {ANIn,gANI}
                        Algorithm for secondary clustering comaprisons
                        (default: ANIn)
  -sa S_ANI, --S_ani S_ANI
                        ANI threshold to form secondary clusters (default:
                        0.99)
  -nc COV_THRESH, --cov_thresh COV_THRESH
                        Minmum level of overlap between genomes when doing
                        secondary comparisons (default: 0.1)
  -n_PRESET {normal,tight}
                        Presents to pass to nucmer
                        tight   = only align highly conserved regions
                        normal  = default ANIn parameters (default: normal)
  --clusterAlg CLUSTERALG
                        Algorithm used to cluster genomes (passed to
                        scipy.cluster.hierarchy.linkage (default: average)
  --SkipMash            Skip MASH clustering, just do secondary clustering on
                        all genomes (default: False)
  --SkipSecondary       Skip secondary clustering, just perform MASH
                        clustering (default: False)

I/O PARAMETERS:
  -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]]
                        genomes to cluster in .fasta format. Not necessary if
                        already loaded sequences with the "filter" operation
                        (default: None)
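To make the two-step idea concrete, here is a small sketch of average-linkage clustering applied first at the primary threshold (0.9) and then at the secondary threshold (0.99) within each primary cluster. According to the help above, dRep passes the clustering algorithm to scipy.cluster.hierarchy.linkage; the pure-Python loop below is only a stand-in that mimics average linkage on a toy input. The function name and the ANI values are made up for illustration.

```python
from itertools import combinations


def average_linkage_clusters(genomes, ani, threshold):
    """Group genomes whose average pairwise ANI is at least `threshold`.

    genomes:   list of genome names
    ani:       dict mapping frozenset({a, b}) -> ANI between a and b (0..1)
    threshold: ANI cutoff, e.g. 0.9 (-pa default) or 0.99 (-sa default)
    """
    clusters = [[name] for name in genomes]

    def mean_ani(first, second):
        values = [ani[frozenset((a, b))] for a in first for b in second]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average ANI.
        best, i, j = max(
            (mean_ani(clusters[left], clusters[right]), left, right)
            for left, right in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    # Toy values: a fast, rough estimate for the primary step ...
    rough = {("A", "B"): 0.97, ("C", "D"): 0.995, ("A", "C"): 0.80,
             ("A", "D"): 0.80, ("B", "C"): 0.80, ("B", "D"): 0.80}
    rough = {frozenset(pair): value for pair, value in rough.items()}
    # ... and a more precise comparison, computed only within primary clusters.
    precise = {frozenset(("A", "B")): 0.98, frozenset(("C", "D")): 0.995}

    for primary in average_linkage_clusters(["A", "B", "C", "D"], rough, 0.9):
        print(primary, "->", average_linkage_clusters(primary, precise, 0.99))
```

In this toy input, A and B share a primary cluster but fall below the secondary threshold, so they end up in separate secondary clusters, while C and D stay together.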

Choose

Choose is the module that picks the best genome from each secondary cluster identified in Cluster. It does this based on the formula:

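\[\text{score} = A \cdot \text{completeness} - B \cdot \text{contamination} + C \cdot \left(\text{contamination} \cdot \frac{\text{strain\_heterogeneity}}{100}\right) + D \cdot \log(\text{N50}) + E \cdot \log(\text{size})\]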

Here A through E are command-line arguments, and the genome with the highest score is the “best”. By default, A through E are 1, 5, 1, 0.5, and 0, respectively.
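As a worked illustration of the formula, the sketch below scores two made-up genomes with the default weights and picks the higher one. The function name `genome_score` and the example values are hypothetical, and the logarithm is assumed to be base 10, since the formula above does not state the base.

```python
import math


def genome_score(completeness, contamination, strain_heterogeneity, n50, size,
                 A=1, B=5, C=1, D=0.5, E=0):
    """Score one genome with the formula above; the highest score in a cluster wins.

    Assumption: log is taken as base 10 here; the text does not state the base.
    """
    return (
        A * completeness
        - B * contamination
        + C * (contamination * (strain_heterogeneity / 100))
        + D * math.log10(n50)
        + E * math.log10(size)
    )


if __name__ == "__main__":
    # Made-up members of one secondary cluster.
    cluster = {
        "genome_a": dict(completeness=95, contamination=2, strain_heterogeneity=50,
                         n50=100_000, size=3_000_000),
        "genome_b": dict(completeness=98, contamination=6, strain_heterogeneity=0,
                         n50=50_000, size=3_100_000),
    }
    best = max(cluster, key=lambda name: genome_score(**cluster[name]))
    print(best)
```

With the default weights, genome_a wins despite being less complete, because each point of contamination costs five times as much as a point of completeness gains.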

To see the command-line options, check the help:

$ dRep choose -h
usage: dRep choose [-p PROCESSORS] [-d] [-o] [-h] [-comW COMPLETENESS_WEIGHT]
                   [-conW CONTAMINATION_WEIGHT]
                   [-strW STRAIN_HETEROGENEITY_WEIGHT] [-N50W N50_WEIGHT]
                   [-sizeW SIZE_WEIGHT]
                   [--checkM_method {taxonomy_wf,lineage_wf}]
                   work_directory

positional arguments:
  work_directory        Directory where data and output
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --dry             dry run- dont do anything (default: False)
  -o, --overwrite       overwrite existing data in work folder (default:
                        False)
  -h, --help            show this help message and exit

SCORING CRITERIA
Based off of the formula:
A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)

A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight:
  -comW COMPLETENESS_WEIGHT, --completeness_weight COMPLETENESS_WEIGHT
                        completeness weight (default: 1)
  -conW CONTAMINATION_WEIGHT, --contamination_weight CONTAMINATION_WEIGHT
                        contamination weight (default: 5)
  -strW STRAIN_HETEROGENEITY_WEIGHT, --strain_heterogeneity_weight STRAIN_HETEROGENEITY_WEIGHT
                        strain heterogeneity weight (default: 1)
  -N50W N50_WEIGHT, --N50_weight N50_WEIGHT
                        weight of log(genome N50) (default: 0.5)
  -sizeW SIZE_WEIGHT, --size_weight SIZE_WEIGHT
                        weight of log(genome size) (default: 0)

OTHER:
  --checkM_method {taxonomy_wf,lineage_wf}
                        Either lineage_wf (more accurate) or taxonomy_wf
                        (faster) (default: lineage_wf)

Analyze

Analyze is the module that makes all of the figures.

To see the command-line options, check the help:

$ dRep analyze -h
usage: dRep analyze [-p PROCESSORS] [-d] [-o] [-h] [-c CLUSTER] [-t THRESHOLD]
                    [-m {ANIn,gANI}] [-mc MINIMUM_COVERAGE]
                    [-a {complete,average,single,weighted}]
                    [-pl [PLOTS [PLOTS ...]]]
                    work_directory

positional arguments:
  work_directory        Directory where data and output
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --dry             dry run- dont do anything (default: False)
  -o, --overwrite       overwrite existing data in work folder (default:
                        False)
  -h, --help            show this help message and exit

PLOTTING:
  -pl [PLOTS [PLOTS ...]], --plots [PLOTS [PLOTS ...]]
                        Plots. Input 'all' or 'a' to plot all
                        1) Primary clustering dendrogram
                        2) Secondary clustering dendrograms
                        3) Secondary clusters heatmaps
                        4) Comparison scatterplots
                        5) Cluster scorring plot
                        6) Winning genomes
                         (default: None)

Evaluate

Evaluate performs a series of checks to alert the user to potential problems with de-replication. It can look for two things:

  • de-replicated genome similarity: this compares all of the de-replicated genomes to each other to make sure they’re not too similar. It is meant to catch cases where similar genomes were split into different primary clusters, and thus failed to be de-replicated. Depending on the number of de-replicated genomes, this can take a while.

  • secondary clusters that were almost different: this alerts you to cases where genomes are on the edge between being considered “same” or “different”, depending on the clustering parameters you used. This module reads the parameters you used during clustering from the work directory, so you don’t need to specify them again.

To see the command-line options, check the help:

$ dRep evaluate -h
usage: dRep evaluate [-p PROCESSORS] [-d] [-o] [-h] [--warn_dist WARN_DIST]
                     [--warn_sim WARN_SIM] [--warn_aln WARN_ALN]
                     [-e [EVALUATE [EVALUATE ...]]]
                     work_directory

positional arguments:
  work_directory        Directory where data and output
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --dry             dry run- dont do anything (default: False)
  -o, --overwrite       overwrite existing data in work folder (default:
                        False)
  -h, --help            show this help message and exit

WARNINGS:
  --warn_dist WARN_DIST
                        How far from the threshold to throw cluster warnings
                        (default: 0.25)
  --warn_sim WARN_SIM   Similarity threshold for warnings between dereplicated
                        genomes (default: 0.98)
  --warn_aln WARN_ALN   Minimum aligned fraction for warnings between
                        dereplicated genomes (ANIn) (default: 0.25)

EVALUATIONS:
  -e [EVALUATE [EVALUATE ...]], --evaluate [EVALUATE [EVALUATE ...]]
                        Things to evaluate Input 'all' or 'a' to evaluate all
                        1) Evaluate de-replicated genome similarity
                        2) Throw warnings for clusters that were almost different
                        3) Generate a database of information on winning genomes
                         (default: None)
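The first check (de-replicated genome similarity) can be pictured as a simple filter over pairwise comparisons of the winning genomes. This is a sketch under assumptions rather than dRep's code: the function name and input layout are hypothetical, and it assumes a warning requires both the similarity and the aligned-fraction threshold to be met, with inclusive comparisons. The defaults are the `--warn_sim` and `--warn_aln` values from the help above.

```python
def similarity_warnings(comparisons, warn_sim=0.98, warn_aln=0.25):
    """Return pairs of de-replicated genomes that look too similar.

    comparisons: iterable of (genome_a, genome_b, ani, aligned_fraction)
    Assumption: a pair is flagged only when BOTH thresholds are met.
    """
    return [
        (a, b)
        for a, b, ani, aligned in comparisons
        if ani >= warn_sim and aligned >= warn_aln
    ]


if __name__ == "__main__":
    # Made-up comparisons between winning genomes.
    comparisons = [
        ("win1", "win2", 0.985, 0.60),  # similar and well aligned: flagged
        ("win1", "win3", 0.990, 0.05),  # similar, but too little aligned
        ("win2", "win3", 0.900, 0.80),  # well aligned, but not similar
    ]
    print(similarity_warnings(comparisons))
```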

Bonus

Bonus consists of operations that don’t really fit in with the other functions of dRep, but can be helpful. Currently, the only thing it can do is determine the taxonomy of your bins. This is done using centrifuge, similar to how anvi’o does it. If you choose to use this option, the taxonomy of each genome will be shown alongside its filename in most figures.