User manual

dRep has three commands: compare, dereplicate, and check_dependencies. To list them, check the help:

$ dRep -h

              ...::: dRep v3.0.0 :::...

Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020)

See https://drep.readthedocs.io/en/latest/index.html for documentation
Choose one of the operations below for more detailed help.

Example: dRep dereplicate -h

Commands:
  compare            -> Compare and cluster a set of genomes
  dereplicate        -> De-replicate a set of genomes
  check_dependencies -> Check which dependencies are properly installed

In versions of dRep before v3, a number of additional modules could be run separately; they can now only be run as part of the larger workflows compare and dereplicate. Many of the modules are shared by compare and dereplicate, however, and where the same parameter appears in both workflows it functions identically in each.

The program help describes every adjustable parameter. If any of these are particularly confusing, don’t hesitate to send an email to ask what they do.

See also

Important Concepts
for theoretical thoughts about how to choose appropriate parameters and thresholds
Example Output
for help interpreting the output from your run in the work directory
Advanced Use
for access to the raw output data and the python API

Compare

This workflow compares a set of genomes. For a list of all parameters, check the help:

$ dRep compare -h
  usage: dRep compare [-p PROCESSORS] [-d] [-h] [-g [GENOMES [GENOMES ...]]]
                      [--S_algorithm {fastANI,gANI,goANI,ANIn,ANImf}]
                      [-ms MASH_SKETCH] [--SkipMash] [--SkipSecondary]
                      [--n_PRESET {normal,tight}] [-pa P_ANI] [-sa S_ANI]
                      [-nc COV_THRESH] [-cm {total,larger}]
                      [--clusterAlg {median,weighted,single,complete,average,ward,centroid}]
                      [--multiround_primary_clustering]
                      [--primary_chunksize PRIMARY_CHUNKSIZE]
                      [--greedy_secondary_clustering]
                      [--run_tertiary_clustering] [--warn_dist WARN_DIST]
                      [--warn_sim WARN_SIM] [--warn_aln WARN_ALN]
                      work_directory

  positional arguments:
    work_directory        Directory where data and output are stored
                          *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

  SYSTEM PARAMETERS:
    -p PROCESSORS, --processors PROCESSORS
                          threads (default: 6)
    -d, --debug           make extra debugging output (default: False)
    -h, --help            show this help message and exit

  GENOME INPUT:
    -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]]
                          genomes to filter in .fasta format. Not necessary if
                          Bdb or Wdb already exist. Can also input a text file
                          with paths to genomes, which results in fewer OS
                          issues than wildcard expansion (default: None)

  GENOME COMPARISON OPTIONS:
    --S_algorithm {fastANI,gANI,goANI,ANIn,ANImf}
                          Algorithm for secondary clustering comaprisons:
                          fastANI = Kmer-based approach; very fast
                          ANImf   = (DEFAULT) Align whole genomes with nucmer; filter alignment; compare aligned regions
                          ANIn    = Align whole genomes with nucmer; compare aligned regions
                          gANI    = Identify and align ORFs; compare aligned ORFS
                          goANI   = Open source version of gANI; requires nsmimscan
                           (default: ANImf)
    -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH
                          MASH sketch size (default: 1000)
    --SkipMash            Skip MASH clustering, just do secondary clustering on
                          all genomes (default: False)
    --SkipSecondary       Skip secondary clustering, just perform MASH
                          clustering (default: False)
    --n_PRESET {normal,tight}
                          Presets to pass to nucmer
                          tight   = only align highly conserved regions
                          normal  = default ANIn parameters (default: normal)

  GENOME CLUSTERING OPTIONS:
    -pa P_ANI, --P_ani P_ANI
                          ANI threshold to form primary (MASH) clusters
                          (default: 0.9)
    -sa S_ANI, --S_ani S_ANI
                          ANI threshold to form secondary clusters (default:
                          0.99)
    -nc COV_THRESH, --cov_thresh COV_THRESH
                          Minmum level of overlap between genomes when doing
                          secondary comparisons (default: 0.1)
    -cm {total,larger}, --coverage_method {total,larger}
                          Method to calculate coverage of an alignment
                          (for ANIn/ANImf only; gANI and fastANI can only do larger method)
                          total   = 2*(aligned length) / (sum of total genome lengths)
                          larger  = max((aligned length / genome 1), (aligned_length / genome2))
                           (default: larger)
    --clusterAlg {median,weighted,single,complete,average,ward,centroid}
                          Algorithm used to cluster genomes (passed to
                          scipy.cluster.hierarchy.linkage (default: average)

  GREEDY CLUSTERING OPTIONS
  These decrease RAM use and runtime at the expense of a minor loss in accuracy.
  Recommended when clustering 5000+ genomes:
    --multiround_primary_clustering
                          Cluster each primary clunk separately and merge at the
                          end with single linkage. Decreases RAM usage and
                          increases speed, and the cost of a minor loss in
                          precision and the inability to plot
                          primary_clustering_dendrograms. Especially helpful
                          when clustering 5000+ genomes. Will be done with
                          single linkage clustering (default: False)
    --primary_chunksize PRIMARY_CHUNKSIZE
                          Impacts multiround_primary_clustering. If you have
                          more than this many genomes, process them in chunks of
                          this size. (default: 5000)
    --greedy_secondary_clustering
                          Use a heuristic to avoid pair-wise comparisons when
                          doing secondary clustering. Will be done with single
                          linkage clustering. Only works for fastANI S_algorithm
                          option at the moment (default: False)
    --run_tertiary_clustering
                          Run an additional round of clustering on the final
                          genome set. This is especially useful when greedy
                          clustering is performed and/or to handle cases where
                          similar genomes end up in different primary clusters.
                          Only works with dereplicate, not compare. (default:
                          False)

  WARNINGS:
    --warn_dist WARN_DIST
                          How far from the threshold to throw cluster warnings
                          (default: 0.25)
    --warn_sim WARN_SIM   Similarity threshold for warnings between dereplicated
                          genomes (default: 0.98)
    --warn_aln WARN_ALN   Minimum aligned fraction for warnings between
                          dereplicated genomes (ANIn) (default: 0.25)

  Example: dRep compare output_dir/ -g /path/to/genomes/*.fasta
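
The help text above notes that -g also accepts a text file of genome paths, which causes fewer OS issues than wildcard expansion. A minimal Python sketch for building such a file follows. The one-path-per-line layout, the helper name write_genome_list, and the file name genome_paths.txt are assumptions made for illustration and are not taken from the dRep help.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_path, pattern="*.fasta"):
    """Write the absolute path of every matching genome file, one per line.

    Assumption: dRep reads the list file as one genome path per line.
    Returns the number of paths written.
    """
    genomes = sorted(Path(genome_dir).glob(pattern))
    with open(list_path, "w") as handle:
        for genome in genomes:
            handle.write(f"{genome.resolve()}\n")
    return len(genomes)

# Hypothetical usage:
#   write_genome_list("/path/to/genomes", "genome_paths.txt")
```

The resulting file would then be passed in place of the wildcard, for example: dRep compare output_dir/ -g genome_paths.txt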

Dereplicate

This workflow dereplicates a set of genomes. For a list of all parameters, check the help:

$ dRep dereplicate -h
usage: dRep dereplicate [-p PROCESSORS] [-d] [-h] [-g [GENOMES [GENOMES ...]]]
                        [-l LENGTH] [-comp COMPLETENESS] [-con CONTAMINATION]
                        [--ignoreGenomeQuality] [--genomeInfo GENOMEINFO]
                        [--checkM_method {taxonomy_wf,lineage_wf}]
                        [--set_recursion SET_RECURSION]
                        [--S_algorithm {goANI,ANIn,gANI,ANImf,fastANI}]
                        [-ms MASH_SKETCH] [--SkipMash] [--SkipSecondary]
                        [--n_PRESET {normal,tight}] [-pa P_ANI] [-sa S_ANI]
                        [-nc COV_THRESH] [-cm {total,larger}]
                        [--clusterAlg {single,ward,complete,weighted,centroid,median,average}]
                        [--multiround_primary_clustering]
                        [--primary_chunksize PRIMARY_CHUNKSIZE]
                        [--greedy_secondary_clustering]
                        [--run_tertiary_clustering]
                        [-comW COMPLETENESS_WEIGHT]
                        [-conW CONTAMINATION_WEIGHT]
                        [-strW STRAIN_HETEROGENEITY_WEIGHT] [-N50W N50_WEIGHT]
                        [-sizeW SIZE_WEIGHT] [-centW CENTRALITY_WEIGHT]
                        [--warn_dist WARN_DIST] [--warn_sim WARN_SIM]
                        [--warn_aln WARN_ALN]
                        work_directory

positional arguments:
  work_directory        Directory where data and output are stored
                        *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS ***

SYSTEM PARAMETERS:
  -p PROCESSORS, --processors PROCESSORS
                        threads (default: 6)
  -d, --debug           make extra debugging output (default: False)
  -h, --help            show this help message and exit

GENOME INPUT:
  -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]]
                        genomes to filter in .fasta format. Not necessary if
                        Bdb or Wdb already exist. Can also input a text file
                        with paths to genomes, which results in fewer OS
                        issues than wildcard expansion (default: None)

GENOME FILTERING OPTIONS:
  -l LENGTH, --length LENGTH
                        Minimum genome length (default: 50000)
  -comp COMPLETENESS, --completeness COMPLETENESS
                        Minumum genome completeness (default: 75)
  -con CONTAMINATION, --contamination CONTAMINATION
                        Maximum genome contamination (default: 25)

GENOME QUALITY ASSESSMENT OPTIONS:
  --ignoreGenomeQuality
                        Don't run checkM or do any quality filtering. NOT
                        RECOMMENDED! This is useful for use with
                        bacteriophages or eukaryotes or things where checkM
                        scoring does not work. Will only choose genomes based
                        on length and N50 (default: False)
  --genomeInfo GENOMEINFO
                        location of .csv file containing quality information
                        on the genomes. Must contain: ["genome"(basename of
                        .fasta file of that genome), "completeness"(0-100
                        value for completeness of the genome),
                        "contamination"(0-100 value of the contamination of
                        the genome)] (default: None)
  --checkM_method {taxonomy_wf,lineage_wf}
                        Either lineage_wf (more accurate) or taxonomy_wf
                        (faster) (default: lineage_wf)
  --set_recursion SET_RECURSION
                        Increases the python recursion limit. NOT RECOMMENDED
                        unless checkM is crashing due to recursion issues.
                        Recommended to set to 2000 if needed, but setting this
                        could crash python (default: 0)

GENOME COMPARISON OPTIONS:
  --S_algorithm {goANI,ANIn,gANI,ANImf,fastANI}
                        Algorithm for secondary clustering comaprisons:
                        fastANI = Kmer-based approach; very fast
                        ANImf   = (DEFAULT) Align whole genomes with nucmer; filter alignment; compare aligned regions
                        ANIn    = Align whole genomes with nucmer; compare aligned regions
                        gANI    = Identify and align ORFs; compare aligned ORFS
                        goANI   = Open source version of gANI; requires nsmimscan
                         (default: ANImf)
  -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH
                        MASH sketch size (default: 1000)
  --SkipMash            Skip MASH clustering, just do secondary clustering on
                        all genomes (default: False)
  --SkipSecondary       Skip secondary clustering, just perform MASH
                        clustering (default: False)
  --n_PRESET {normal,tight}
                        Presets to pass to nucmer
                        tight   = only align highly conserved regions
                        normal  = default ANIn parameters (default: normal)

GENOME CLUSTERING OPTIONS:
  -pa P_ANI, --P_ani P_ANI
                        ANI threshold to form primary (MASH) clusters
                        (default: 0.9)
  -sa S_ANI, --S_ani S_ANI
                        ANI threshold to form secondary clusters (default:
                        0.99)
  -nc COV_THRESH, --cov_thresh COV_THRESH
                        Minmum level of overlap between genomes when doing
                        secondary comparisons (default: 0.1)
  -cm {total,larger}, --coverage_method {total,larger}
                        Method to calculate coverage of an alignment
                        (for ANIn/ANImf only; gANI and fastANI can only do larger method)
                        total   = 2*(aligned length) / (sum of total genome lengths)
                        larger  = max((aligned length / genome 1), (aligned_length / genome2))
                         (default: larger)
  --clusterAlg {single,ward,complete,weighted,centroid,median,average}
                        Algorithm used to cluster genomes (passed to
                        scipy.cluster.hierarchy.linkage (default: average)

GREEDY CLUSTERING OPTIONS
These decrease RAM use and runtime at the expense of a minor loss in accuracy.
Recommended when clustering 5000+ genomes:
  --multiround_primary_clustering
                        Cluster each primary clunk separately and merge at the
                        end with single linkage. Decreases RAM usage and
                        increases speed, and the cost of a minor loss in
                        precision and the inability to plot
                        primary_clustering_dendrograms. Especially helpful
                        when clustering 5000+ genomes. Will be done with
                        single linkage clustering (default: False)
  --primary_chunksize PRIMARY_CHUNKSIZE
                        Impacts multiround_primary_clustering. If you have
                        more than this many genomes, process them in chunks of
                        this size. (default: 5000)
  --greedy_secondary_clustering
                        Use a heuristic to avoid pair-wise comparisons when
                        doing secondary clustering. Will be done with single
                        linkage clustering. Only works for fastANI S_algorithm
                        option at the moment (default: False)
  --run_tertiary_clustering
                        Run an additional round of clustering on the final
                        genome set. This is especially useful when greedy
                        clustering is performed and/or to handle cases where
                        similar genomes end up in different primary clusters.
                        Only works with dereplicate, not compare. (default:
                        False)

SCORING CRITERIA
Based off of the formula:
A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size) + F*(centrality - S_ani)

A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight; F = cent_weight:
  -comW COMPLETENESS_WEIGHT, --completeness_weight COMPLETENESS_WEIGHT
                        completeness weight (default: 1)
  -conW CONTAMINATION_WEIGHT, --contamination_weight CONTAMINATION_WEIGHT
                        contamination weight (default: 5)
  -strW STRAIN_HETEROGENEITY_WEIGHT, --strain_heterogeneity_weight STRAIN_HETEROGENEITY_WEIGHT
                        strain heterogeneity weight (default: 1)
  -N50W N50_WEIGHT, --N50_weight N50_WEIGHT
                        weight of log(genome N50) (default: 0.5)
  -sizeW SIZE_WEIGHT, --size_weight SIZE_WEIGHT
                        weight of log(genome size) (default: 0)
  -centW CENTRALITY_WEIGHT, --centrality_weight CENTRALITY_WEIGHT
                        Weight of (centrality - S_ani) (default: 1)

WARNINGS:
  --warn_dist WARN_DIST
                        How far from the threshold to throw cluster warnings
                        (default: 0.25)
  --warn_sim WARN_SIM   Similarity threshold for warnings between dereplicated
                        genomes (default: 0.98)
  --warn_aln WARN_ALN   Minimum aligned fraction for warnings between
                        dereplicated genomes (ANIn) (default: 0.25)

Example: dRep dereplicate output_dir/ -g /path/to/genomes/*.fasta
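
The SCORING CRITERIA formula in the help text above can be written out as a small Python function to see how the weights interact. This is a sketch of the printed formula only, not dRep's internal code. The function name genome_score and its argument names are hypothetical, and the use of base-10 logarithms is an assumption, since the help text does not state the base.

```python
import math


def genome_score(completeness, contamination, strain_heterogeneity,
                 n50, size, centrality, s_ani,
                 com_w=1.0, con_w=5.0, str_w=1.0,
                 n50_w=0.5, size_w=0.0, cent_w=1.0):
    """Score a genome using the formula printed in the help text.

    Weight defaults mirror the -comW, -conW, -strW, -N50W, -sizeW and
    -centW defaults. Assumption: log means log base 10.
    """
    return (com_w * completeness
            - con_w * contamination
            + str_w * (contamination * (strain_heterogeneity / 100))
            + n50_w * math.log10(n50)
            + size_w * math.log10(size)
            + cent_w * (centrality - s_ani))
```

With the default weights, a genome at 95% completeness, 2% contamination, 50% strain heterogeneity, an N50 of 100,000 and a centrality of 0.995 at S_ani 0.99 scores about 88.505. Contamination costs five points per percent, so it outweighs the N50 term by a wide margin.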

Work Directory

The work directory is where all of the program’s internal workings, log files, cached data, and output are stored.

See also

Example Output
for help finding where the output from your run is located in the work directory
Advanced Use
for access to the raw internal data (which can be very useful)

Genome filtering

In the dereplicate workflow, the genome set is quality-filtered first (for why this is necessary, see Important Concepts). Quality is assessed using checkM. Genomes that fail the length threshold are removed first, to avoid running checkM unnecessarily. Genomes that fail the checkM thresholds are then removed before any comparisons are run, to avoid running comparisons unnecessarily.

Warning

All genomes must have at least one ORF called or else checkM will stall, so a minimum length of at least 10,000 bp is recommended.
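
The two-stage filtering order described in this section can be sketched in Python as follows. This is an illustration only, not dRep's implementation. The function name filter_genomes and the input dictionary layout are hypothetical, the default thresholds are copied from the dereplicate help (-l, -comp, -con), and treating the thresholds as inclusive is an assumption.

```python
def filter_genomes(genomes, min_length=50000,
                   min_completeness=75, max_contamination=25):
    """Apply the length filter first, then the quality filter.

    `genomes` maps a genome name to a dict with 'length', 'completeness'
    and 'contamination' keys (a hypothetical layout for illustration).
    Returns (passed, failed_length, failed_quality) as lists of names.
    """
    # Stage 1: length filter, so quality is never assessed for short genomes.
    long_enough = [g for g, info in genomes.items()
                   if info["length"] >= min_length]
    failed_length = [g for g in genomes if g not in long_enough]

    # Stage 2: quality filter on the survivors only, before any comparisons.
    passed, failed_quality = [], []
    for g in long_enough:
        info = genomes[g]
        if (info["completeness"] >= min_completeness
                and info["contamination"] <= max_contamination):
            passed.append(g)
        else:
            failed_quality.append(g)
    return passed, failed_length, failed_quality
```

Only the genomes in the first returned list would go on to the comparison steps.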

Warnings

A series of checks is performed to alert the user to potential problems with de-replication. Two things are checked:

de-replicated genome similarity: this compares all of the de-replicated genomes to each other to make sure they’re not too similar. It is meant to catch cases where similar genomes were split into different primary clusters and thus failed to be de-replicated. Depending on the number of de-replicated genomes, this can take a while.

secondary clusters that were almost different: this alerts you to cases where genomes are on the edge between being considered “same” or “different”, depending on the clustering parameters you used. This check reads the parameters you used during clustering from the work directory, so you don’t need to specify them again.

Overall these warnings are a bit half-baked, however, and I personally don’t pay attention to them when running dRep myself.