dRep API

This allows you to call the internal methods of dRep from your own Python program.

drep.d_filter

d_filter - a subset of drep

Filter genomes based on genome length or quality. It can also run prodigal and checkM.

drep.d_filter.calc_fasta_length(fasta_loc)

Calculate the length of the .fasta file and return the length

Parameters:fasta_loc – location of .fasta file
Returns:total length of all sequences in the .fasta file
Return type:int
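
As an illustration of what "length of a .fasta file" means here, the following is a minimal stand-alone sketch, not dRep's own implementation. The helper name fasta_length and the temporary example file are hypothetical; the sketch simply sums the characters on every non-header line.

```python
import os
import tempfile


def fasta_length(fasta_loc):
    """Sum the lengths of all sequence lines in a FASTA file."""
    total = 0
    with open(fasta_loc) as handle:
        for line in handle:
            line = line.strip()
            # Header lines start with ">" and carry no sequence.
            if not line or line.startswith(">"):
                continue
            total += len(line)
    return total


# Write a tiny two-contig FASTA file to a temporary location and measure it.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n")
    example_loc = tmp.name

example_length = fasta_length(example_loc)  # 8 + 4 + 4 bases
os.remove(example_loc)
print(example_length)
```

With dRep installed, the documented call drep.d_filter.calc_fasta_length(fasta_loc) would be used in place of the hypothetical helper.
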
drep.d_filter.calc_genome_info(genomes: list)

Calculate the length and N50 of a list of genome locations

Parameters:genomes – list of locations of genomes
Returns:pandas dataframe with [“location”, “length”, “N50”, “genome”]
Return type:DataFrame
drep.d_filter.calc_n50(loc)

Calculate the N50 of a .fasta file

Parameters:loc – location of .fasta file.
Returns:N50 of .fasta file.
Return type:int
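
For readers unfamiliar with N50, here is a small sketch of the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total reaches half of the assembly size. This is an independent illustration that takes a list of contig lengths rather than a file; the function name n50 is hypothetical, and dRep's own handling of edge cases may differ.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0


example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(example_n50)
```
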
drep.d_filter.chdb_to_genomeInfo(chdb)

Convert the output of checkM (chdb) into genomeInfo

Parameters:chdb – dataframe of checkM
Returns:genomeInfo
Return type:DataFrame
drep.d_filter.d_filter_wrapper(wd, **kwargs)

Controller for the dRep filter operation

Parameters:
  • wd (WorkDirectory) – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 
  • genomes – genomes to filter in .fasta format
  • genomeInfo – location of .csv file with the columns: [“genome”(basename of .fasta file of that genome), “completeness”(0-100 value for completeness of the genome), “contamination”(0-100 value of the contamination of the genome)]
  • processors – Threads to use with checkM / prodigal
  • overwrite – Overwrite existing data in the work folder
  • debug – If True, make extra output when running external scripts
  • length – minimum genome length when filtering
  • completeness – minimum genome completeness when filtering
  • contamination – maximum genome contamination when filtering
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
Returns:

stores Bdb.csv, Chdb.csv, and GenomeInfo.csv in the work directory

Return type:

Nothing

drep.d_filter.filter_bdb(bdb, Gdb, **kwargs)

Filter bdb based on Gdb

Parameters:
  • bdb – DataFrame with [“genome”]
  • Gdb – DataFrame with [“genome”, “completeness”, “contamination”]
Keyword Arguments:
 
  • min_comp – Minimum genome completeness (%)
  • max_con – Maximum genome contamination (%)
Returns:

bdb filtered based on completeness and contamination

Return type:

DataFrame
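
The filtering described above can be sketched without dRep or pandas. In this hedged example, lists of dictionaries stand in for the bdb and Gdb DataFrames, the function name filter_by_quality is hypothetical, and treating both thresholds as inclusive is an assumption rather than documented behaviour.

```python
def filter_by_quality(bdb_rows, gdb_rows, min_comp, max_con):
    """Keep only bdb rows whose genome passes both quality thresholds."""
    # Collect the names of genomes that meet the thresholds (assumed inclusive).
    passing = {
        row["genome"]
        for row in gdb_rows
        if row["completeness"] >= min_comp and row["contamination"] <= max_con
    }
    return [row for row in bdb_rows if row["genome"] in passing]


bdb_rows = [{"genome": "a.fasta"}, {"genome": "b.fasta"}, {"genome": "c.fasta"}]
gdb_rows = [
    {"genome": "a.fasta", "completeness": 98.0, "contamination": 1.2},
    {"genome": "b.fasta", "completeness": 60.0, "contamination": 0.5},
    {"genome": "c.fasta", "completeness": 90.0, "contamination": 30.0},
]

kept = filter_by_quality(bdb_rows, gdb_rows, min_comp=75, max_con=25)
kept_names = [row["genome"] for row in kept]
print(kept_names)
```

Here b.fasta fails the completeness threshold and c.fasta fails the contamination threshold, so only a.fasta remains.
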

drep.d_filter.run_checkM(genome_folder_whole, checkm_outf_whole, **kwargs)

Run checkM

WARNING: this will result in incorrect genome length and genome N50 estimates, because checkM is run on prodigal output

Parameters:
  • genome_folder_whole – location of folder to run checkM on - should be full of files ending in .faa (result of prodigal)
  • checkm_outf_whole – location of folder to store checkM output
Keyword Arguments:
 
  • processors – number of threads
  • checkm_method – either lineage_wf or taxonomy_wf
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
  • set_recursion – if not 0, set the python recursion
drep.d_filter.run_prodigal(genome_list, out_dir, **kwargs)

Run prodigal on a set of genomes, store the output in the out_dir

Parameters:
  • genome_list – list of genomes to run prodigal on
  • out_dir – output directory to store prodigal output
Keyword Arguments:
 
  • processors – number of processors to multithread with
  • exe_loc – location of prodigal executable (will try to find it with shutil if not provided)
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
drep.d_filter.sanity_check(bdb, **kwargs)

Make sure there are no duplicate names or anything

Parameters:bdb

Returns:

drep.d_filter.validate_chdb(Chdb, bdb)

Make sure all genomes in bdb are in Chdb

Parameters:
  • Chdb – dataframe of checkM information
  • bdb – dataframe with [‘genome’]
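
The check described here amounts to a set comparison. The sketch below is only an illustration under stated assumptions: it takes plain lists of genome names instead of DataFrames, and the function name missing_genomes is hypothetical. How dRep reports a failed validation is not covered by this document.

```python
def missing_genomes(bdb_genomes, chdb_genomes):
    """Return the sorted genome names present in bdb but absent from Chdb."""
    known = set(chdb_genomes)
    return sorted(name for name in set(bdb_genomes) if name not in known)


missing = missing_genomes(["a.fasta", "b.fasta", "c.fasta"], ["a.fasta", "c.fasta"])
print(missing)
```

An empty result means every genome in bdb has checkM information.
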

drep.d_cluster

drep.d_choose

d_choose - a subset of drep

Choose best genome from each cluster

drep.d_choose.add_centrality(wd, Gdb, **kwargs)

Add a column named “centrality” to genome info

drep.d_choose.calc_centrality_from_scratch(Bdb, Cdb, data_folder)

Calculate centrality from scratch using Mash

drep.d_choose.choose_winners(Cdb, Gdb, **kwargs)

Make a scoring database and pick the winner of each cluster

Parameters:
  • Cdb – clustering database
  • Gdb – genome information database
Keyword Arguments:
 

See d_choose_wrapper

Returns:

[Sdb (scoring database), Wdb (winner database)]

Return type:

List

drep.d_choose.d_choose_wrapper(wd, **kwargs)

Controller for the dRep choose operation

Based off of the formula: A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)

A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight

Parameters:
  • wd (WorkDirectory) – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 
  • genomeInfo – .csv genomeInfo file
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
  • completeness_weight – see formula
  • contamination_weight – see formula
  • strain_heterogeneity_weight – see formula
  • N50_weight – see formula
  • size_weight – see formula
Returns:

Makes Sdb (scoreDb) in the workDirectory

drep.d_choose.load_extra_weight_table(loc, genomes, **kwargs)
Parameters:
  • loc – location of extra weight table
  • genomes – list of genomes you have in Cdb
  • **kwargs – nothing really
Returns:

dataframe with columns “genome” and “extra_weight”

drep.d_choose.pick_winners(Sdb, Cdb)

Based on clustering and scores, pick the best genome from every cluster

Parameters:
  • Sdb – score of every genome
  • Cdb – clustering
Returns:

Wdb (winner database)

Return type:

DataFrame
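
Picking the best genome per cluster can be sketched as a single pass over the cluster assignments. This is a simplified illustration, not dRep's code: dictionaries replace the Sdb and Cdb DataFrames, the function name pick_best_per_cluster and the cluster labels are hypothetical, and ties are resolved in favour of the genome seen first, which is an assumption.

```python
def pick_best_per_cluster(scores, clusters):
    """Map each cluster to its highest-scoring genome.

    scores:   genome name -> score
    clusters: genome name -> cluster label
    """
    best = {}
    for genome, cluster in clusters.items():
        score = scores[genome]
        # Replace the current winner only when this genome scores strictly higher.
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (genome, score)
    return {cluster: genome for cluster, (genome, _) in best.items()}


scores = {"a.fasta": 95.1, "b.fasta": 97.3, "c.fasta": 88.0}
clusters = {"a.fasta": "1_1", "b.fasta": "1_1", "c.fasta": "2_1"}

winners = pick_best_per_cluster(scores, clusters)
print(winners)
```
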

drep.d_choose.score_genomes(genomes, Gdb, Edb=None, **kwargs)

Calculate the scores for a list of genomes

Parameters:
  • genomes – list of genomes
  • Gdb – genome information database
Keyword Arguments:
 

See d_choose_wrapper

Returns:

Sdb (scoring database)

Return type:

DataFrame

drep.d_choose.score_row(row, extra=0, **kwargs)

Perform the scoring of a row based on kwargs

Parameters:

row – row of genome information

Keyword Arguments:
 
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • completeness_weight – see formula
  • contamination_weight – see formula
  • strain_heterogeneity_weight – see formula
  • N50_weight – see formula
  • size_weight – see formula
  • extra – extra weight to apply
Returns:

score

Return type:

float
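
The formula given under d_choose_wrapper can be written out directly. The sketch below is a hedged illustration rather than dRep's implementation: the function name score_sketch, the row keys "strain_heterogeneity" and "length", the default weights of 1, the use of base-10 logarithms, and adding extra to the total are all assumptions made for the example.

```python
import math


def score_sketch(row, extra=0, completeness_weight=1, contamination_weight=1,
                 strain_heterogeneity_weight=1, N50_weight=1, size_weight=1):
    """Apply the documented scoring formula to one genome (a dict here)."""
    return (
        completeness_weight * row["completeness"]
        - contamination_weight * row["contamination"]
        + strain_heterogeneity_weight
        * (row["contamination"] * (row["strain_heterogeneity"] / 100))
        + N50_weight * math.log10(row["N50"])      # assumption: log base 10
        + size_weight * math.log10(row["length"])  # assumption: log base 10
        + extra                                    # assumption: added to the total
    )


example_row = {
    "completeness": 90.0,
    "contamination": 2.0,
    "strain_heterogeneity": 50.0,
    "N50": 10_000,
    "length": 1_000_000,
}

# Illustrative weights only; check the dRep command-line help for real defaults.
example_score = score_sketch(
    example_row,
    completeness_weight=1,
    contamination_weight=5,
    strain_heterogeneity_weight=1,
    N50_weight=0.5,
    size_weight=0,
)
print(example_score)
```

With these inputs the terms are 90 - 10 + 1 + 2 + 0, which shows how strongly the contamination weight can dominate the other terms.
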

drep.d_analyze

d_analyze - a subset of drep

Make plots based on de-replication

drep.d_analyze.calc_dist(x1, y1, x2, y2)

Return the distance between two points

Args: self-explanatory

Returns:distance
Return type:int
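
Assuming "distance" means the ordinary Euclidean distance between (x1, y1) and (x2, y2), which the docstring does not state explicitly, the calculation can be sketched as follows. The function name euclidean_distance is hypothetical, and this sketch returns a float rather than the int listed above.

```python
import math


def euclidean_distance(x1, y1, x2, y2):
    """Straight-line distance between the points (x1, y1) and (x2, y2)."""
    return math.hypot(x2 - x1, y2 - y1)


example_distance = euclidean_distance(0, 0, 3, 4)
print(example_distance)
```
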
drep.d_analyze.cluster_test_wrapper(wd, **kwargs)

DEPRECATED

drep.d_analyze.d_analyze_wrapper(wd, **kwargs)

Controller for the dRep analyze operation

Parameters:
  • wd – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 

plots – List of plots to make [list of ints, 1-6]

Returns:

Makes some plots

drep.d_analyze.fancy_dendrogram(linkage, names, name2color=False, threshold=False, self_thresh=False)

Make a fancy dendrogram

drep.d_analyze.gen_color_dictionary(names, name2cluster)

Make the dictionary name2color

Parameters:
  • names – key in the returned dictionary
  • name2cluster – a dictionary mapping each name to its cluster
Returns:

name -> color

Return type:

dict

drep.d_analyze.gen_color_list(names, name2cluster)

Make a list of colors the same length as names, based on their cluster

drep.d_analyze.get_highest_self(db, genomes, min=0.0001)

Return the highest ANI value resulting from comparing a genome to itself

drep.d_analyze.mash_dendrogram_from_wd(wd, plot_dir=False)

From the wd and kwargs, call plot_MASH_dendrogram

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.normalize(df)

Normalize all columns in df to 0-1 except ‘genome’ or ‘location’

Parameters:df – DataFrame
Returns:Normalized
Return type:DataFrame
drep.d_analyze.plot_ANIn_vs_ANIn_cov(Ndb)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_ANIn_vs_len(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_MASH_dendrogram(Mdb, Cdb, linkage, threshold=False, plot_dir=False)

Make a dendrogram of the primary clustering

Parameters:
  • Mdb – DataFrame of Mash comparison results; make sure loaded not as categories
  • Cdb – DataFrame of Clustering results
  • linkage – Result of scipy.cluster.hierarchy.linkage
  • threshold (optional) – Line to plot on x-axis
  • plot_dir (optional) – Location to store plot
Returns:

Makes and shows plot

drep.d_analyze.plot_MASH_vs_ANIn_ani(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_ANIn_cov(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_len(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_secondary_ani(Mdb, Ndb, Cdb, exclude_zero_MASH=True)

Makes plot and returns plt.gcf()

All parameters are obvious

drep.d_analyze.plot_binscoring_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_winner_scoring_complex

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.plot_clustertest(linkage, names, wd, **kwargs)

DEPRECATED

names can be obtained like this: db = db.pivot("reference", "querry", "ani"); names = list(db.columns)

drep.d_analyze.plot_scatterplots(Mdb, Ndb, Cdb, plot_dir=False)

Make scatterplots comparing genome comparison algorithms

  • plot_MASH_vs_ANIn_ani(Mdb, Ndb) - Plot MASH_ani vs. ANIn_ani (including correlation)
  • plot_MASH_vs_ANIn_cov(Mdb, Ndb) - Plot MASH_ani vs. ANIn_cov (including correlation)
  • plot_ANIn_vs_ANIn_cov(Ndb) - Plot ANIn vs. ANIn_cov (including correlation)
  • plot_MASH_vs_len(Mdb, Ndb) - Plot MASH_ani vs. length_difference (including correlation)
  • plot_ANIn_vs_len(Mdb, Ndb) - Plot ANIn vs. length_difference (including correlation)
Parameters:
  • Mdb – DataFrame of Mash comparison results
  • Ndb – DataFrame of secondary clustering results
  • Cdb – DataFrame of Clustering results
  • plot_dir (optional) – Location to store plot
Returns:

Makes and shows plot

drep.d_analyze.plot_scatterplots_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_scatterplots

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.plot_secondary_dendrograms_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, make the secondary dendrograms

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Makes plot

drep.d_analyze.plot_secondary_mds_from_wd(wd, plot_dir, **kwargs)

Make a .pdf of MDS of each cluster

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Makes plot

drep.d_analyze.plot_winner_scoring_complex(Wdb, Sdb, Cdb, Gdb, plot_dir=False, **kwargs)

Make a plot showing the genome scoring for all genomes

Parameters:
  • Wdb – DataFrame of winning dereplicated genomes
  • Sdb – Scores of all genomes
  • Cdb – DataFrame of Clustering results
  • Gdb – DataFrame of genome scoring information
  • plot_dir (optional) – Location to store plot
Returns:

makes plot

drep.d_analyze.plot_winners(Wdb, Gdb, Wndb, Wmdb, Widb, plot_dir=False, **kwargs)

Make a bunch of plots about the de-replicated genomes

THIS REALLY NEEDS TO BE IMPROVED UPON

drep.d_analyze.plot_winners_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_winners

Parameters:
  • wd – WorkDirectory
  • plot_dir – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.WorkDirectory

This module provides access to the workDirectory

The directory layout:

workDirectory
./data
...../MASH_files/
...../ANIn_files/
...../gANI_files/
...../Clustering_files/
...../checkM/
........./genomes/
........./checkM_outdir/
...../prodigal/
./figures
./data_tables
...../Bdb.csv  # Sequence locations and filenames
...../Mdb.csv  # Raw results of MASH comparisons
...../Ndb.csv  # Raw results of ANIn comparisons
...../Cdb.csv  # Genomes and cluster designations
...../Chdb.csv # CheckM results for Bdb
...../Sdb.csv  # Scoring information
...../Wdb.csv  # Winning genomes
./dereplicated_genomes
./log
...../logger.log
...../cluster_arguments.json
class drep.WorkDirectory.WorkDirectory(location)

Bases: object

Object to interact with the workDirectory

Parameters:location (str) – location to make the workDirectory
firstLevels = ['data', 'figures', 'data_tables', 'dereplicated_genomes', 'log']
get_cluster(name)

Get the cluster passed in

Parameters:name – name of the cluster
Returns:cluster
get_db(name, return_none=True, forPlotting=False)

Get database from self.data_tables

Parameters:
  • name – name of dataframe
  • return_none – if True will return None if database not found; otherwise assert False
  • forPlotting – if True don’t do fancy dType loading; it messes with order of names for dendrograms
get_dir(dir)

Get the location of one of the named directory types

Parameters:dir – Name of directory to find
Returns:Location of requested directory
Return type:string
get_loc(what)

Get the location of Things

Parameters:what – string of what to get the location of
Returns:location of what
Return type:string
get_primary_linkage()

Get the primary linkage cluster

hasDb(db)

If db is in the data_tables, return True

import_arguments(loc)

Given the location of the log directory, load it

import_clusters(loc)

Given the location of the cluster files, load them

import_data_tables(loc)

Given the location of the datatables, load them

load_cached()

The wrapper to load everything it has into attributes

make_fileStructure()

Make the top level file structure
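
To make the layout concrete, the following sketch creates the five top-level folders named in firstLevels inside a temporary directory. It is an independent illustration using only the standard library; the function name make_top_level is hypothetical, and the real method may create additional subfolders.

```python
import os
import tempfile

# Folder names taken from the firstLevels attribute documented above.
FIRST_LEVELS = ["data", "figures", "data_tables", "dereplicated_genomes", "log"]


def make_top_level(location):
    """Create each top-level folder under location and return the paths."""
    created = []
    for name in FIRST_LEVELS:
        path = os.path.join(location, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as work_dir:
    make_top_level(work_dir)
    top_level = sorted(os.listdir(work_dir))

print(top_level)
```
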

store_db(db, name, overwrite=None)

Store a dataframe in the workDirectory

Will make a physical copy in the datatables folder

Parameters:
  • db – pandas dataframe to store
  • name – name to store it under (will add .csv automatically)
  • overwrite – if True, overwrite if DataFrame with same name already exists
store_special(name, thing)

Store special items in the work directory

Parameters:
  • name – what to store
  • thing – actual thing to store