dRep API

This allows you to call the internal methods of dRep using your own python program

drep.d_filter

d_filter - a subset of drep

Filter genomes based on genome length or quality. Also can run prodigal and checkM

drep.d_filter.calc_fasta_length(fasta_loc)

Calculate the length of the .fasta file and retun length

Parameters:fasta_loc – location of .fasta file
Returns:total length of all .fasta files
Return type:int
drep.d_filter.calc_genome_info(genomes: list)

Calculate the length and N50 of a list of genome locations

Parameters:genomes – list of locations of genomes
Returns:pandas dataframe with [“location”, “length”, “N50”, “genome”]
Return type:DataFrame
drep.d_filter.calc_n50(loc)

Calculate the N50 of a .fasta file

Parameters:fasta_loc – location of .fasta file.
Returns:N50 of .fasta file.
Return type:int
drep.d_filter.chdb_to_genomeInfo(chdb)

Convert the output of checkM (chdb) into genomeInfo

Parameters:chdb – dataframe of checkM
Returns:genomeInfo
Return type:DataFrame
drep.d_filter.d_filter_wrapper(wd, **kwargs)

Controller for the dRep filter operation

Parameters:
  • wd (WorkDirectory) – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 
  • genomes – genomes to filter in .fasta format
  • genomeInfo – location of .csv file with the columns: [“genome”(basename of .fasta file of that genome), “completeness”(0-100 value for completeness of the genome), “contamination”(0-100 value of the contamination of the genome)]
  • processors – Threads to use with checkM / prodigal
  • overwrite – Overwrite existing data in the work folder
  • debug – If True, make extra output when running external scripts
  • length – minimum genome length when filtering
  • completeness – minimum genome completeness when filtering
  • contamination – maximum genome contamination when filtering
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
Returns:

stores Bdb.csv, Chdb.csv, and GenomeInfo.csv in the work directory

Return type:

Nothing

drep.d_filter.filter_bdb(bdb, Gdb, **kwargs)

Filter bdb based on Gdb

Parameters:
  • bdb – DataFrame with [“genome”]
  • Gdb – DataFrame with [“genome”, “completeness”, “contamination”]
Keyword Arguments:
 
  • min_comp – Minimum genome completeness (%)
  • max_con – Maximum genome contamination (%)
Returns:

bdb filtered based on completeness and contamination

Return type:

DataFrame

drep.d_filter.run_checkM(genome_folder, checkm_outf, **kwargs)

Run checkM

WARNING- this will result in wrong genome lenth and genome N50 estimate, due to it being run on prodigal output

Parameters:
  • genome_folder – location of folder to run checkM on - should be full of files ending in .faa (result of prodigal)
  • checkm_outf – location of folder to store checkM output
Keyword Arguments:
 
  • processors – number of threads
  • checkm_method – either lineage_wf or taxonomy_wf
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
  • set_recursion – if not 0, set the python recursion
drep.d_filter.run_prodigal(genome_list, out_dir, **kwargs)

Run prodigal on a set of genomes, store the output in the out_dir

Parameters:
  • genome_list – list of genomes to run prodigal on
  • out_dir – output directory to store prodigal output
Keyword Arguments:
 
  • processors – number of processors to multithread with
  • exe_loc – location of prodigal excutible (will try and find with shutil if not provided)
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
drep.d_filter.validate_chdb(Chdb, bdb)

Make sure all genomes in bdb are in Chdb

Parameters:
  • Chdb – dataframe of checkM information
  • bdb – dataframe with [‘genome’]

drep.d_cluster

d_cluster - a subset of dRep

Clusters a list of genomes with both primary and secondary clustering

drep.d_cluster.add_avani(db)

add a column titled ‘av_ani’ to the passed in dataframe

dataframe must have rows reference, querey, and ani

Parameters:db – dataframe
drep.d_cluster.all_vs_all_MASH(Bdb, data_folder, **kwargs)

Run MASH pairwise within all samples in Bdb

Parameters:
  • Bdb – dataframe with genome, location
  • data_folder – location to store output files
Keyword Arguments:
 
  • MASH_sketch – size of mash sketches
  • dry – dont actually run anything
  • processors – number of processors to multithread with
  • mash_exe – location of mash excutible (will try and find with shutil if not provided)
  • groupSize – max number of mash sketches to hold in each folder
  • debug – if True, log all of the commands
  • wd – if you want to log commands, you also need the wd
drep.d_cluster.cluster_genomes(genome_list, data_folder, **kwargs)

Clusters a set of genomes using the dRep primary and secondary clustering method

Takes a number of command line arguments and returns a couple pandas dataframes. Done in a number of steps

Parameters:
  • genomes – list of genomes to be clustered
  • data_folder – location where MASH and ANIn data will be stored
Keyword Arguments:
 
  • processors – Threads to use with checkM / prodigal
  • overwrite – Overwrite existing data in the work folder
  • debug – If True, make extra output when running external scripts
  • MASH_sketch – MASH sketch size
  • S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn}
  • n_PRESET – Presets to pass to nucmer {normal, tight}
  • P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9)
  • S_ANI – ANI threshold to form secondary clusters (default: 0.99)
  • SkipMash – Skip MASH clustering, just do secondary clustering on all genomes
  • SkipSecondary – Skip secondary clustering, just perform MASH clustering
  • COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1)
  • coverage_method – Method to calculate coverage of an alignment {total,larger}
  • CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average)
  • n_c – nucmer argument c
  • n_maxgap – nucmer argument maxgap
  • n_noextend – nucmer argument noextend
  • n_method – nucmer argument method
  • n_preset – preset nucmer arrangements: {tight,normal}
  • wd – workDirectory (needed to store clustering results and run prodigal)
Returns:

[Mdb(db of primary clustering), Ndb(db of secondary clustering, Cdb(clustering information))]

Return type:

list

drep.d_cluster.cluster_hierarchical(db, linkage_method='single', linkage_cutoff=0.1)

Perform hierarchical clustering on a symmetrical distiance matrix

Parameters:
  • db – result of db.pivot usually
  • linkage_method – passed to scipy.cluster.hierarchy.fcluster
  • linkage_cutoff – distance to draw the clustering line (default = .1)
Returns:

[Cdb, linkage]

Return type:

list

drep.d_cluster.cluster_mash_database(db, **kwargs)

From a Mash database, cluster and return Cdb

Parameters:

db – Mdb (all_vs_all Mash results)

Keyword Arguments:
 
  • clusterAlg – how to cluster database (default = single)
  • P_ani – threshold to cluster at (default = 0.9)
Returns:

[Cdb, [linkage, linkage_db, arguments]]

Return type:

list

drep.d_cluster.compare_genomes(bdb, algorithm, data_folder, **kwargs)

Compare a list of genomes using the algorithm specified

This method takes in bdb (a table with the columns location and genome), runs pair-wise comparisons between all genomes in the sample, and returns a table with at least the columns ‘reference’, ‘querry’, ‘ani’,’coverage’, depending on what algorithm is called

Parameters:
  • bdb – DataFrame with [‘genome’, ‘location’] (drep.d_filter.load_genomes)
  • algorithm – options are ANImf, ANIn, gANI
  • data_folder – location to store output files
Keyword Arguments:
 
  • wd – either this or prod_folder needed for gANI
  • prod_folder – either this or wd needed for gANI
Returns:

Ndb ([‘reference’, ‘querry’, ‘ani’,’coverage’])

Return type:

DataFrame

drep.d_cluster.d_cluster_wrapper(workDirectory, **kwargs)

Controller for the dRep cluster operation

Validates arguments, calls cluster_genomes, then stores the output

Parameters:
  • wd (WorkDirectory) – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 
  • processors – Threads to use with checkM / prodigal
  • overwrite – Overwrite existing data in the work folder
  • debug – If True, make extra output when running external scripts
  • MASH_sketch – MASH sketch size
  • S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn}
  • n_PRESET – Presets to pass to nucmer {normal, tight}
  • P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9)
  • S_ANI – ANI threshold to form secondary clusters (default: 0.99)
  • SkipMash – Skip MASH clustering, just do secondary clustering on all genomes
  • SkipSecondary – Skip secondary clustering, just perform MASH clustering
  • COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1)
  • coverage_method – Method to calculate coverage of an alignment {total,larger}
  • CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average)
  • genomes – genomes to cluster in .fasta format. Not necessary if already loaded sequences with the “filter” operation
Returns:

Stores Cdb, Mdb, Ndb, Bdb

drep.d_cluster.estimate_time(comps, alg)

Estimate time, in minutes, based on comparison algorithm and number of comparisons

Parameters:
  • comps – number of genomes comparisons to perform
  • alg – algorthm used
Returns:

time to perfom comparison (in minutes)

Return type:

float

drep.d_cluster.gen_animf_cmd(prefix, ref, querry, **kwargs)

return animf command. It will be a single string, with the format: “nucmer cmd ; filter cmd”

Parameters:
  • prefix – desired file output name
  • ref – location of reference genome
  • querry – location of querry genome
Keyword Arguments:
 

all – passed on to gen_nucmer_cmd

Returns:

format is “nucmer cmd ; filter cmd”

Return type:

string

drep.d_cluster.gen_filter_cmd(delta, out)

return delta-filter command

Parameters:
  • delta – desired .delta file to filter
  • out – desired output file
Returns:

cmd to run

Return type:

list

drep.d_cluster.gen_gANI_cmd(file, g1, g2, exe)

Generate a command to run gANI (ANIcalculator)

Will return [] if g1 == g2

Parameters:
  • file – output file name
  • g1 – location of the genes of genome1
  • g2 – location of the genes of genome2
  • exe – location of gANI executible
Returns:

command to run

Return type:

list

drep.d_cluster.gen_goANI_cmd(file, g1, g2, exe)

Generate a command to run goANI

Will return [] if g1 == g2

Parameters:
  • file – output file name
  • g1 – location of the genes of genome1
  • g2 – location of the genes of genome2
  • exe – location of nsimscan executible
Returns:

command to run

Return type:

list

drep.d_cluster.gen_nucmer_cmd(prefix, ref, querry, c='65', noextend=False, maxgap='90', method='mum')

Generate command to run with nucmer

Parameters:
  • prefix – desired output name (will have .delta appended)
  • ref – location of reference genome
  • querry – location of querry genomes
Keyword Arguments:
 
  • c – c value
  • noextend – either True or False
  • maxgap – maxgap
  • method – detault is ‘mum’
Returns:

command to run number

Return type:

list

drep.d_cluster.genome_hierarchical_clustering(Ndb, **kwargs)

Cluster ANI database

Parameters:

Ndb – result of secondary clustering

Keyword Arguments:
 
  • clusterAlg – how to cluster the database (default = single)
  • S_ani – thershold to cluster at (default = .99)
  • cov_thresh – minumum coverage to be included in clustering (default = .5)
  • cluster – name of the cluster
  • comp_method – comparison algorithm used
Returns:

[Cdb, {cluster:[linkage, linkage_db, arguments]}]

Return type:

list

drep.d_cluster.iteratre_clusters(Bdb, Cdb, id='primary_cluster')

An iterator: Given Bdb and Cdb, yeild smaller Bdb’s in the same cluster

Parameters:
  • Bdb – [genome, location]
  • Cdb – [genome, id]
  • id – what to iterate on (default = ‘primary_cluster’)
Returns:

[d(subset of b), cluster(name of cluster)]

Return type:

list

drep.d_cluster.load_genomes(genome_list)

Takes a list of genome locations, returns a pandas dataframe with the locations and the basename of the genomes

Parameters:genome_list – list of genome locations
Returns:pandas dataframe with columns [‘genome’, ‘location’]
Return type:DataFrame
drep.d_cluster.make_linkage_Ndb(Ndb, **kwargs)

Filter the Ndb in accordance with the kwargs. Average reciprical ANI values

Parameters:Ndb – result of secondary clutsering
Keyword Arguments:
 cov_thresh – minimum coverage threshold (default = 0.5)
Returns:pivoted DataFrame ready for clustering
Return type:DataFrame
drep.d_cluster.parse_delta(filename)

Parse a .delta file from nucmer

Parameters:filename – location of .delta file
Returns:[alignment_length, similarity_errors]
Return type:list
drep.d_cluster.parse_gani_file(file)

Parse gANI file, return dictionary of results

Parameters:file – location of gANI file
Returns:results in the gANI file
Return type:dict
drep.d_cluster.parse_nsim_file(file)

Parse nsim file, return dictionary of results and gene datatable

Parameters:file – location of nsimscan file
Returns:results in the gANI file db: gene-based alignment results
Return type:dict
drep.d_cluster.process_deltafiles(deltafiles, org_lengths, logger=None, **kwargs)

Parse a list of delta files into a pandas dataframe

Files NEED to be named in the format genome1_vs_genome2.delta, or genome1_vs_genome2.filtered.delta

Parameters:
  • deltafiles – list of .delta files
  • org_lengths – dictionary of genome to genome length
Keyword Arguments:
 
  • coverage_method – default is larger
  • logger – if not None, will log 0 errors
Returns:

information about the alignments

Return type:

DataFrame

drep.d_cluster.process_gani_files(files)

From a list of gANI output files, return a parsed DataFrame

Parameters:list – files
Returns:Ndb
Return type:DataFrame
drep.d_cluster.process_goani_files(files)

From a list of nsimscan files, return a parsed DataFrame like normal, and a special one for calculating dn/ds

Parameters:list – files
Returns:Ndb
Return type:DataFrame
drep.d_cluster.run_pairwise_ANImf(genome_list, ANIn_folder, **kwargs)

Given a list of genomes and an output folder, compare all genomes using ANImf

Parameters:
  • genome_list – list of locations of genome files
  • ANIn_folder – folder to store the output of comparison
Keyword Arguments:
 
  • processors – threads to use
  • debug – if true save extra output
  • wd – needed if debug is True
drep.d_cluster.run_pairwise_ANIn(genome_list, ANIn_folder, **kwargs)

Given a list of genomes and an output folder, compare all genomes using ANImf

Parameters:
  • genome_list – list of locations of genome files
  • ANIn_folder – folder to store the output of comparison
Keyword Arguments:
 
  • processors – threads to use
  • debug – if true save extra output
  • wd – needed if debug is True
drep.d_cluster.run_pairwise_gANI(bdb, gANI_folder, prod_folder, **kwargs)

Run pairwise gANI on a list of Genomes

Parameters:
  • bdb – DataFrame with [‘genome’, ‘location’]
  • gANI_folder – folder to store gANI output
  • prod_folder – folder containing prodigal output from genomes (will run if needed)
Keyword Arguments:
 
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
  • processors – threads to use
Returns:

Ndb for gANI

Return type:

DataFrame

drep.d_cluster.run_pairwise_goANI(bdb, goANI_folder, prod_folder, **kwargs)

Run pairwise goANI on a list of Genomes

Parameters:
  • bdb – DataFrame with [‘genome’, ‘location’]
  • goANI_folder – folder to store gANI output
  • prod_folder – folder containing prodigal output from genomes (will run if needed)
Keyword Arguments:
 
  • debug – log all of the commands
  • wd – if you want to log commands, you also need the wd
  • processors – threads to use
Returns:

Ndb for gANI

Return type:

DataFrame

drep.d_choose

d_choose - a subset of drep

Choose best genome from each cluster

drep.d_choose.choose_winners(Cdb, Gdb, **kwargs)

Make a scoring database and pick the winner of each cluster

Parameters:
  • Cdb – clustering database
  • Gdb – genome information database
Keyword Arguments:
 

wrapper (See) –

Returns:

[Sdb (scoring database), Wdb (winner database)]

Return type:

List

drep.d_choose.d_choose_wrapper(wd, **kwargs)

Controller for the dRep choose operation

Based off of the formula: A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)

A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight

Parameters:
  • wd (WorkDirectory) – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 
  • genomeInfo – .csv genomeInfo file
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
  • completeness_weight – see formula
  • contamination_weight – see formula
  • strain_heterogeneity_weight – see formula
  • N50_weight – see formula
  • size_weight – see formula
Returns:

Makes Sdb (scoreDb) in the workDirectory

drep.d_choose.pick_winners(Sdb, Cdb)

Based on clustering and scores, pick the best genome from every cluster

Parameters:
  • Sdb – score of every genome
  • Cdb – clustering
Returns:

Wdb (winner database)

Return type:

DataFrame

drep.d_choose.score_genomes(genomes, Gdb, **kwargs)

Calculate the scores for a list of genomes

Parameters:
  • genomes – list of genomes
  • Gdb – genome information database
Keyword Arguments:
 

wrapper (See) –

Returns:

Sdb (scoring database)

Return type:

DataFrame

drep.d_choose.score_row(row, **kwargs)

Perform the scoring of a row based on kwargs

Parameters:

row – row of genome information

Keyword Arguments:
 
  • ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
  • completeness_weight – see formula
  • contamination_weight – see formula
  • strain_heterogeneity_weight – see formula
  • N50_weight – see formula
  • size_weight – see formula
Returns:

score

Return type:

float

drep.d_analyze

d_analyze - a subset of drep

Make plots based on de-replication

drep.d_analyze.calc_dist(x1, y1, x2, y2)

Return distance from two points

Args: self explainatory

Returns:distance
Return type:int
drep.d_analyze.cluster_test_wrapper(wd, **kwargs)

DEPRICATED

drep.d_analyze.d_analyze_wrapper(wd, **kwargs)

Controller for the dRep analyze operation

Parameters:
  • wd – The current workDirectory
  • **kwargs – Command line arguments
Keyword Arguments:
 

plots – List of plots to make [list of ints, 1-6]

Returns:

Makes some plots

drep.d_analyze.fancy_dendrogram(linkage, names, name2color=False, threshold=False, self_thresh=False)

Make a fancy dendrogram

drep.d_analyze.gen_color_dictionary(names, name2cluster)

Make the dictionary name2color

Parameters:
  • names – key in the returned dictionary
  • name2cluster – a dictionary of name to it’s cluster
Returns:

name -> color

Return type:

dict

drep.d_analyze.gen_color_list(names, name2cluster)

Make a list of colors the same length as names, based on their cluster

drep.d_analyze.get_highest_self(db, genomes, min=0.0001)

Return the highest ANI value resulting from comparing a genome to itself

drep.d_analyze.mash_dendrogram_from_wd(wd, plot_dir=False)

From the wd and kwargs, call plot_MASH_dendrogram

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.normalize(df)

Normalize all columns in df to 0-1 except ‘genome’ or ‘location’

Parameters:df – DataFrame
Returns:Nomralized
Return type:DataFrame
drep.d_analyze.plot_ANIn_vs_ANIn_cov(Ndb)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_ANIn_vs_len(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_MASH_dendrogram(Mdb, Cdb, linkage, threshold=False, plot_dir=False)

Make a dendrogram of the primary clustering

Parameters:
  • Mdb – DataFrame of Mash comparison results; make sure loaded not as categories
  • Cdb – DataFrame of Clustering results
  • linkage – Result of scipy.cluster.hierarchy.linkage
  • threshold (optional) – Line to plot on x-axis
  • plot_dir (optional) – Location to store plot
Returns:

Makes and shows plot

drep.d_analyze.plot_MASH_vs_ANIn_ani(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_ANIn_cov(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_len(Mdb, Ndb, exclude_zero_MASH=True)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_MASH_vs_secondary_ani(Mdb, Ndb, Cdb, exclude_zero_MASH=True)

Makes plot and retuns plt.cgf()

All parameters are obvious

drep.d_analyze.plot_binscoring_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_winner_scoring_complex

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.plot_clustertest(linkage, names, wd, **kwargs)

DEPREICATED

names can be gotten like: db = db.pivot(“reference”,”querry”,”ani”) names = list(db.columns)

drep.d_analyze.plot_scatterplots(Mdb, Ndb, Cdb, plot_dir=False)

Make scatterplots comparing genome comparison algorithms

  • plot_MASH_vs_ANIn_ani(Mdb, Ndb) - Plot MASH_ani vs. ANIn_ani (including correlation)
  • plot_MASH_vs_ANIn_cov(Mdb, Ndb) - Plot MASH_ani vs. ANIn_cov (including correlation)
  • plot_ANIn_vs_ANIn_cov(Mdb, Ndb) - Plot ANIn vs. ANIn_cov (including correlation)
  • plot_MASH_vs_len(Mdb, Ndb) - Plot MASH_ani vs. length_difference (including correlation)
  • plot_ANIn_vs_len(Ndb) - Plot ANIn vs. length_difference (including correlation)
Parameters:
  • Mdb – DataFrame of Mash comparison results
  • Ndb – DataFrame of secondary clustering results
  • Cdb – DataFrame of Clustering results
  • plot_dir (optional) – Location to store plot
Returns:

Makes and shows plot

drep.d_analyze.plot_scatterplots_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_scatterplots

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.d_analyze.plot_secondary_dendrograms_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, make the secondary dendrograms

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Makes plot

drep.d_analyze.plot_secondary_mds_from_wd(wd, plot_dir, **kwargs)

Make a .pdf of MDS of each cluster

Parameters:
  • wd – WorkDirectory
  • plot_dir (optional) – Location to store figure
Returns:

Makes plot

drep.d_analyze.plot_winner_scoring_complex(Wdb, Sdb, Cdb, Gdb, plot_dir=False, **kwargs)

Make a plot showing the genome scoring for all genomes

Parameters:
  • Wdb – DataFrame of winning dereplicated genomes
  • Sdb – Scores of all genomes
  • Cdb – DataFrame of Clustering results
  • Gdb – DataFrame of genome scoring information
  • plot_dir (optional) – Location to store plot
Returns:

makes plot

drep.d_analyze.plot_winners(Wdb, Gdb, Wndb, Wmdb, Widb, plot_dir=False, **kwargs)

Make a bunch of plots about the de-replicated genomes

THIS REALLY NEEDS IMPROVED UPON

drep.d_analyze.plot_winners_from_wd(wd, plot_dir, **kwargs)

From the wd and kwargs, call plot_winners

Parameters:
  • wd – WorkDirectory
  • plot_dir – Location to store figure
Returns:

Shows plot, makes a plot in the plot_dir

drep.WorkDirectory

This module provides access to the workDirectory

The directory layout:

workDirectory
./data
...../MASH_files/
...../ANIn_files/
...../gANI_files/
...../Clustering_files/
...../checkM/
........./genomes/
........./checkM_outdir/
...../prodigal/
./figures
./data_tables
...../Bdb.csv  # Sequence locations and filenames
...../Mdb.csv  # Raw results of MASH comparisons
...../Ndb.csv  # Raw results of ANIn comparisons
...../Cdb.csv  # Genomes and cluster designations
...../Chdb.csv # CheckM results for Bdb
...../Sdb.csv  # Scoring information
...../Wdb.csv  # Winning genomes
./dereplicated_genomes
./log
...../logger.log
...../cluster_arguments.json
class drep.WorkDirectory.WorkDirectory(location)

Bases: object

Object to interact with the workDirectory

Parameters:location (str) – location to make the workDirectory
firstLevels = ['data', 'figures', 'data_tables', 'dereplicated_genomes', 'log']
get_cluster(name)

Get the cluster passed in

Parameters:name – name of the cluster
Returns:cluster
get_db(name, return_none=True, forPlotting=False)

Get database from self.data_tables

Parameters:
  • name – name of dataframe
  • return_none – if True will return None if database not found; otherwise assert False
  • forPlotting – if True don’t do fancy dType loading; it messes with order of names for dendrograms
get_dir(dir)

Get the location of one of the named directory types

Parameters:dir – Name of directory to find
Returns:Location of requested directory
Return type:string
get_loc(what)

Get the location of Things

Parameters:what – string of what to get the location of
Returns:location of what
Return type:string
get_primary_linkage()

Get the primary linkage cluster

hasDb(db)

If db is in the data_tables, return True

import_arguments(loc)

Given the location of the log directory, load it

import_clusters(loc)

Given the location of the cluster files, load them

import_data_tables(loc)

Given the location of the datatables, load them

load_cached()

The wrapper to load everything it has into attributes

make_fileStructure()

Make the top level file structure

store_db(db, name, overwrite=None)

Store a dataframe in the workDirectory

Will make a physical copy in the datatables folder

Parameters:
  • db – pandas dataframe to store
  • name – name to store it under (will add .csv automatically)
  • overwrite – if True, overwrite if DataFrame with same name already exists
store_special(name, thing)

Store special items in the work directory

Parameters:
  • name – what to store
  • thing – actual thing to store