dRep API¶
This allows you to call the internal methods of dRep from your own Python program
drep.d_filter¶
d_filter - a subset of drep
Filter genomes based on genome length or quality. Can also run prodigal and checkM
-
drep.d_filter.
calc_fasta_length
(fasta_loc)¶ Calculate the total sequence length of a .fasta file and return it
Parameters: fasta_loc – location of .fasta file Returns: total length of all sequences in the .fasta file Return type: int
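To make the idea of "total length" concrete, here is a minimal standalone sketch that sums the sequence characters of every record in a FASTA stream. It does not call dRep; the function name `fasta_total_length` is hypothetical, and dRep's own implementation may differ.

```python
import io


def fasta_total_length(handle):
    """Sum the number of sequence characters across all records in a FASTA stream."""
    total = 0
    for line in handle:
        line = line.strip()
        # Header lines start with '>' and blank lines carry no sequence
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


# Two records: 8 + 4 bases in the first, 4 bases in the second
example = io.StringIO(">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n")
print(fasta_total_length(example))  # 16
```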
-
drep.d_filter.
calc_genome_info
(genomes: list)¶ Calculate the length and N50 of a list of genome locations
Parameters: genomes – list of locations of genomes Returns: pandas dataframe with [“location”, “length”, “N50”, “genome”] Return type: DataFrame
-
drep.d_filter.
calc_n50
(loc)¶ Calculate the N50 of a .fasta file
Parameters: loc – location of .fasta file. Returns: N50 of .fasta file. Return type: int
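For readers unfamiliar with the statistic, N50 is the contig length at which the contigs of that length or longer cover at least half of the assembly. The sketch below computes it from a plain list of contig lengths; the function name `n50` and the list input are assumptions for illustration, not dRep's interface.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total past half is the N50
        if running >= half:
            return length
    return 0


# Total is 280, so half is 140; 100 + 80 = 180 crosses it at the 80 bp contig
print(n50([100, 80, 50, 30, 20]))  # 80
```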
-
drep.d_filter.
chdb_to_genomeInfo
(chdb)¶ Convert the output of checkM (chdb) into genomeInfo
Parameters: chdb – dataframe of checkM Returns: genomeInfo Return type: DataFrame
-
drep.d_filter.
d_filter_wrapper
(wd, **kwargs)¶ Controller for the dRep filter operation
Parameters: - wd (WorkDirectory) – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: - genomes – genomes to filter in .fasta format
- genomeInfo – location of .csv file with the columns: [“genome”(basename of .fasta file of that genome), “completeness”(0-100 value for completeness of the genome), “contamination”(0-100 value of the contamination of the genome)]
- processors – Threads to use with checkM / prodigal
- overwrite – Overwrite existing data in the work folder
- debug – If True, make extra output when running external scripts
- length – minimum genome length when filtering
- completeness – minimum genome completeness when filtering
- contamination – maximum genome contamination when filtering
- ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
- checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
Returns: stores Bdb.csv, Chdb.csv, and GenomeInfo.csv in the work directory
Return type: Nothing
-
drep.d_filter.
filter_bdb
(bdb, Gdb, **kwargs)¶ Filter bdb based on Gdb
Parameters: - bdb – DataFrame with [“genome”]
- Gdb – DataFrame with [“genome”, “completeness”, “contamination”]
Keyword Arguments: - min_comp – Minimum genome completeness (%)
- max_con – Maximum genome contamination (%)
Returns: bdb filtered based on completeness and contamination
Return type: DataFrame
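The filtering rule described above can be sketched without pandas, using a list of dictionaries in place of the DataFrames. The helper name `filter_by_quality`, the inclusive comparisons, and the threshold values are assumptions made for illustration.

```python
def filter_by_quality(genome_rows, min_comp, max_con):
    """Keep rows whose completeness and contamination pass both thresholds."""
    return [
        row for row in genome_rows
        if row["completeness"] >= min_comp and row["contamination"] <= max_con
    ]


# Stand-in for Gdb: one row per genome
gdb = [
    {"genome": "a.fasta", "completeness": 98.0, "contamination": 1.2},
    {"genome": "b.fasta", "completeness": 60.0, "contamination": 0.5},
    {"genome": "c.fasta", "completeness": 95.0, "contamination": 30.0},
]

# Illustrative thresholds only
kept = filter_by_quality(gdb, min_comp=75, max_con=25)
print([row["genome"] for row in kept])  # ['a.fasta']
```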
-
drep.d_filter.
run_checkM
(genome_folder_whole, checkm_outf_whole, **kwargs)¶ Run checkM
WARNING: this will result in wrong genome length and genome N50 estimates, because checkM is run on prodigal output
Parameters: - genome_folder_whole – location of folder to run checkM on - should be full of files ending in .faa (result of prodigal)
- checkm_outf_whole – location of folder to store checkM output
Keyword Arguments: - processors – number of threads
- checkm_method – either lineage_wf or taxonomy_wf
- debug – log all of the commands
- wd – if you want to log commands, you also need the wd
- set_recursion – if not 0, set the python recursion
-
drep.d_filter.
run_prodigal
(genome_list, out_dir, **kwargs)¶ Run prodigal on a set of genomes, store the output in the out_dir
Parameters: - genome_list – list of genomes to run prodigal on
- out_dir – output directory to store prodigal output
Keyword Arguments: - processors – number of processors to multithread with
- exe_loc – location of prodigal executable (will try to find it with shutil if not provided)
- debug – log all of the commands
- wd – if you want to log commands, you also need the wd
-
drep.d_filter.
sanity_check
(bdb, **kwargs)¶ Make sure there are no duplicate names or anything
Parameters: bdb – Returns:
-
drep.d_filter.
validate_chdb
(Chdb, bdb)¶ Make sure all genomes in bdb are in Chdb
Parameters: - Chdb – dataframe of checkM information
- bdb – dataframe with [‘genome’]
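The check described for validate_chdb amounts to a set difference between two genome lists. A minimal sketch, assuming plain lists of genome names instead of DataFrames; the helper name `missing_from_chdb` is hypothetical.

```python
def missing_from_chdb(bdb_genomes, chdb_genomes):
    """Return the genomes listed in bdb that have no entry in Chdb, sorted by name."""
    return sorted(set(bdb_genomes) - set(chdb_genomes))


missing = missing_from_chdb(
    ["a.fasta", "b.fasta", "c.fasta"],  # genomes in bdb
    ["a.fasta", "c.fasta"],             # genomes with checkM results
)
print(missing)  # ['b.fasta']
```

An empty result means every genome in bdb has checkM information.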
drep.d_cluster¶
drep.d_choose¶
d_choose - a subset of drep
Choose best genome from each cluster
-
drep.d_choose.
add_centrality
(wd, Gdb, **kwargs)¶ Add a column named “centrality” to genome info
-
drep.d_choose.
calc_centrality_from_scratch
(Bdb, Cdb, data_folder)¶ Calculate centrality from scratch using Mash
-
drep.d_choose.
choose_winners
(Cdb, Gdb, **kwargs)¶ Make a scoring database and pick the winner of each cluster
Parameters: - Cdb – clustering database
- Gdb – genome information database
Keyword Arguments: see d_choose_wrapper
Returns: [Sdb (scoring database), Wdb (winner database)]
Return type: List
-
drep.d_choose.
d_choose_wrapper
(wd, **kwargs)¶ Controller for the dRep choose operation
Based off of the formula: A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)
A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight
Parameters: - wd (WorkDirectory) – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: - genomeInfo – .csv genomeInfo file
- ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
- checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
- completeness_weight – see formula
- contamination_weight – see formula
- strain_heterogeneity_weight – see formula
- N50_weight – see formula
- size_weight – see formula
Returns: Makes Sdb (scoreDb) in the workDirectory
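The scoring formula above can be written out as a small standalone function. This is a sketch, not dRep's code: the function name `score_genome` is hypothetical, the logarithm is assumed to be base 10 (the formula only says "log"), and the weights in the example are illustrative values.

```python
import math


def score_genome(completeness, contamination, strain_heterogeneity, n50, size,
                 A, B, C, D, E):
    """Evaluate the scoring formula; A-E are the five weights from the formula."""
    return (
        A * completeness
        - B * contamination
        + C * (contamination * (strain_heterogeneity / 100))
        + D * math.log10(n50)   # assumption: base-10 logarithm
        + E * math.log10(size)  # assumption: base-10 logarithm
    )


# 1*90 - 5*2 + 1*(2 * 0.5) + 0.5*log10(10000) + 0*log10(1000000)
# = 90 - 10 + 1 + 2 + 0
score = score_genome(90, 2, 50, 10_000, 1_000_000, A=1, B=5, C=1, D=0.5, E=0)
print(score)  # 83.0
```

Note that contamination enters twice: it is penalized through B, and part of that penalty is returned through C when the contamination is attributed to strain heterogeneity.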
-
drep.d_choose.
load_extra_weight_table
(loc, genomes, **kwargs)¶ Parameters: - loc – location of extra weight table
- genomes – list of genomes you have in Cdb
- **kwargs – nothing really
Returns: dataframe with columns “genome” and “extra_weight”
-
drep.d_choose.
pick_winners
(Sdb, Cdb)¶ Based on clustering and scores, pick the best genome from every cluster
Parameters: - Sdb – score of every genome
- Cdb – clustering
Returns: Wdb (winner database)
Return type: DataFrame
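Picking a winner per cluster is a group-wise maximum over the scores. The sketch below uses two dictionaries in place of Sdb and Cdb; the function name `pick_best_per_cluster`, the cluster labels, and the tie-breaking rule (first genome seen wins) are assumptions for illustration.

```python
def pick_best_per_cluster(scores, clusters):
    """scores: genome -> score; clusters: genome -> cluster label.

    Returns a dict mapping each cluster label to its highest-scoring genome.
    """
    winners = {}
    for genome, cluster in clusters.items():
        best = winners.get(cluster)
        # Strict '>' means the first genome seen keeps the win on a tie
        if best is None or scores[genome] > scores[best]:
            winners[cluster] = genome
    return winners


scores = {"a": 83.0, "b": 79.5, "c": 91.2}
clusters = {"a": "1_1", "b": "1_1", "c": "2_1"}
print(pick_best_per_cluster(scores, clusters))  # {'1_1': 'a', '2_1': 'c'}
```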
-
drep.d_choose.
score_genomes
(genomes, Gdb, Edb=None, **kwargs)¶ Calculate the scores for a list of genomes
Parameters: - genomes – list of genomes
- Gdb – genome information database
Keyword Arguments: see d_choose_wrapper
Returns: Sdb (scoring database)
Return type: DataFrame
-
drep.d_choose.
score_row
(row, extra=0, **kwargs)¶ Perform the scoring of a row based on kwargs
Parameters: row – row of genome information
Keyword Arguments: - ignoreGenomeQuality – Don’t run checkM or do any quality-based filtering (not recommended)
- completeness_weight – see formula
- contamination_weight – see formula
- strain_heterogeneity_weight – see formula
- N50_weight – see formula
- size_weight – see formula
- extra – extra weight to apply
Returns: score
Return type: float
drep.d_analyze¶
d_analyze - a subset of drep
Make plots based on de-replication
-
drep.d_analyze.
calc_dist
(x1, y1, x2, y2)¶ Return the distance between two points
Args: self-explanatory
Returns: distance Return type: int
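A distance between two points in the plane is usually the Euclidean one; the sketch below assumes that interpretation, which the description does not state explicitly. The function name `euclidean_distance` is hypothetical.

```python
import math


def euclidean_distance(x1, y1, x2, y2):
    """Straight-line distance between (x1, y1) and (x2, y2)."""
    return math.hypot(x2 - x1, y2 - y1)


# Classic 3-4-5 right triangle
print(euclidean_distance(0, 0, 3, 4))  # 5.0
```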
-
drep.d_analyze.
cluster_test_wrapper
(wd, **kwargs)¶ DEPRECATED
-
drep.d_analyze.
d_analyze_wrapper
(wd, **kwargs)¶ Controller for the dRep analyze operation
Parameters: - wd – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: plots – List of plots to make [list of ints, 1-6]
Returns: Makes some plots
-
drep.d_analyze.
fancy_dendrogram
(linkage, names, name2color=False, threshold=False, self_thresh=False)¶ Make a fancy dendrogram
-
drep.d_analyze.
gen_color_dictionary
(names, name2cluster)¶ Make the dictionary name2color
Parameters: - names – key in the returned dictionary
- name2cluster – a dictionary of name to its cluster
Returns: name -> color
Return type: dict
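One way to build such a name-to-color mapping is to give each cluster an evenly spaced hue and let every name inherit the color of its cluster. This is only a sketch of the idea using the standard library; the function name `color_by_cluster` and the color scheme are assumptions, and dRep may choose colors differently.

```python
import colorsys


def color_by_cluster(names, name2cluster):
    """Map each name to an RGB tuple shared by all names in the same cluster."""
    clusters = sorted(set(name2cluster[name] for name in names))
    # Spread hues evenly around the color wheel, one per cluster
    cluster2color = {
        cluster: colorsys.hsv_to_rgb(i / len(clusters), 0.8, 0.9)
        for i, cluster in enumerate(clusters)
    }
    return {name: cluster2color[name2cluster[name]] for name in names}


name2cluster = {"a.fasta": 1, "b.fasta": 1, "c.fasta": 2}
name2color = color_by_cluster(list(name2cluster), name2cluster)
print(name2color["a.fasta"] == name2color["b.fasta"])  # True
print(name2color["a.fasta"] == name2color["c.fasta"])  # False
```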
-
drep.d_analyze.
gen_color_list
(names, name2cluster)¶ Make a list of colors the same length as names, based on their cluster
-
drep.d_analyze.
get_highest_self
(db, genomes, min=0.0001)¶ Return the highest ANI value resulting from comparing a genome to itself
-
drep.d_analyze.
mash_dendrogram_from_wd
(wd, plot_dir=False)¶ From the wd and kwargs, call plot_MASH_dendrogram
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.
normalize
(df)¶ Normalize all columns in df to 0-1 except ‘genome’ or ‘location’
Parameters: df – DataFrame Returns: Normalized DataFrame Return type: DataFrame
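Scaling columns to the 0-1 range is commonly done with min-max normalization. The sketch below assumes that method and uses a dictionary of lists in place of a DataFrame; the function name `min_max_normalize` and the handling of constant columns (mapped to 0.0) are illustrative choices, not dRep's documented behavior.

```python
def min_max_normalize(table, skip=("genome", "location")):
    """Rescale every numeric column to [0, 1], leaving the skipped columns untouched."""
    normalized = {}
    for column, values in table.items():
        if column in skip:
            normalized[column] = list(values)
            continue
        low, high = min(values), max(values)
        span = high - low
        # A constant column has no range, so map it to 0.0 to avoid dividing by zero
        normalized[column] = [(v - low) / span if span else 0.0 for v in values]
    return normalized


table = {"genome": ["a", "b", "c"], "length": [1000, 3000, 5000]}
print(min_max_normalize(table)["length"])  # [0.0, 0.5, 1.0]
```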
-
drep.d_analyze.
plot_ANIn_vs_ANIn_cov
(Ndb)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_ANIn_vs_len
(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_MASH_dendrogram
(Mdb, Cdb, linkage, threshold=False, plot_dir=False)¶ Make a dendrogram of the primary clustering
Parameters: - Mdb – DataFrame of Mash comparison results; make sure loaded not as categories
- Cdb – DataFrame of Clustering results
- linkage – Result of scipy.cluster.hierarchy.linkage
- threshold (optional) – Line to plot on x-axis
- plot_dir (optional) – Location to store plot
Returns: Makes and shows plot
-
drep.d_analyze.
plot_MASH_vs_ANIn_ani
(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_MASH_vs_ANIn_cov
(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_MASH_vs_len
(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_MASH_vs_secondary_ani
(Mdb, Ndb, Cdb, exclude_zero_MASH=True)¶ Makes plot and returns plt.gcf()
All parameters are obvious
-
drep.d_analyze.
plot_binscoring_from_wd
(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_winner_scoring_complex
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.
plot_clustertest
(linkage, names, wd, **kwargs)¶ DEPRECATED
names can be obtained like this: db = db.pivot("reference", "querry", "ani"); names = list(db.columns)
-
drep.d_analyze.
plot_scatterplots
(Mdb, Ndb, Cdb, plot_dir=False)¶ Make scatterplots comparing genome comparison algorithms
- plot_MASH_vs_ANIn_ani(Mdb, Ndb) - Plot MASH_ani vs. ANIn_ani (including correlation)
- plot_MASH_vs_ANIn_cov(Mdb, Ndb) - Plot MASH_ani vs. ANIn_cov (including correlation)
- plot_ANIn_vs_ANIn_cov(Ndb) - Plot ANIn vs. ANIn_cov (including correlation)
- plot_MASH_vs_len(Mdb, Ndb) - Plot MASH_ani vs. length_difference (including correlation)
- plot_ANIn_vs_len(Mdb, Ndb) - Plot ANIn vs. length_difference (including correlation)
Parameters: - Mdb – DataFrame of Mash comparison results
- Ndb – DataFrame of secondary clustering results
- Cdb – DataFrame of Clustering results
- plot_dir (optional) – Location to store plot
Returns: Makes and shows plot
-
drep.d_analyze.
plot_scatterplots_from_wd
(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_scatterplots
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.
plot_secondary_dendrograms_from_wd
(wd, plot_dir, **kwargs)¶ From the wd and kwargs, make the secondary dendrograms
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Makes plot
-
drep.d_analyze.
plot_secondary_mds_from_wd
(wd, plot_dir, **kwargs)¶ Make a .pdf of MDS of each cluster
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Makes plot
-
drep.d_analyze.
plot_winner_scoring_complex
(Wdb, Sdb, Cdb, Gdb, plot_dir=False, **kwargs)¶ Make a plot showing the genome scoring for all genomes
Parameters: - Wdb – DataFrame of winning dereplicated genomes
- Sdb – Scores of all genomes
- Cdb – DataFrame of Clustering results
- Gdb – DataFrame of genome scoring information
- plot_dir (optional) – Location to store plot
Returns: makes plot
-
drep.d_analyze.
plot_winners
(Wdb, Gdb, Wndb, Wmdb, Widb, plot_dir=False, **kwargs)¶ Make a bunch of plots about the de-replicated genomes
THIS REALLY NEEDS TO BE IMPROVED UPON
-
drep.d_analyze.
plot_winners_from_wd
(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_winners
Parameters: - wd – WorkDirectory
- plot_dir – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
drep.WorkDirectory¶
This module provides access to the workDirectory
The directory layout:
workDirectory
./data
...../MASH_files/
...../ANIn_files/
...../gANI_files/
...../Clustering_files/
...../checkM/
........./genomes/
........./checkM_outdir/
...../prodigal/
./figures
./data_tables
...../Bdb.csv # Sequence locations and filenames
...../Mdb.csv # Raw results of MASH comparisons
...../Ndb.csv # Raw results of ANIn comparisons
...../Cdb.csv # Genomes and cluster designations
...../Chdb.csv # CheckM results for Bdb
...../Sdb.csv # Scoring information
...../Wdb.csv # Winning genomes
./dereplicated_genomes
./log
...../logger.log
...../cluster_arguments.json
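The five top-level folders in the layout above can be created with a few lines of standard-library code. This sketch only illustrates the structure in a temporary directory; the helper name `make_first_levels` is hypothetical and it does not reproduce what the WorkDirectory class does internally.

```python
import os
import tempfile

# Top-level folder names taken from the layout above
FIRST_LEVELS = ["data", "figures", "data_tables", "dereplicated_genomes", "log"]


def make_first_levels(location):
    """Create the top-level folders under location and return what exists there."""
    for name in FIRST_LEVELS:
        # exist_ok=True makes the call safe to repeat on an existing directory
        os.makedirs(os.path.join(location, name), exist_ok=True)
    return sorted(os.listdir(location))


with tempfile.TemporaryDirectory() as tmp:
    print(make_first_levels(tmp))
    # ['data', 'data_tables', 'dereplicated_genomes', 'figures', 'log']
```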
-
class
drep.WorkDirectory.
WorkDirectory
(location)¶ Bases:
object
Object to interact with the workDirectory
Parameters: location (str) – location to make the workDirectory -
firstLevels
= ['data', 'figures', 'data_tables', 'dereplicated_genomes', 'log']¶
-
get_cluster
(name)¶ Get the cluster passed in
Parameters: name – name of the cluster Returns: cluster
-
get_db
(name, return_none=True, forPlotting=False)¶ Get database from self.data_tables
Parameters: - name – name of dataframe
- return_none – if True will return None if database not found; otherwise assert False
- forPlotting – if True don’t do fancy dType loading; it messes with order of names for dendrograms
-
get_dir
(dir)¶ Get the location of one of the named directory types
Parameters: dir – Name of directory to find Returns: Location of requested directory Return type: string
-
get_loc
(what)¶ Get the location of Things
Parameters: what – string of what to get the location of Returns: location of what Return type: string
-
get_primary_linkage
()¶ Get the primary linkage cluster
-
hasDb
(db)¶ If db is in the data_tables, return True
-
import_arguments
(loc)¶ Given the location of the log directory, load it
-
import_clusters
(loc)¶ Given the location of the cluster files, load them
-
import_data_tables
(loc)¶ Given the location of the datatables, load them
-
load_cached
()¶ The wrapper to load everything it has into attributes
-
make_fileStructure
()¶ Make the top level file structure
-
store_db
(db, name, overwrite=None)¶ Store a dataframe in the workDirectory
Will make a physical copy in the datatables folder
Parameters: - db – pandas dataframe to store
- name – name to store it under (will add .csv automatically)
- overwrite – if True, overwrite if DataFrame with same name already exists
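The storage behavior described for store_db (write the table as name.csv, honoring an overwrite flag) can be sketched with the csv module. Everything here is an assumption for illustration: the helper name `store_table`, the list-of-dicts input standing in for a DataFrame, and in particular the choice to raise FileExistsError when overwrite is not set, which the description above does not specify.

```python
import csv
import os
import tempfile


def store_table(rows, name, data_tables_dir, overwrite=False):
    """Write rows (a non-empty list of dicts) to <data_tables_dir>/<name>.csv."""
    path = os.path.join(data_tables_dir, name + ".csv")
    # Assumption: refuse to clobber an existing table unless overwrite is set
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


with tempfile.TemporaryDirectory() as tmp:
    stored = store_table([{"genome": "a.fasta", "score": 83.0}], "Sdb", tmp)
    print(os.path.basename(stored))  # Sdb.csv
```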
-
store_special
(name, thing)¶ Store special items in the work directory
Parameters: - name – what to store
- thing – actual thing to store
-