drep package¶
Submodules¶
drep.WorkDirectory module¶
This module provides access to the workDirectory
The directory layout:
workDirectory
./data
...../MASH_files/
...../ANIn_files/
...../gANI_files/
...../Clustering_files/
...../checkM/
........./genomes/
........./checkM_outdir/
...../prodigal/
./figures
./data_tables
...../Bdb.csv # Sequence locations and filenames
...../Mdb.csv # Raw results of MASH comparisons
...../Ndb.csv # Raw results of ANIn comparisons
...../Cdb.csv # Genomes and cluster designations
...../Chdb.csv # CheckM results for Bdb
...../Sdb.csv # Scoring information
...../Wdb.csv # Winning genomes
./dereplicated_genomes
./log
...../logger.log
...../cluster_arguments.json
-
class
drep.WorkDirectory.WorkDirectory(location)¶ Bases:
objectObject to interact with the workDirectory
Parameters: location (str) – location to make the workDirectory -
firstLevels= ['data', 'figures', 'data_tables', 'dereplicated_genomes', 'log']¶
-
get_cluster(name)¶ Get the cluster passed in
Parameters: name – name of the cluster Returns: cluster
-
get_db(name, return_none=True)¶ Get database from self.data_tables
Parameters: - name – name of dataframe
- return_none – if True will return None if database not found; otherwise assert False
-
get_dir(dir)¶ Get the location of one of the named directory types
Parameters: dir – Name of directory to find Returns: Location of requested directory Return type: string
-
get_loc(what)¶ Get the location of Things
Parameters: what – string of what to get the location of Returns: location of what Return type: string
-
get_primary_linkage()¶ Get the primary linkage cluster
-
hasDb(db)¶ If db is in the data_tables, return True
-
import_arguments(loc)¶ Given the location of the log directory, load it
-
import_clusters(loc)¶ Given the location of the cluster files, load them
-
import_data_tables(loc)¶ Given the location of the datatables, load them
-
load_cached()¶ The wrapper to load everything it has into attributes
-
make_fileStructure()¶ Make the top level file structure
-
store_db(db, name, overwrite=False)¶ Store a dataframe in the workDirectory
Will make a physical copy in the datatables folder
Parameters: - db – pandas dataframe to store
- name – name to store it under (will add .csv automatically)
- overwrite – if True, overwrite if DataFrame with same name already exists
-
store_special(name, thing)¶ Store special items in the work directory
Parameters: - name – what to store
- thing – actual thing to store
-
drep.argumentParser module¶
dRep- parse command-line arguemnts
-
class
drep.argumentParser.SmartFormatter(prog, indent_increment=2, max_help_position=24, width=None)¶ Bases:
argparse.ArgumentDefaultsHelpFormatter
-
drep.argumentParser.VERSION= '2.0.0_beta'¶
-
drep.argumentParser.parse_args(args)¶
-
drep.argumentParser.printHelp()¶
-
drep.argumentParser.version()¶
drep.controller module¶
Controller- takes input from argparse and calls correct modules
-
class
drep.controller.Controller¶ Bases:
object-
adjust_operation(**kwargs)¶
-
analyze_operation(**kwargs)¶
-
bonus_operation(**kwargs)¶
-
choose_operation(**kwargs)¶
-
cluster_operation(**kwargs)¶
-
compare_operation(**kwargs)¶
-
dereplicate_operation(**kwargs)¶
-
evaluate_operation(**kwargs)¶
-
filter_operation(**kwargs)¶
-
loadDefaultArgs()¶
-
parseArguments(args)¶ Parse user options and call the correct pipeline
-
setup_logger(loc)¶ set up logger such that DEBUG goes only to file, rest go to file and console
-
-
drep.controller.version()¶
drep.d_adjust module¶
-
drep.d_adjust.accounce_changes(newWdb, oriWdb)¶
-
drep.d_adjust.adjust_cluster_wrapper(wd, **kwargs)¶
-
drep.d_adjust.cluster_type(cluster)¶
-
drep.d_adjust.d_adjust_wrapper(wd, **kwargs)¶
-
drep.d_adjust.remove_cluster_wrapper(wd, **kwargs)¶
-
drep.d_adjust.remove_primary_cluster(Rcluster, wd, **kwargs)¶
-
drep.d_adjust.remove_secondary_cluster(Rcluster, wd, **kwargs)¶
-
drep.d_adjust.test_adjust()¶
drep.d_analyze module¶
d_analyze - a subset of drep
Make plots based on de-replication
-
drep.d_analyze.calc_dist(x1, y1, x2, y2)¶ Return distance from two points
Args: self explainatory
Returns: distance Return type: int
-
drep.d_analyze.cluster_test_wrapper(wd, **kwargs)¶ DEPRICATED
-
drep.d_analyze.d_analyze_wrapper(wd, **kwargs)¶ Controller for the dRep analyze operation
Parameters: - wd – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: plots – List of plots to make [list of ints, 1-6]
Returns: Makes some plots
-
drep.d_analyze.fancy_dendrogram(linkage, names, name2color=False, threshold=False, self_thresh=False)¶ Make a fancy dendrogram
-
drep.d_analyze.gen_color_dictionary(names, name2cluster)¶ Make the dictionary name2color
Parameters: - names – key in the returned dictionary
- name2cluster – a dictionary of name to it’s cluster
Returns: name -> color
Return type: dict
-
drep.d_analyze.gen_color_list(names, name2cluster)¶ Make a list of colors the same length as names, based on their cluster
-
drep.d_analyze.get_highest_self(db, genomes, min=0.0001)¶ Return the highest ANI value resulting from comparing a genome to itself
-
drep.d_analyze.mash_dendrogram_from_wd(wd, plot_dir=False)¶ From the wd and kwargs, call plot_MASH_dendrogram
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.normalize(df)¶ Normalize all columns in df to 0-1 except ‘genome’ or ‘location’
Parameters: df – DataFrame Returns: Nomralized Return type: DataFrame
-
drep.d_analyze.plot_ANIn_vs_ANIn_cov(Ndb)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_ANIn_vs_len(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_MASH_dendrogram(Mdb, Cdb, linkage, threshold=False, plot_dir=False)¶ Make a dendrogram of the primary clustering
Parameters: - Mdb – DataFrame of Mash comparison results
- Cdb – DataFrame of Clustering results
- linkage – Result of scipy.cluster.hierarchy.linkage
- threshold (optional) – Line to plot on x-axis
- plot_dir (optional) – Location to store plot
Returns: Makes and shows plot
-
drep.d_analyze.plot_MASH_vs_ANIn_ani(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_MASH_vs_ANIn_cov(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_MASH_vs_len(Mdb, Ndb, exclude_zero_MASH=True)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_MASH_vs_secondary_ani(Mdb, Ndb, Cdb, exclude_zero_MASH=True)¶ Makes plot and retuns plt.cgf()
All parameters are obvious
-
drep.d_analyze.plot_binscoring_from_wd(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_winner_scoring_complex
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.plot_clustertest(linkage, names, wd, **kwargs)¶ DEPREICATED
names can be gotten like: db = db.pivot(“reference”,”querry”,”ani”) names = list(db.columns)
-
drep.d_analyze.plot_scatterplots(Mdb, Ndb, Cdb, plot_dir=False)¶ Make scatterplots comparing genome comparison algorithms
- plot_MASH_vs_ANIn_ani(Mdb, Ndb) - Plot MASH_ani vs. ANIn_ani (including correlation)
- plot_MASH_vs_ANIn_cov(Mdb, Ndb) - Plot MASH_ani vs. ANIn_cov (including correlation)
- plot_ANIn_vs_ANIn_cov(Mdb, Ndb) - Plot ANIn vs. ANIn_cov (including correlation)
- plot_MASH_vs_len(Mdb, Ndb) - Plot MASH_ani vs. length_difference (including correlation)
- plot_ANIn_vs_len(Ndb) - Plot ANIn vs. length_difference (including correlation)
Parameters: - Mdb – DataFrame of Mash comparison results
- Ndb – DataFrame of secondary clustering results
- Cdb – DataFrame of Clustering results
- plot_dir (optional) – Location to store plot
Returns: Makes and shows plot
-
drep.d_analyze.plot_scatterplots_from_wd(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_scatterplots
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
-
drep.d_analyze.plot_secondary_dendrograms_from_wd(wd, plot_dir, **kwargs)¶ From the wd and kwargs, make the secondary dendrograms
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Makes plot
-
drep.d_analyze.plot_secondary_mds_from_wd(wd, plot_dir, **kwargs)¶ Make a .pdf of MDS of each cluster
Parameters: - wd – WorkDirectory
- plot_dir (optional) – Location to store figure
Returns: Makes plot
-
drep.d_analyze.plot_winner_scoring_complex(Wdb, Sdb, Cdb, Gdb, plot_dir=False, **kwargs)¶ Make a plot showing the genome scoring for all genomes
Parameters: - Wdb – DataFrame of winning dereplicated genomes
- Sdb – Scores of all genomes
- Cdb – DataFrame of Clustering results
- Gdb – DataFrame of genome scoring information
- plot_dir (optional) – Location to store plot
Returns: makes plot
-
drep.d_analyze.plot_winners(Wdb, Gdb, Wndb, Wmdb, Widb, plot_dir=False, **kwargs)¶ Make a bunch of plots about the de-replicated genomes
THIS REALLY NEEDS IMPROVED UPON
-
drep.d_analyze.plot_winners_from_wd(wd, plot_dir, **kwargs)¶ From the wd and kwargs, call plot_winners
Parameters: - wd – WorkDirectory
- plot_dir – Location to store figure
Returns: Shows plot, makes a plot in the plot_dir
drep.d_bonus module¶
-
drep.d_bonus.add_taxonomy(Bdb, Tdb)¶
-
drep.d_bonus.check_dependencies(wd, **kwargs)¶ For all possible dependencies, see if you can find them
-
drep.d_bonus.d_bonus_wrapper(wd, **kwargs)¶
-
drep.d_bonus.find_program(dep)¶ return location of progrgam, works = True/False (based on calling the help)
-
drep.d_bonus.gen_centrifuge_cmd(genes, cent, **kwargs)¶
-
drep.d_bonus.gen_phylo_db(hits)¶
-
drep.d_bonus.gen_read2bin(gene_files)¶
-
drep.d_bonus.get_scaff(read)¶
-
drep.d_bonus.lineage_from_taxId(t)¶
-
drep.d_bonus.parse_centrifuge(Bdb, cent_dir, **kwargs)¶
-
drep.d_bonus.parse_centrifuge_percent(Bdb, cent_dir, **kwargs)¶
-
drep.d_bonus.parse_raw_centrifuge(hits, report)¶
-
drep.d_bonus.parse_taxonomy(Bdb, cent_dir, **kwargs)¶ take the centrifuge directory and Bdb, return Tdb and Bdb with an added ‘taxonomy’ column
-
drep.d_bonus.run_centrifuge(Bdb, prod_dir, cent_dir, **kwargs)¶
-
drep.d_bonus.run_taxonomy(wd, **kwargs)¶
-
drep.d_bonus.tdb_from_hits(hits, minPerc=50, testing=False)¶ Determines the lowest taxonomic level with at least minPerc certainty
- For every hit:
- reconstruct the lineage (kingdom, phylum, class, ect.) add a count to every rank in the lineage
- For every rank:
- see if the number of hits matching one taxa at that rank is above the minPerc the denominator for this equation is the number of hits that have a phyla rank
- Note: this is complicated because some lower ranks don’t have higher ranks
- For example, species [Eubacterium] rectale (taxID 39491) has no genus Also, species [artifical construct] (taxID 32630) has no anything but species
-
drep.d_bonus.test_bonus()¶
-
drep.d_bonus.validate_arguments(wd, **kwargs)¶ make sure you have everything you need
drep.d_choose module¶
d_choose - a subset of drep
Choose best genome from each cluster
-
drep.d_choose.choose_winners(Cdb, Gdb, **kwargs)¶ Make a scoring database and pick the winner of each cluster
Parameters: - Cdb – clustering database
- Gdb – genome information database
Keyword Arguments: wrapper (See) –
Returns: [Sdb (scoring database), Wdb (winner database)]
Return type: List
-
drep.d_choose.d_choose_wrapper(wd, **kwargs)¶ Controller for the dRep choose operation
Based off of the formula: A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)
A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight
Parameters: - wd (WorkDirectory) – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: - genomeInfo – .csv genomeInfo file
- noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended)
- checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
- completeness_weight – see formula
- contamination_weight – see formula
- strain_heterogeneity_weight – see formula
- N50_weight – see formula
- size_weight – see formula
Returns: Makes Sdb (scoreDb) in the workDirectory
-
drep.d_choose.pick_winners(Sdb, Cdb)¶ Based on clustering and scores, pick the best genome from every cluster
Parameters: - Sdb – score of every genome
- Cdb – clustering
Returns: Wdb (winner database)
Return type: DataFrame
-
drep.d_choose.score_genomes(genomes, Gdb, **kwargs)¶ Calculate the scores for a list of genomes
Parameters: - genomes – list of genomes
- Gdb – genome information database
Keyword Arguments: wrapper (See) –
Returns: Sdb (scoring database)
Return type: DataFrame
-
drep.d_choose.score_row(row, **kwargs)¶ Perform the scoring of a row based on kwargs
Parameters: row – row of genome information
Keyword Arguments: - noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended)
- completeness_weight – see formula
- contamination_weight – see formula
- strain_heterogeneity_weight – see formula
- N50_weight – see formula
- size_weight – see formula
Returns: score
Return type: float
drep.d_cluster module¶
d_cluster - a subset of dRep
Clusters a list of genomes with both primary and secondary clustering
-
drep.d_cluster.add_avani(db)¶ add a column titled ‘av_ani’ to the passed in dataframe
dataframe must have rows reference, querey, and ani
Parameters: db – dataframe
-
drep.d_cluster.all_vs_all_MASH(Bdb, data_folder, **kwargs)¶ Run MASH pairwise within all samples in Bdb
Parameters: - Bdb – dataframe with genome, location
- data_folder – location to store output files
Keyword Arguments: - MASH_sketch – size of mash sketches
- dry – dont actually run anything
- processors – number of processors to multithread with
- mash_exe – location of mash excutible (will try and find with shutil if not provided)
- groupSize – max number of mash sketches to hold in each folder
- debug – if True, log all of the commands
- wd – if you want to log commands, you also need the wd
-
drep.d_cluster.cluster_genomes(genome_list, data_folder, **kwargs)¶ Clusters a set of genomes using the dRep primary and secondary clustering method
Takes a number of command line arguments and returns a couple pandas dataframes. Done in a number of steps
Parameters: - genomes – list of genomes to be clustered
- data_folder – location where MASH and ANIn data will be stored
Keyword Arguments: - processors – Threads to use with checkM / prodigal
- overwrite – Overwrite existing data in the work folder
- debug – If True, make extra output when running external scripts
- MASH_sketch – MASH sketch size
- S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn}
- n_PRESET – Presets to pass to nucmer {normal, tight}
- P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9)
- S_ANI – ANI threshold to form secondary clusters (default: 0.99)
- SkipMash – Skip MASH clustering, just do secondary clustering on all genomes
- SkipSecondary – Skip secondary clustering, just perform MASH clustering
- COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1)
- coverage_method – Method to calculate coverage of an alignment {total,larger}
- CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average)
- n_c – nucmer argument c
- n_maxgap – nucmer argument maxgap
- n_noextend – nucmer argument noextend
- n_method – nucmer argument method
- n_preset – preset nucmer arrangements: {tight,normal}
- wd – workDirectory (needed to store clustering results and run prodigal)
Returns: [Mdb(db of primary clustering), Ndb(db of secondary clustering, Cdb(clustering information))]
Return type: list
-
drep.d_cluster.cluster_hierarchical(db, linkage_method='single', linkage_cutoff=0.1)¶ Perform hierarchical clustering on a symmetrical distiance matrix
Parameters: - db – result of db.pivot usually
- linkage_method – passed to scipy.cluster.hierarchy.fcluster
- linkage_cutoff – distance to draw the clustering line (default = .1)
Returns: [Cdb, linkage]
Return type: list
-
drep.d_cluster.cluster_mash_database(db, **kwargs)¶ From a Mash database, cluster and return Cdb
Parameters: db – Mdb (all_vs_all Mash results)
Keyword Arguments: - clusterAlg – how to cluster database (default = single)
- P_ani – threshold to cluster at (default = 0.9)
Returns: [Cdb, [linkage, linkage_db, arguments]]
Return type: list
-
drep.d_cluster.compare_genomes(bdb, algorithm, data_folder, **kwargs)¶ Compare a list of genomes using the algorithm specified
This method takes in bdb (a table with the columns location and genome), runs pair-wise comparisons between all genomes in the sample, and returns a table with at least the columns ‘reference’, ‘querry’, ‘ani’,’coverage’, depending on what algorithm is called
Parameters: - bdb – DataFrame with [‘genome’, ‘location’] (drep.d_filter.load_genomes)
- algorithm – options are ANImf, ANIn, gANI
- data_folder – location to store output files
Keyword Arguments: - wd – either this or prod_folder needed for gANI
- prod_folder – either this or wd needed for gANI
Returns: Ndb ([‘reference’, ‘querry’, ‘ani’,’coverage’])
Return type: DataFrame
-
drep.d_cluster.d_cluster_wrapper(workDirectory, **kwargs)¶ Controller for the dRep cluster operation
Validates arguments, calls cluster_genomes, then stores the output
Parameters: - wd (WorkDirectory) – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: - processors – Threads to use with checkM / prodigal
- overwrite – Overwrite existing data in the work folder
- debug – If True, make extra output when running external scripts
- MASH_sketch – MASH sketch size
- S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn}
- n_PRESET – Presets to pass to nucmer {normal, tight}
- P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9)
- S_ANI – ANI threshold to form secondary clusters (default: 0.99)
- SkipMash – Skip MASH clustering, just do secondary clustering on all genomes
- SkipSecondary – Skip secondary clustering, just perform MASH clustering
- COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1)
- coverage_method – Method to calculate coverage of an alignment {total,larger}
- CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average)
- genomes – genomes to cluster in .fasta format. Not necessary if already loaded sequences with the “filter” operation
Returns: Stores Cdb, Mdb, Ndb, Bdb
-
drep.d_cluster.estimate_time(comps, alg)¶ Estimate time, in minutes, based on comparison algorithm and number of comparisons
Parameters: - comps – number of genomes comparisons to perform
- alg – algorthm used
Returns: time to perfom comparison (in minutes)
Return type: float
-
drep.d_cluster.gen_animf_cmd(prefix, ref, querry, **kwargs)¶ return animf command. It will be a single string, with the format: “nucmer cmd ; filter cmd”
Parameters: - prefix – desired file output name
- ref – location of reference genome
- querry – location of querry genome
Keyword Arguments: all – passed on to gen_nucmer_cmd
Returns: format is “nucmer cmd ; filter cmd”
Return type: string
-
drep.d_cluster.gen_filter_cmd(delta, out)¶ return delta-filter command
Parameters: - delta – desired .delta file to filter
- out – desired output file
Returns: cmd to run
Return type: list
-
drep.d_cluster.gen_gANI_cmd(file, g1, g2, exe)¶ Generate a command to run gANI (ANIcalculator)
Will return [] if g1 == g2
Parameters: - file – output file name
- g1 – location of the genes of genome1
- g2 – location of the genes of genome2
- exe – location of gANI executible
Returns: command to run
Return type: list
-
drep.d_cluster.gen_nucmer_cmd(prefix, ref, querry, c='65', noextend=False, maxgap='90', method='mum')¶ Generate command to run with nucmer
Parameters: - prefix – desired output name (will have .delta appended)
- ref – location of reference genome
- querry – location of querry genomes
Keyword Arguments: - c – c value
- noextend – either True or False
- maxgap – maxgap
- method – detault is ‘mum’
Returns: command to run number
Return type: list
-
drep.d_cluster.genome_hierarchical_clustering(Ndb, **kwargs)¶ Cluster ANI database
Parameters: Ndb – result of secondary clustering
Keyword Arguments: - clusterAlg – how to cluster the database (default = single)
- S_ani – thershold to cluster at (default = .99)
- cov_thresh – minumum coverage to be included in clustering (default = .5)
- cluster – name of the cluster
- comp_method – comparison algorithm used
Returns: [Cdb, {cluster:[linkage, linkage_db, arguments]}]
Return type: list
-
drep.d_cluster.iteratre_clusters(Bdb, Cdb, id='primary_cluster')¶ An iterator: Given Bdb and Cdb, yeild smaller Bdb’s in the same cluster
Parameters: - Bdb – [genome, location]
- Cdb – [genome, id]
- id – what to iterate on (default = ‘primary_cluster’)
Returns: [d(subset of b), cluster(name of cluster)]
Return type: list
-
drep.d_cluster.load_genomes(genome_list)¶ Takes a list of genome locations, returns a pandas dataframe with the locations and the basename of the genomes
Parameters: genome_list – list of genome locations Returns: pandas dataframe with columns [‘genome’, ‘location’] Return type: DataFrame
-
drep.d_cluster.make_linkage_Ndb(Ndb, **kwargs)¶ Filter the Ndb in accordance with the kwargs. Average reciprical ANI values
Parameters: Ndb – result of secondary clutsering Keyword Arguments: cov_thresh – minimum coverage threshold (default = 0.5) Returns: pivoted DataFrame ready for clustering Return type: DataFrame
-
drep.d_cluster.parse_delta(filename)¶ Parse a .delta file from nucmer
Parameters: filename – location of .delta file Returns: [alignment_length, similarity_errors] Return type: list
-
drep.d_cluster.parse_gani_file(file)¶ Parse gANI file, return dictionary of results
Parameters: file – location of gANI file Returns: results in the gANI file Return type: dict
-
drep.d_cluster.process_deltafiles(deltafiles, org_lengths, logger=None, **kwargs)¶ Parse a list of delta files into a pandas dataframe
Files NEED to be named in the format genome1_vs_genome2.delta, or genome1_vs_genome2.filtered.delta
Parameters: - deltafiles – list of .delta files
- org_lengths – dictionary of genome to genome length
Keyword Arguments: - coverage_method – default is larger
- logger – if not None, will log 0 errors
Returns: information about the alignments
Return type: DataFrame
-
drep.d_cluster.process_gani_files(files)¶ From a list of gANI output files, return a parsed DataFrame
Parameters: list – files Returns: Ndb Return type: DataFrame
-
drep.d_cluster.run_pairwise_ANImf(genome_list, ANIn_folder, **kwargs)¶ Given a list of genomes and an output folder, compare all genomes using ANImf
Parameters: - genome_list – list of locations of genome files
- ANIn_folder – folder to store the output of comparison
Keyword Arguments: - processors – threads to use
- debug – if true save extra output
- wd – needed if debug is True
-
drep.d_cluster.run_pairwise_ANIn(genome_list, ANIn_folder, **kwargs)¶ Given a list of genomes and an output folder, compare all genomes using ANImf
Parameters: - genome_list – list of locations of genome files
- ANIn_folder – folder to store the output of comparison
Keyword Arguments: - processors – threads to use
- debug – if true save extra output
- wd – needed if debug is True
-
drep.d_cluster.run_pairwise_gANI(bdb, gANI_folder, prod_folder, **kwargs)¶ Run pairwise gANI on a list of Genomes
Parameters: - bdb – DataFrame with [‘genome’, ‘location’]
- gANI_folder – folder to store gANI output
- prod_folder – folder containing prodigal output from genomes (will run if needed)
Keyword Arguments: - debug – log all of the commands
- wd – if you want to log commands, you also need the wd
- processors – threads to use
Returns: Ndb for gANI
Return type: DataFrame
drep.d_evaluate module¶
-
drep.d_evaluate.comp_str(val)¶
-
drep.d_evaluate.compare_winners(wd, **kwargs)¶
-
drep.d_evaluate.con_str(val)¶
-
drep.d_evaluate.d_evaluate_wrapper(wd, **kwargs)¶
-
drep.d_evaluate.evaluate_warnings(wd, **kwargs)¶
-
drep.d_evaluate.evaluate_winners(wd, **kwrags)¶
-
drep.d_evaluate.test_evaluate()¶
drep.d_filter module¶
d_filter - a subset of drep
Filter genomes based on genome length or quality. Also can run prodigal and checkM
-
drep.d_filter.calc_fasta_length(fasta_loc)¶ Calculate the length of the .fasta file and retun length
Parameters: fasta_loc – location of .fasta file Returns: total length of all .fasta files Return type: int
-
drep.d_filter.calc_genome_info(genomes: list)¶ Calculate the length and N50 of a list of genome locations
Parameters: genomes – list of locations of genomes Returns: pandas dataframe with [“location”, “length”, “N50”, “genome”] Return type: DataFrame
-
drep.d_filter.calc_n50(loc)¶ Calculate the N50 of a .fasta file
Parameters: fasta_loc – location of .fasta file. Returns: N50 of .fasta file. Return type: int
-
drep.d_filter.chdb_to_genomeInfo(chdb)¶ Convert the output of checkM (chdb) into genomeInfo
Parameters: chdb – dataframe of checkM Returns: genomeInfo Return type: DataFrame
-
drep.d_filter.d_filter_wrapper(wd, **kwargs)¶ Controller for the dRep filter operation
Parameters: - wd (WorkDirectory) – The current workDirectory
- **kwargs – Command line arguments
Keyword Arguments: - genomes – genomes to filter in .fasta format
- genomeInfo – location of .csv file with the columns: [“genome”(basename of .fasta file of that genome), “completeness”(0-100 value for completeness of the genome), “contamination”(0-100 value of the contamination of the genome)]
- processors – Threads to use with checkM / prodigal
- overwrite – Overwrite existing data in the work folder
- debug – If True, make extra output when running external scripts
- length – minimum genome length when filtering
- completeness – minimum genome completeness when filtering
- contamination – maximum genome contamination when filtering
- noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended)
- checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
Returns: stores Bdb.csv, Chdb.csv, and GenomeInfo.csv in the work directory
Return type: Nothing
-
drep.d_filter.filter_bdb(bdb, Gdb, **kwargs)¶ Filter bdb based on Gdb
Parameters: - bdb – DataFrame with [“genome”]
- Gdb – DataFrame with [“genome”, “completeness”, “contamination”]
Keyword Arguments: - min_comp – Minimum genome completeness (%)
- max_con – Maximum genome contamination (%)
Returns: bdb filtered based on completeness and contamination
Return type: DataFrame
-
drep.d_filter.run_checkM(genome_folder, checkm_outf, **kwargs)¶ Run checkM
WARNING- this will result in wrong genome lenth and genome N50 estimate, due to it being run on prodigal output
Parameters: - genome_folder – location of folder to run checkM on - should be full of files ending in .faa (result of prodigal)
- checkm_outf – location of folder to store checkM output
Keyword Arguments: - processors – number of threads
- checkm_method – either lineage_wf or taxonomy_wf
- debug – log all of the commands
- wd – if you want to log commands, you also need the wd
-
drep.d_filter.run_prodigal(genome_list, out_dir, **kwargs)¶ Run prodigal on a set of genomes, store the output in the out_dir
Parameters: - genome_list – list of genomes to run prodigal on
- out_dir – output directory to store prodigal output
Keyword Arguments: - processors – number of processors to multithread with
- exe_loc – location of prodigal excutible (will try and find with shutil if not provided)
- debug – log all of the commands
- wd – if you want to log commands, you also need the wd
-
drep.d_filter.validate_chdb(Chdb, bdb)¶ Make sure all genomes in bdb are in Chdb
Parameters: - Chdb – dataframe of checkM information
- bdb – dataframe with [‘genome’]
drep.d_workflows module¶
-
drep.d_workflows.compare_wrapper(wd, **kwargs)¶
-
drep.d_workflows.dereplicate_wrapper(wd, **kwargs)¶
-
drep.d_workflows.validate_compare(wd, **kwargs)¶
-
drep.d_workflows.validate_dereplicate(wd, **kwargs)¶
Module contents¶
-
drep.clobber_dir(outdirname, dry=False, overwrite=False)¶
-
drep.get_exe(name)¶ Given the name of a program, return the path to the exe if I can find it
Parameters: name – name of program Returns: location to program exe Return type: string
-
drep.make_dir(outdirname, dry=False, overwrite=False)¶
-
drep.run_cmd(cmd, dry=False, shell=True, logdir=False)¶
-
drep.thread_cmd_wrapper(tup)¶
-
drep.thread_cmds(cmds, dry=False, shell=False, logdir=False, t=10)¶