dRep API¶

This allows you to call the internal methods of dRep using your own python program

drep.d_filter¶

d_filter - a subset of drep

Filter genomes based on genome length or quality. Also can run prodigal and checkM

drep.d_filter.calc_fasta_length(fasta_loc)¶

Calculate the length of the .fasta file and retun length

Parameters:	fasta_loc – location of .fasta file
Returns:	total length of all .fasta files
Return type:	int

drep.d_filter.calc_genome_info(genomes: list)¶

Calculate the length and N50 of a list of genome locations

Parameters:	genomes – list of locations of genomes
Returns:	pandas dataframe with [“location”, “length”, “N50”, “genome”]
Return type:	DataFrame

drep.d_filter.calc_n50(loc)¶

Calculate the N50 of a .fasta file

Parameters:	fasta_loc – location of .fasta file.
Returns:	N50 of .fasta file.
Return type:	int

drep.d_filter.chdb_to_genomeInfo(chdb)¶

Convert the output of checkM (chdb) into genomeInfo

Parameters:	chdb – dataframe of checkM
Returns:	genomeInfo
Return type:	DataFrame

drep.d_filter.d_filter_wrapper(wd, **kwargs)¶

Controller for the dRep filter operation

Keyword Arguments:
Parameters:	wd (WorkDirectory) – The current workDirectory **kwargs – Command line arguments
	genomes – genomes to filter in .fasta format genomeInfo – location of .csv file with the columns: [“genome”(basename of .fasta file of that genome), “completeness”(0-100 value for completeness of the genome), “contamination”(0-100 value of the contamination of the genome)] processors – Threads to use with checkM / prodigal overwrite – Overwrite existing data in the work folder debug – If True, make extra output when running external scripts length – minimum genome length when filtering completeness – minimum genome completeness when filtering contamination – maximum genome contamination when filtering noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended) checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster)
Returns:	stores Bdb.csv, Chdb.csv, and GenomeInfo.csv in the work directory
Return type:	Nothing

drep.d_filter.filter_bdb(bdb, Gdb, **kwargs)¶

Filter bdb based on Gdb

Keyword Arguments:
Parameters:	bdb – DataFrame with [“genome”] Gdb – DataFrame with [“genome”, “completeness”, “contamination”]
	min_comp – Minimum genome completeness (%) max_con – Maximum genome contamination (%)
Returns:	bdb filtered based on completeness and contamination
Return type:	DataFrame

drep.d_filter.run_checkM(genome_folder, checkm_outf, **kwargs)¶

Run checkM

WARNING- this will result in wrong genome lenth and genome N50 estimate, due to it being run on prodigal output

Keyword Arguments:
Parameters:	genome_folder – location of folder to run checkM on - should be full of files ending in .faa (result of prodigal) checkm_outf – location of folder to store checkM output
	processors – number of threads checkm_method – either lineage_wf or taxonomy_wf debug – log all of the commands wd – if you want to log commands, you also need the wd

drep.d_filter.run_prodigal(genome_list, out_dir, **kwargs)¶

Run prodigal on a set of genomes, store the output in the out_dir

Keyword Arguments:
Parameters:	genome_list – list of genomes to run prodigal on out_dir – output directory to store prodigal output
	processors – number of processors to multithread with exe_loc – location of prodigal excutible (will try and find with shutil if not provided) debug – log all of the commands wd – if you want to log commands, you also need the wd

drep.d_filter.validate_chdb(Chdb, bdb)¶

Make sure all genomes in bdb are in Chdb

Parameters:	Chdb – dataframe of checkM information bdb – dataframe with [‘genome’]

drep.d_cluster¶

d_cluster - a subset of dRep

Clusters a list of genomes with both primary and secondary clustering

drep.d_cluster.add_avani(db)¶

add a column titled ‘av_ani’ to the passed in dataframe

dataframe must have rows reference, querey, and ani

Parameters:	db – dataframe

drep.d_cluster.all_vs_all_MASH(Bdb, data_folder, **kwargs)¶

Run MASH pairwise within all samples in Bdb

Parameters:

Bdb – dataframe with genome, location
data_folder – location to store output files

Keyword Arguments:

MASH_sketch – size of mash sketches
dry – dont actually run anything
processors – number of processors to multithread with
mash_exe – location of mash excutible (will try and find with shutil if not provided)
groupSize – max number of mash sketches to hold in each folder
debug – if True, log all of the commands
wd – if you want to log commands, you also need the wd

drep.d_cluster.cluster_genomes(genome_list, data_folder, **kwargs)¶

Clusters a set of genomes using the dRep primary and secondary clustering method

Takes a number of command line arguments and returns a couple pandas dataframes. Done in a number of steps

Keyword Arguments:
Parameters:	genomes – list of genomes to be clustered data_folder – location where MASH and ANIn data will be stored
	processors – Threads to use with checkM / prodigal overwrite – Overwrite existing data in the work folder debug – If True, make extra output when running external scripts MASH_sketch – MASH sketch size S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn} n_PRESET – Presets to pass to nucmer {normal, tight} P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9) S_ANI – ANI threshold to form secondary clusters (default: 0.99) SkipMash – Skip MASH clustering, just do secondary clustering on all genomes SkipSecondary – Skip secondary clustering, just perform MASH clustering COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1) coverage_method – Method to calculate coverage of an alignment {total,larger} CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average) n_c – nucmer argument c n_maxgap – nucmer argument maxgap n_noextend – nucmer argument noextend n_method – nucmer argument method n_preset – preset nucmer arrangements: {tight,normal} wd – workDirectory (needed to store clustering results and run prodigal)
Returns:	[Mdb(db of primary clustering), Ndb(db of secondary clustering, Cdb(clustering information))]
Return type:	list

drep.d_cluster.cluster_hierarchical(db, linkage_method='single', linkage_cutoff=0.1)¶

Perform hierarchical clustering on a symmetrical distiance matrix

Parameters:	db – result of db.pivot usually linkage_method – passed to scipy.cluster.hierarchy.fcluster linkage_cutoff – distance to draw the clustering line (default = .1)
Returns:	[Cdb, linkage]
Return type:	list

drep.d_cluster.cluster_mash_database(db, **kwargs)¶

From a Mash database, cluster and return Cdb

Keyword Arguments:
Parameters:	db – Mdb (all_vs_all Mash results)
	clusterAlg – how to cluster database (default = single) P_ani – threshold to cluster at (default = 0.9)
Returns:	[Cdb, [linkage, linkage_db, arguments]]
Return type:	list

drep.d_cluster.compare_genomes(bdb, algorithm, data_folder, **kwargs)¶

Compare a list of genomes using the algorithm specified

This method takes in bdb (a table with the columns location and genome), runs pair-wise comparisons between all genomes in the sample, and returns a table with at least the columns ‘reference’, ‘querry’, ‘ani’,’coverage’, depending on what algorithm is called

Keyword Arguments:
Parameters:	bdb – DataFrame with [‘genome’, ‘location’] (drep.d_filter.load_genomes) algorithm – options are ANImf, ANIn, gANI data_folder – location to store output files
	wd – either this or prod_folder needed for gANI prod_folder – either this or wd needed for gANI
Returns:	Ndb ([‘reference’, ‘querry’, ‘ani’,’coverage’])
Return type:	DataFrame

drep.d_cluster.d_cluster_wrapper(workDirectory, **kwargs)¶

Controller for the dRep cluster operation

Validates arguments, calls cluster_genomes, then stores the output

Keyword Arguments:
Parameters:	wd (WorkDirectory) – The current workDirectory **kwargs – Command line arguments
	processors – Threads to use with checkM / prodigal overwrite – Overwrite existing data in the work folder debug – If True, make extra output when running external scripts MASH_sketch – MASH sketch size S_algorithm – Algorithm for secondary clustering comaprisons {ANImf,gANI,ANIn} n_PRESET – Presets to pass to nucmer {normal, tight} P_ANI – ANI threshold to form primary (MASH) clusters (default: 0.9) S_ANI – ANI threshold to form secondary clusters (default: 0.99) SkipMash – Skip MASH clustering, just do secondary clustering on all genomes SkipSecondary – Skip secondary clustering, just perform MASH clustering COV_THRESH – Minmum level of overlap between genomes when doing secondary comparisons (default: 0.1) coverage_method – Method to calculate coverage of an alignment {total,larger} CLUSTERALG – Algorithm used to cluster genomes (passed to scipy.cluster.hierarchy.linkage (default: average) genomes – genomes to cluster in .fasta format. Not necessary if already loaded sequences with the “filter” operation
Returns:	Stores Cdb, Mdb, Ndb, Bdb

drep.d_cluster.estimate_time(comps, alg)¶

Estimate time, in minutes, based on comparison algorithm and number of comparisons

Parameters:	comps – number of genomes comparisons to perform alg – algorthm used
Returns:	time to perfom comparison (in minutes)
Return type:	float

drep.d_cluster.gen_animf_cmd(prefix, ref, querry, **kwargs)¶

return animf command. It will be a single string, with the format: “nucmer cmd ; filter cmd”

Keyword Arguments:
Parameters:	prefix – desired file output name ref – location of reference genome querry – location of querry genome
	all – passed on to gen_nucmer_cmd
Returns:	format is “nucmer cmd ; filter cmd”
Return type:	string

drep.d_cluster.gen_filter_cmd(delta, out)¶

return delta-filter command

Parameters:	delta – desired .delta file to filter out – desired output file
Returns:	cmd to run
Return type:	list

drep.d_cluster.gen_gANI_cmd(file, g1, g2, exe)¶

Generate a command to run gANI (ANIcalculator)

Will return [] if g1 == g2

Parameters:	file – output file name g1 – location of the genes of genome1 g2 – location of the genes of genome2 exe – location of gANI executible
Returns:	command to run
Return type:	list

drep.d_cluster.gen_nucmer_cmd(prefix, ref, querry, c='65', noextend=False, maxgap='90', method='mum')¶

Generate command to run with nucmer

Keyword Arguments:
Parameters:	prefix – desired output name (will have .delta appended) ref – location of reference genome querry – location of querry genomes
	c – c value noextend – either True or False maxgap – maxgap method – detault is ‘mum’
Returns:	command to run number
Return type:	list

drep.d_cluster.genome_hierarchical_clustering(Ndb, **kwargs)¶

Cluster ANI database

Keyword Arguments:
Parameters:	Ndb – result of secondary clustering
	clusterAlg – how to cluster the database (default = single) S_ani – thershold to cluster at (default = .99) cov_thresh – minumum coverage to be included in clustering (default = .5) cluster – name of the cluster comp_method – comparison algorithm used
Returns:	[Cdb, {cluster:[linkage, linkage_db, arguments]}]
Return type:	list

drep.d_cluster.iteratre_clusters(Bdb, Cdb, id='primary_cluster')¶

An iterator: Given Bdb and Cdb, yeild smaller Bdb’s in the same cluster

Parameters:	Bdb – [genome, location] Cdb – [genome, id] id – what to iterate on (default = ‘primary_cluster’)
Returns:	[d(subset of b), cluster(name of cluster)]
Return type:	list

drep.d_cluster.load_genomes(genome_list)¶

Takes a list of genome locations, returns a pandas dataframe with the locations and the basename of the genomes

Parameters:	genome_list – list of genome locations
Returns:	pandas dataframe with columns [‘genome’, ‘location’]
Return type:	DataFrame

drep.d_cluster.make_linkage_Ndb(Ndb, **kwargs)¶

Filter the Ndb in accordance with the kwargs. Average reciprical ANI values

Keyword Arguments:
Parameters:	Ndb – result of secondary clutsering
	cov_thresh – minimum coverage threshold (default = 0.5)
Returns:	pivoted DataFrame ready for clustering
Return type:	DataFrame

drep.d_cluster.parse_delta(filename)¶

Parse a .delta file from nucmer

Parameters:	filename – location of .delta file
Returns:	[alignment_length, similarity_errors]
Return type:	list

drep.d_cluster.parse_gani_file(file)¶

Parse gANI file, return dictionary of results

Parameters:	file – location of gANI file
Returns:	results in the gANI file
Return type:	dict

drep.d_cluster.process_deltafiles(deltafiles, org_lengths, logger=None, **kwargs)¶

Parse a list of delta files into a pandas dataframe

Files NEED to be named in the format genome1_vs_genome2.delta, or genome1_vs_genome2.filtered.delta

Keyword Arguments:
Parameters:	deltafiles – list of .delta files org_lengths – dictionary of genome to genome length
	coverage_method – default is larger logger – if not None, will log 0 errors
Returns:	information about the alignments
Return type:	DataFrame

drep.d_cluster.process_gani_files(files)¶

From a list of gANI output files, return a parsed DataFrame

Parameters:	list – files
Returns:	Ndb
Return type:	DataFrame

drep.d_cluster.run_pairwise_ANImf(genome_list, ANIn_folder, **kwargs)¶

Given a list of genomes and an output folder, compare all genomes using ANImf

Keyword Arguments:
Parameters:	genome_list – list of locations of genome files ANIn_folder – folder to store the output of comparison
	processors – threads to use debug – if true save extra output wd – needed if debug is True

drep.d_cluster.run_pairwise_ANIn(genome_list, ANIn_folder, **kwargs)¶

Given a list of genomes and an output folder, compare all genomes using ANImf

Keyword Arguments:
Parameters:	genome_list – list of locations of genome files ANIn_folder – folder to store the output of comparison
	processors – threads to use debug – if true save extra output wd – needed if debug is True

drep.d_cluster.run_pairwise_gANI(bdb, gANI_folder, prod_folder, **kwargs)¶

Run pairwise gANI on a list of Genomes

Keyword Arguments:
Parameters:	bdb – DataFrame with [‘genome’, ‘location’] gANI_folder – folder to store gANI output prod_folder – folder containing prodigal output from genomes (will run if needed)
	debug – log all of the commands wd – if you want to log commands, you also need the wd processors – threads to use
Returns:	Ndb for gANI
Return type:	DataFrame

drep.d_choose¶

d_choose - a subset of drep

Choose best genome from each cluster

drep.d_choose.choose_winners(Cdb, Gdb, **kwargs)¶

Make a scoring database and pick the winner of each cluster

Keyword Arguments:
Parameters:	Cdb – clustering database Gdb – genome information database
	wrapper (See) –
Returns:	[Sdb (scoring database), Wdb (winner database)]
Return type:	List

drep.d_choose.d_choose_wrapper(wd, **kwargs)¶

Controller for the dRep choose operation

Based off of the formula: A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size)

A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight

Keyword Arguments:
Parameters:	wd (WorkDirectory) – The current workDirectory **kwargs – Command line arguments
	genomeInfo – .csv genomeInfo file noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended) checkM_method – Either lineage_wf (more accurate) or taxonomy_wf (faster) completeness_weight – see formula contamination_weight – see formula strain_heterogeneity_weight – see formula N50_weight – see formula size_weight – see formula
Returns:	Makes Sdb (scoreDb) in the workDirectory

drep.d_choose.pick_winners(Sdb, Cdb)¶

Based on clustering and scores, pick the best genome from every cluster

Parameters:	Sdb – score of every genome Cdb – clustering
Returns:	Wdb (winner database)
Return type:	DataFrame

drep.d_choose.score_genomes(genomes, Gdb, **kwargs)¶

Calculate the scores for a list of genomes

Keyword Arguments:
Parameters:	genomes – list of genomes Gdb – genome information database
	wrapper (See) –
Returns:	Sdb (scoring database)
Return type:	DataFrame

drep.d_choose.score_row(row, **kwargs)¶

Perform the scoring of a row based on kwargs

Keyword Arguments:
Parameters:	row – row of genome information
	noQualityFiltering – Don’t run checkM or do any quality-based filtering (not recommended) completeness_weight – see formula contamination_weight – see formula strain_heterogeneity_weight – see formula N50_weight – see formula size_weight – see formula
Returns:	score
Return type:	float

drep.d_analyze¶

d_analyze - a subset of drep

Make plots based on de-replication

drep.d_analyze.calc_dist(x1, y1, x2, y2)¶

Return distance from two points

Args: self explainatory

Returns:	distance
Return type:	int

drep.d_analyze.cluster_test_wrapper(wd, **kwargs)¶: DEPRICATED

drep.d_analyze.d_analyze_wrapper(wd, **kwargs)¶

Controller for the dRep analyze operation

Keyword Arguments:
Parameters:	wd – The current workDirectory **kwargs – Command line arguments
	plots – List of plots to make [list of ints, 1-6]
Returns:	Makes some plots

drep.d_analyze.fancy_dendrogram(linkage, names, name2color=False, threshold=False, self_thresh=False)¶: Make a fancy dendrogram

drep.d_analyze.gen_color_dictionary(names, name2cluster)¶

Make the dictionary name2color

Parameters:	names – key in the returned dictionary name2cluster – a dictionary of name to it’s cluster
Returns:	name -> color
Return type:	dict

drep.d_analyze.gen_color_list(names, name2cluster)¶: Make a list of colors the same length as names, based on their cluster

drep.d_analyze.get_highest_self(db, genomes, min=0.0001)¶: Return the highest ANI value resulting from comparing a genome to itself

drep.d_analyze.mash_dendrogram_from_wd(wd, plot_dir=False)¶

From the wd and kwargs, call plot_MASH_dendrogram

Parameters:	wd – WorkDirectory plot_dir (optional) – Location to store figure
Returns:	Shows plot, makes a plot in the plot_dir

drep.d_analyze.normalize(df)¶

Normalize all columns in df to 0-1 except ‘genome’ or ‘location’

Parameters:	df – DataFrame
Returns:	Nomralized
Return type:	DataFrame

drep.d_analyze.plot_ANIn_vs_ANIn_cov(Ndb)¶