Appendix
Glossary
This section defines terms used by phyddle:
| Term | Definition |
|---|---|
| accuracy | How well the neural network predicts the training example labels. For categorical data, accuracy is the frequency with which the predicted label matches the true example label. |
| calibration dataset | A subset of examples withheld from the training examples, used to calibrate the prediction intervals obtained with conformalized quantile regression so they attain the desired coverage properties. |
| calibrated prediction interval (CPI) | A prediction interval calibrated to have the desired coverage properties (e.g. coverage of p=0.95) by adjusting uncalibrated prediction intervals with a calibration dataset that is not used during the training procedure itself. See Romano et al. (2019) and the calibration sketch following this table. |
| compact bijective ladderized vector (CBLV or CBLV+S) | A compact representation for a phylogenetic tree of serially sampled taxa. CBLV ladderizes the vector elements based on whichever clade contains the taxon with the youngest age, and records 2N elements corresponding to node heights. See Voznica et al. (2022). |
| compact diversity-reordered vector (CDV or CDV+S) | A compact representation for a phylogenetic tree of extant-only taxa. CDV ladderizes the vector elements based on whichever clade has the greatest clade length (sum of branch lengths), and records N elements corresponding to internal node heights. See Lambert et al. (2022). Note: the original CDV formulation includes a second row for integer-encoded tip-state data for a single binary character. We have replaced this row with the generalizable +S extension for multiple characters/states as described in Thompson et al. (2023). |
| compact phylogenetic vector (CPV or CPV+S) | A compact representation for a phylogenetic tree, encoded using either CBLV or CDV criteria. See Voznica et al. (2022) and Lambert et al. (2022). Any CPV may also include states, yielding the CPV+S format. See Thompson et al. (2023). |
| conformalized quantile regression | A machine learning technique for estimating the lower and upper bounds of a prediction interval at a given confidence level (e.g. p=0.95) such that the intervals contain the true parameter value with frequency p over the entire dataset of training examples. See Romano et al. (2019) and the calibration sketch following this table. |
| convolutional neural network (CNN) | A neural network with multiple layers designed to summarize spatial information in the data patterns using convolution and pooling transformations. CNNs are specialized for extracting information from translation-invariant data patterns, such as images or [in our case] phylogenetic-state tensors. |
| coverage | The probability (e.g. p=0.95) that an interval estimated for a new dataset contains the true parameter value. Assumes the relevant datasets were generated under the assumed model. |
| epoch | One full pass through the training dataset while minimizing the loss function. An epoch may contain multiple smaller steps, including stochastic batch sampling. |
| feed-forward neural network (FFNN) | A neural network with multiple layers designed to extract information from highly structured data, such as numbers in a data table or [in our case] summary statistics. |
| integer encoding | Representing the value of a K-state categorical variable with a single integer. This representation requires little space but implies an ordering among categories (see the encoding sketch following this table). |
| label | A value to be predicted from a data pattern by a neural network. Labels may be training examples in the Train step or estimated quantities in the Estimate step. |
| loss function | The function that computes the average distance between the actual and predicted label values across training examples. Mean squared error (MSE) and mean absolute error (MAE) are commonly used. |
| loss | The value being minimized during training, computed with the loss function. |
| mean absolute error | The mean over all absolute errors between each training example label and the predicted label from the network. |
| mean squared error | The mean over all squared errors between each training example label and the predicted label from the network. |
| neural network | A graphical model composed of extremely large numbers of nodes and edges with a predictable structure. The structure generally involves a series of layers, with dense connectivity between nodes in adjacent layers and no connectivity among nodes in the same layer or in non-adjacent layers. |
| one-hot encoding | Representing the value of a K-state categorical variable with K binary values, where a single one-hot variable is marked as 1 while all other one-hot variables are marked as 0. This representation requires more space but eliminates any ordering among categories (see the encoding sketch following this table). |
| overtraining | When the neural network's prediction accuracy continues to increase for the training dataset while showing no improvement for the validation and/or test datasets. |
| phylogenetic model | A stochastic model that defines a set of evolutionary events and rates that can generate (1) a phylogeny, (2) character data, or (3) both (1) and (2). |
| phylogenetic-state tensor | The tensor containing all compact phylogenetic vector + states (CPV+S) data. |
| project | Directories sharing information across pipeline stages for a single phyddle analysis. |
| simulated replicate | The dataset generated from a single run of a simulator. |
| simulator | A program that can generate new datasets under a fully specified model. |
| step | A major set of tasks for a phyddle pipeline analysis. The steps are: Simulate, Format, Train, Estimate, and Plot. |
| supervised learning | A method for training a neural network by providing it training examples of how label values are correlated with data values. |
| test dataset | A subset of examples withheld from the training examples, used to test network prediction accuracy. |
| training dataset | Used to train the network. Includes all remaining training examples not used for the test, validation, or calibration datasets, and is usually much larger than the other three datasets. |
| training examples | The collected examples of data patterns and corresponding labels used to train the network for its prediction task. |
| training | Minimizing the loss function for a given neural network and training dataset. |
| tree width | The number of columns in a phylogenetic-state tensor. |
| undertraining | When the neural network's prediction accuracy can still be improved for both the training dataset and the validation and/or test datasets. |
| validation dataset | A subset of examples withheld from the training examples, used to validate network performance, namely to diagnose overtraining of the network. |
| workspace | Directory that organizes files across all steps, analyses, and projects. |
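To make the two encoding entries above concrete, here is a minimal sketch (using NumPy; not phyddle's internal code) that encodes a K=3-state character scored for four taxa both ways:

```python
import numpy as np

K = 3                              # number of states for the character
states = np.array([0, 2, 1, 2])    # integer encoding: one value per taxon

# One-hot encoding: K binary values per taxon, exactly one of which is 1.
one_hot = np.zeros((states.size, K), dtype=int)
one_hot[np.arange(states.size), states] = 1

print(states)   # [0 2 1 2]
print(one_hot)  # [[1 0 0]
                #  [0 0 1]
                #  [0 1 0]
                #  [0 0 1]]
```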
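Similarly, the calibration entries above can be summarized with a short sketch of the symmetric conformalized quantile regression adjustment from Romano et al. (2019). This is an illustration under simplified assumptions, not phyddle's implementation; the function `calibrate_cpi` and its arguments are hypothetical names:

```python
import numpy as np

def calibrate_cpi(y_cal, lo_cal, hi_cal, lo_new, hi_new, coverage=0.95):
    """Symmetrically widen or narrow uncalibrated intervals (lo, hi) so the
    calibrated intervals cover the true value with frequency ~coverage."""
    # Conformity score: how far the true value falls outside its interval
    # (negative when it falls inside).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    # Finite-sample-corrected quantile of the calibration scores.
    n = len(y_cal)
    q = np.quantile(scores, min(np.ceil((n + 1) * coverage) / n, 1.0))
    # Apply the same adjustment to both bounds of the new intervals.
    # In practice, lo_new/hi_new come from datasets not used for calibration.
    return lo_new - q, hi_new + q

# Toy usage: intervals that are too wide get narrowed toward ~95% coverage.
rng = np.random.default_rng(1)
y = rng.normal(size=1000)
lo = y - rng.uniform(0.5, 1.5, size=1000)   # uncalibrated lower bounds
hi = y + rng.uniform(0.5, 1.5, size=1000)   # uncalibrated upper bounds
lo_c, hi_c = calibrate_cpi(y, lo, hi, lo, hi, coverage=0.95)
print(np.mean((y >= lo_c) & (y <= hi_c)))   # approximately 0.95
```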
Table of Settings
This table summarizes all settings currently available in phyddle. The Setting column is the exact name of the string that appears in the configuration file and command-line argument list. The Step(s) column identifies all steps that use the setting: [S]imulate, [F]ormat, [T]rain, [E]stimate, and [P]lot. The Type column is the Python variable type expected for the setting. The Description column gives a brief description of what the setting does. Visit Overview to learn more about how phyddle settings impact different pipeline analysis steps. Hypothetical configuration and network sketches follow the table below.
| Setting | Step(s) | Type | Description |
|---|---|---|---|
|  | ––––– | str | Config file name |
|  | SFTEP | str | Pipeline step(s) defined with (S)imulate, (F)ormat, (T)rain, (E)stimate, (P)lot, or (A)ll |
|  | SFTEP | str | Verbose output to screen? |
|  | ––––– | str | Write default config file |
|  | ––––– | str | Save and zip a project for sharing |
|  | ––––– | str | Unzip a shared project |
|  | ––––– | str | Remove step directories for a project |
|  | ––––– | int | Number of simulated examples to save with --save_proj |
|  | ––––– | str | Save formatted training examples with --save_proj? (not recommended) |
|  | SFTEP | int | Number of digits (precision) for numbers in output files |
|  | SF––– | str | Use parallelization? (recommended) |
|  | ––TE– | str | Use CUDA parallelization? (recommended; requires an Nvidia GPU) |
|  | SFT–– | int | Number of cores for multiprocessing (-N for all but N) |
|  | ––––– | –– | Disable Format/Estimate steps for empirical data? |
|  | ––––– | –– | Disable Format/Estimate steps for simulated data? |
|  | SFTEP | str | Parent directory for all step directories unless a step directory is given |
|  | SF––– | str | Directory for raw simulated data |
|  | SF––– | str | Directory for raw empirical data |
|  | –FTEP | str | Directory for tensor-formatted data |
|  | –FTEP | str | Directory for trained networks and training output |
|  | ––TEP | str | Directory for new datasets and estimates |
|  | ––––P | str | Directory for plotted results |
|  | SFTEP | str | Directory for logs of analysis metadata |
|  | SFTEP | str | Prefix for all output unless a step prefix is given |
|  | SF––– | str | Prefix for raw simulated data |
|  | SF––– | str | Prefix for raw empirical data |
|  | –FTEP | str | Prefix for tensor-formatted data |
|  | –FTEP | str | Prefix for trained networks and training output |
|  | ––TEP | str | Prefix for estimate results |
|  | ––––P | str | Prefix for plotted results |
|  | S–––– | str | Simulation command to run a single job (see documentation) |
|  | S–––– | str | Simulation logging style |
|  | SF––– | int | Start replicate index for simulated training dataset |
|  | SF––– | int | End replicate index for simulated training dataset |
|  | S–––– | int | Add more simulations with auto-generated indices |
|  | S–––– | int | Number of replicates per simulation command |
|  | –F––– | str | Encode all simulated replicates into tensor? |
|  | –FTE– | int | Number of characters |
|  | –FTE– | int | Number of states per character |
|  | –F––– | int | Minimum number of taxa allowed when formatting |
|  | –F––– | int | Maximum number of taxa allowed when formatting |
|  | –FTE– | str | Downsampling strategy for taxon count |
|  | –FTEP | int | Width of phylo-state tensor |
|  | –FTE– | str | Encoding strategy for tree |
|  | –FTE– | str | Encoding strategy for branch lengths |
|  | –FTE– | str | Encoding strategy for character data |
|  | –FTE– | dict | Model parameters and variables to estimate |
|  | –FTE– | dict | Model parameters and variables treated as data |
|  | –FTE– | str | File format for character data |
|  | –FTEP | str | File format for training example tensors |
|  | –F––– | str | Save encoded phylogenetic tensor to CSV? |
|  | ––TEP | int | Number of training epochs |
|  | ––TEP | int | Number of consecutive validation loss gains before early stopping |
|  | ––TEP | int | Training batch size |
|  | –FT–– | float | Proportion of data used as test examples (to assess trained network performance) |
|  | ––T–– | float | Proportion of data used as validation examples (to diagnose network overtraining) |
|  | ––T–– | float | Proportion of data used as calibration examples (to calibrate CPIs) |
|  | ––T–– | float | Expected coverage percent for calibrated prediction intervals (CPIs) |
|  | ––T–– | str | Use asymmetric (True) or symmetric (False) adjustments for CPIs? |
|  | ––T–– | str | Loss function for real-valued estimates |
|  | ––T–– | str | Method used for optimizing the neural network |
|  | –FTEP | float | Offset size c when taking ln(x+c) for zero-valued variables |
|  | ––T–– | int[] | Output channel sizes for plain convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for dilate convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for dense layers for auxiliary data input |
|  | ––T–– | int[] | Output channel sizes for dense layers for label outputs |
|  | ––T–– | int[] | Kernel sizes for plain convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Kernel sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Kernel sizes for dilate convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Stride sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Dilation sizes for dilate convolutional layers for phylogenetic state input |
|  | ––––P | str | Plotting color for training data elements |
|  | ––––P | str | Plotting color for test data elements |
|  | ––––P | str | Plotting color for validation data elements |
|  | ––––P | str | Plotting color for label elements |
|  | ––––P | str | Plotting color for auxiliary data elements |
|  | ––––P | str | Plotting color for empirical elements |
|  | ––––P | int | Number of examples in scatter plot |
|  | ––––P | int | Minimum number of empirical datasets to plot densities |
|  | ––––P | int | Number of empirical results to plot |
|  | ––––P | float | Scale of Gaussian noise to add to PCA plot |
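Each setting above may be supplied through the configuration file or as a command-line argument. As a hypothetical illustration only, a configuration fragment might look like the sketch below; the setting names shown are illustrative stand-ins, so consult the phyddle documentation for the exact strings:

```python
# Hypothetical phyddle config fragment. phyddle reads settings as key-value
# pairs; the names below are illustrative stand-ins, so check the phyddle
# documentation for the exact setting strings.
args = {
    'step'         : 'SFTEP',  # which pipeline step(s) to run
    'num_char'     : 1,        # number of characters
    'num_states'   : 2,        # number of states per character
    'tree_width'   : 500,      # width of the phylo-state tensor
    'num_epochs'   : 20,       # number of training epochs
    'prop_test'    : 0.05,     # proportion of examples used for testing
    'prop_val'     : 0.05,     # proportion used for validation
    'prop_cal'     : 0.20,     # proportion used for CPI calibration
    'cpi_coverage' : 0.95,     # target coverage for prediction intervals
}
```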
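The channel, kernel, stride, and dilation settings listed above shape the network built during the Train step. The following PyTorch sketch shows, under our own simplifying assumptions (plain convolutions only; the class and argument names are hypothetical), how such lists of sizes could define a CNN branch for the phylo-state tensor plus dense branches for auxiliary data and label outputs. It is not phyddle's actual model code:

```python
import torch
import torch.nn as nn

class PhyloStateNet(nn.Module):
    def __init__(self, num_rows, num_aux, num_labels,
                 phy_channels=(64, 96, 128), phy_kernels=(3, 5, 7),
                 aux_sizes=(128, 64), lbl_sizes=(128, 64)):
        super().__init__()
        # Convolutional branch: 1D convolutions scan across the tree-width
        # axis of the phylo-state tensor (rows = CPV+S entries, cols = taxa).
        layers, n_in = [], num_rows
        for n_out, k in zip(phy_channels, phy_kernels):
            layers += [nn.Conv1d(n_in, n_out, k, padding='same'), nn.ReLU()]
            n_in = n_out
        self.phy = nn.Sequential(*layers, nn.AdaptiveAvgPool1d(1), nn.Flatten())
        phy_out = n_in
        # Dense branch summarizes the auxiliary (summary statistic) input.
        layers, n_in = [], num_aux
        for n_out in aux_sizes:
            layers += [nn.Linear(n_in, n_out), nn.ReLU()]
            n_in = n_out
        self.aux = nn.Sequential(*layers)
        # Concatenated features feed dense layers that output label estimates.
        layers, n_in = [], phy_out + n_in
        for n_out in lbl_sizes:
            layers += [nn.Linear(n_in, n_out), nn.ReLU()]
            n_in = n_out
        self.head = nn.Sequential(*layers, nn.Linear(n_in, num_labels))

    def forward(self, phy, aux):
        return self.head(torch.cat([self.phy(phy), self.aux(aux)], dim=1))

# Toy usage: 7 CPV+S rows, tree width 500, 20 summary stats, 3 labels.
net = PhyloStateNet(num_rows=7, num_aux=20, num_labels=3)
out = net(torch.randn(8, 7, 500), torch.randn(8, 20))
print(out.shape)  # torch.Size([8, 3])
```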
References
EE Goldberg, LT Lancaster, RH Ree. 2011. Phylogenetic inference of reciprocal effects between geographic range evolution and diversification. Syst Biol 60:451-465. doi: https://doi.org/10.1093/sysbio/syr046
S Lambert, J Voznica, H Morlon. 2022. Deep learning from phylogenies for diversification analyses. bioRxiv 2022.09.27.509667. doi: https://doi.org/10.1101/2022.09.27.509667
MJ Landis, A Thompson. 2024. phyddle: software for phylogenetic model exploration with deep learning. bioRxiv 2024.08.06.606717. doi: https://doi.org/10.1101/2024.08.06.606717
Y Romano, E Patterson, E Candes. 2019. Conformalized quantile regression. Adv Neural Inf Process Syst 32.
A Thompson, B Liebeskind, EJ Scully, MJ Landis. 2023. Deep learning approaches to viral phylogeography are fast and as robust as likelihood methods to model misspecification. bioRxiv 2023.02.08.527714. doi: https://doi.org/10.1101/2023.02.08.527714
TG Vaughan, AJ Drummond. 2013. A stochastic simulator of birth–death master equations with application to phylodynamics. Mol Biol Evol 30:1480–1493. doi: https://doi.org/10.1093/molbev/mst057
J Voznica, A Zhukova, V Boskova, E Saulnier, F Lemoine, M Moslonka-Lefebvre, O Gascuel. 2022. Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks. Nat Commun 13:3896. doi: https://doi.org/10.1038/s41467-022-31511-0
About
Thanks for your interest in phyddle. The phyddle project emerged from a phylogenetic deep learning study led by Ammon Thompson (paper). The goal of phyddle is to provide users with a generalizable pipeline workflow for phylogenetic modeling and deep learning. This will hopefully make it easier for phylogenetic model enthusiasts and developers to explore and apply models that do not have tractable likelihood functions. It is also intended for use by methods developers who want to characterize how deep learning methods perform under different conditions for standard phylogenetic estimation tasks.
The phyddle project is developed by Michael Landis and Ammon Thompson.
Issues & Feedback
Please use Issues to report bugs or request features that require modifying the phyddle source code. Please contact Michael Landis for troubleshooting support with phyddle.