Appendix
Glossary
This section defines terms used by phyddle:
| Term | Definition |
|---|---|
| accuracy | How well the neural network predicts the training example labels. For categorical data, accuracy is the frequency with which the predicted label matches the true example label. |
| calibration dataset | A subset of examples withheld from the training examples, used to calibrate the prediction intervals obtained with conformalized quantile regression so they attain the desired coverage properties. |
| calibrated prediction interval (CPI) | A prediction interval calibrated to have the desired coverage properties (e.g. coverage of p=0.95) by adjusting uncalibrated prediction intervals with a calibration dataset that is not used during the training procedure itself. See Romano et al. (2019) and the calibration sketch following this table. |
| compact bijective ladderized vector (CBLV or CBLV+S) | A compact representation for a phylogenetic tree of serially sampled taxa. CBLV ladderizes the vector elements based on whichever clade contains the taxon with the youngest age, and records 2N elements corresponding to node heights. See Voznica et al. (2022). |
| compact diversity-reordered vector (CDV or CDV+S) | A compact representation for a phylogenetic tree of extant-only taxa. CDV ladderizes the vector elements based on whichever clade has the greatest clade length (sum of branch lengths), and records N elements corresponding to internal node heights. See Lambert et al. (2022). Note: the original CDV formulation includes a second row for integer-encoded tip-state data for a single binary character. We have replaced this row with the generalizable +S extension for multiple characters/states as described in Thompson et al. (2023). |
| compact phylogenetic vector (CPV or CPV+S) | A compact representation for a phylogenetic tree, encoded using either CBLV or CDV criteria. See Voznica et al. (2022) and Lambert et al. (2022). Any CPV may also include states, yielding the CPV+S format. See Thompson et al. (2023). |
| conformalized quantile regression | A machine learning technique for estimating the lower and upper bounds of a prediction interval at a given confidence level (e.g. p=0.95) such that the intervals contain the true parameter value with frequency p over the entire dataset of training examples. See Romano et al. (2019) and the calibration sketch following this table. |
| convolutional neural network (CNN) | A neural network with multiple layers designed to summarize spatial information in the data patterns using convolution and pooling transformations. CNNs are specialized for extracting information from translation-invariant data patterns, such as images or [in our case] phylogenetic-state tensors. |
| coverage | The probability (e.g. p=0.95) that an interval estimated for a new dataset contains the true parameter value. Assumes the relevant datasets were generated under the assumed model. |
| epoch | One full pass through the training dataset while minimizing the loss function. An epoch may contain multiple smaller steps, including stochastic batch sampling. |
| feed-forward neural network (FFNN) | A neural network with multiple layers designed to extract information from highly structured data, such as numbers in a data table or [in our case] summary statistics. |
| integer encoding | Representing the value of a K-state categorical variable with a single integer. This representation requires little space but implies an ordering among categories (see the encoding sketch following this table). |
| label | A value to be predicted from a data pattern by a neural network. Labels may be training examples in the Train step or estimated quantities in the Estimate step. |
| loss function | The function that computes the average distance between the actual and predicted label values across training examples. Mean squared error (MSE) and mean absolute error (MAE) are commonly used. |
| loss | The value being minimized during training, computed with the loss function. |
| mean absolute error | The mean over all absolute errors between each training example label and the predicted label from the network. |
| mean squared error | The mean over all squared errors between each training example label and the predicted label from the network. |
| neural network | A graphical model composed of extremely large numbers of nodes and edges with a predictable structure. The structure generally involves a series of layers, with dense connectivity between nodes in adjacent layers and no connectivity among nodes in the same layer or in non-adjacent layers. |
| one-hot encoding | Representing the value of a K-state categorical variable with K binary values, where a single one-hot variable is marked as 1 while all other one-hot variables are marked as 0. This representation requires more space but eliminates any ordering among categories (see the encoding sketch following this table). |
| overtraining | When the neural network's prediction accuracy continues to increase for the training dataset while showing no improvement for the validation and/or test datasets. |
| phylogenetic model | A stochastic model that defines a set of evolutionary events and rates that can generate (1) a phylogeny, (2) character data, or (3) both (1) and (2). |
| phylogenetic-state tensor | The tensor containing all compact phylogenetic vector + states (CPV+S) data. |
| project | Directories sharing information across pipeline stages for a single phyddle analysis. |
| simulated replicate | The dataset generated from a single run of a simulator. |
| simulator | A program that can generate new datasets under a fully specified model. |
| step | A major set of tasks for a phyddle pipeline analysis. The steps are: Simulate, Format, Train, Estimate, and Plot. |
| supervised learning | A method for training a neural network by providing it training examples of how label values are correlated with data values. |
| test dataset | A subset of examples withheld from the training examples, used to test network prediction accuracy. |
| training dataset | Used to train the network. Includes all remaining training examples not used for the test, validation, or calibration datasets, and is usually much larger than the other three datasets. |
| training examples | The collected examples of data patterns and corresponding labels used to train the network for its prediction task. |
| training | Minimizing the loss function for a given neural network and training dataset. |
| tree width | The number of columns in a phylogenetic-state tensor. |
| undertraining | When the neural network's prediction accuracy can still be improved for both the training dataset and the validation and/or test datasets. |
| validation dataset | A subset of examples withheld from the training examples, used to validate network performance, namely to diagnose overtraining of the network. |
| workspace | Directory that organizes files across all steps, analyses, and projects. |
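To make the two encoding entries above concrete, here is a minimal sketch (using NumPy; not phyddle's internal code) that encodes a K=3-state character scored for four taxa both ways:

```python
import numpy as np

K = 3                              # number of states for the character
states = np.array([0, 2, 1, 2])    # integer encoding: one value per taxon

# One-hot encoding: K binary values per taxon, exactly one of which is 1.
one_hot = np.zeros((states.size, K), dtype=int)
one_hot[np.arange(states.size), states] = 1

print(states)   # [0 2 1 2]
print(one_hot)  # [[1 0 0]
                #  [0 0 1]
                #  [0 1 0]
                #  [0 0 1]]
```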
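Similarly, the calibration entries above can be summarized with a short sketch of the symmetric conformalized quantile regression adjustment from Romano et al. (2019). This is an illustration under simplified assumptions, not phyddle's implementation; the function `calibrate_cpi` and its arguments are hypothetical names:

```python
import numpy as np

def calibrate_cpi(y_cal, lo_cal, hi_cal, lo_new, hi_new, coverage=0.95):
    """Symmetrically widen or narrow uncalibrated intervals (lo, hi) so the
    calibrated intervals cover the true value with frequency ~coverage."""
    # Conformity score: how far the true value falls outside its interval
    # (negative when it falls inside).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    # Finite-sample-corrected quantile of the calibration scores.
    n = len(y_cal)
    q = np.quantile(scores, min(np.ceil((n + 1) * coverage) / n, 1.0))
    # Apply the same adjustment to both bounds of the new intervals.
    # In practice, lo_new/hi_new come from datasets not used for calibration.
    return lo_new - q, hi_new + q

# Toy usage: intervals that are too wide get narrowed toward ~95% coverage.
rng = np.random.default_rng(1)
y = rng.normal(size=1000)
lo = y - rng.uniform(0.5, 1.5, size=1000)   # uncalibrated lower bounds
hi = y + rng.uniform(0.5, 1.5, size=1000)   # uncalibrated upper bounds
lo_c, hi_c = calibrate_cpi(y, lo, hi, lo, hi, coverage=0.95)
print(np.mean((y >= lo_c) & (y <= hi_c)))   # approximately 0.95
```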
Table of Settings
This table summarizes all settings currently available in phyddle. The Setting column is the exact name of the string that appears in the configuration file and command-line argument list. The Step(s) column identifies all steps that use the setting: [S]imulate, [F]ormat, [T]rain, [E]stimate, and [P]lot. The Type column is the Python variable type expected for the setting. The Description column gives a brief description of what the setting does. Visit Overview to learn more about how phyddle settings impact different pipeline analysis steps. Hypothetical configuration and network sketches follow the table below.
| Setting | Step(s) | Type | Description |
|---|---|---|---|
|  | ––––– | str | Config file name |
|  | SFTEP | str | Pipeline step(s) defined with (S)imulate, (F)ormat, (T)rain, (E)stimate, (P)lot, or (A)ll |
|  | SFTEP | str | Verbose output to screen? |
|  | ––––– | str | Write default config file |
|  | ––––– | str | Save and zip a project for sharing |
|  | ––––– | str | Unzip a shared project |
|  | ––––– | str | Remove step directories for a project |
|  | ––––– | int | Number of simulated examples to save with --save_proj |
|  | ––––– | str | Save formatted training examples with --save_proj? (not recommended) |
|  | SFTEP | int | Number of digits (precision) for numbers in output files |
|  | SF––– | str | Use parallelization? (recommended) |
|  | ––TE– | str | Use CUDA parallelization? (recommended; requires an Nvidia GPU) |
|  | SFT–– | int | Number of cores for multiprocessing (-N for all but N) |
|  | ––––– | –– | Disable Format/Estimate steps for empirical data? |
|  | ––––– | –– | Disable Format/Estimate steps for simulated data? |
|  | SFTEP | str | Parent directory for all step directories unless a step directory is given |
|  | SF––– | str | Directory for raw simulated data |
|  | SF––– | str | Directory for raw empirical data |
|  | –FTEP | str | Directory for tensor-formatted data |
|  | –FTEP | str | Directory for trained networks and training output |
|  | ––TEP | str | Directory for new datasets and estimates |
|  | ––––P | str | Directory for plotted results |
|  | SFTEP | str | Directory for logs of analysis metadata |
|  | SFTEP | str | Prefix for all output unless a step prefix is given |
|  | SF––– | str | Prefix for raw simulated data |
|  | SF––– | str | Prefix for raw empirical data |
|  | –FTEP | str | Prefix for tensor-formatted data |
|  | –FTEP | str | Prefix for trained networks and training output |
|  | ––TEP | str | Prefix for estimate results |
|  | ––––P | str | Prefix for plotted results |
|  | S–––– | str | Simulation command to run a single job (see documentation) |
|  | S–––– | str | Simulation logging style |
|  | SF––– | int | Start replicate index for simulated training dataset |
|  | SF––– | int | End replicate index for simulated training dataset |
|  | S–––– | int | Add more simulations with auto-generated indices |
|  | S–––– | int | Number of replicates per simulation command |
|  | –F––– | str | Encode all simulated replicates into tensor? |
|  | –FTE– | int | Number of characters |
|  | –FTE– | int | Number of states per character |
|  | –F––– | int | Minimum number of taxa allowed when formatting |
|  | –F––– | int | Maximum number of taxa allowed when formatting |
|  | –FTE– | str | Downsampling strategy for taxon count |
|  | –FTEP | int | Width of phylo-state tensor |
|  | –FTE– | str | Encoding strategy for tree |
|  | –FTE– | str | Encoding strategy for branch lengths |
|  | –FTE– | str | Encoding strategy for character data |
|  | –FTE– | dict | Model parameters and variables to estimate |
|  | –FTE– | dict | Model parameters and variables treated as data |
|  | –FTE– | str | File format for character data |
|  | –FTEP | str | File format for training example tensors |
|  | –F––– | str | Save encoded phylogenetic tensor to CSV? |
|  | ––TEP | int | Number of training epochs |
|  | ––TEP | int | Number of consecutive validation loss gains before early stopping |
|  | ––TEP | int | Training batch size |
|  | –FT–– | float | Proportion of data used as test examples (to assess trained network performance) |
|  | ––T–– | float | Proportion of data used as validation examples (to diagnose network overtraining) |
|  | ––T–– | float | Proportion of data used as calibration examples (to calibrate CPIs) |
|  | ––T–– | float | Expected coverage percent for calibrated prediction intervals (CPIs) |
|  | ––T–– | str | Use asymmetric (True) or symmetric (False) adjustments for CPIs? |
|  | ––T–– | str | Loss function for real-valued estimates |
|  | ––T–– | str | Method used for optimizing the neural network |
|  | –FTEP | float | Offset size c when taking ln(x+c) for zero-valued variables |
|  | ––T–– | int[] | Output channel sizes for plain convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for dilate convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Output channel sizes for dense layers for auxiliary data input |
|  | ––T–– | int[] | Output channel sizes for dense layers for label outputs |
|  | ––T–– | int[] | Kernel sizes for plain convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Kernel sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Kernel sizes for dilate convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Stride sizes for stride convolutional layers for phylogenetic state input |
|  | ––T–– | int[] | Dilation sizes for dilate convolutional layers for phylogenetic state input |
|  | ––––P | str | Plotting color for training data elements |
|  | ––––P | str | Plotting color for test data elements |
|  | ––––P | str | Plotting color for validation data elements |
|  | ––––P | str | Plotting color for label elements |
|  | ––––P | str | Plotting color for auxiliary data elements |
|  | ––––P | str | Plotting color for empirical elements |
|  | ––––P | int | Number of examples in scatter plot |
|  | ––––P | int | Minimum number of empirical datasets to plot densities |
|  | ––––P | int | Number of empirical results to plot |
|  | ––––P | float | Scale of Gaussian noise to add to PCA plot |
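Each setting above may be supplied through the configuration file or as a command-line argument. As a hypothetical illustration only, a configuration fragment might look like the sketch below; the setting names shown are illustrative stand-ins, so consult the phyddle documentation for the exact strings:

```python
# Hypothetical phyddle config fragment. phyddle reads settings as key-value
# pairs; the names below are illustrative stand-ins, so check the phyddle
# documentation for the exact setting strings.
args = {
    'step'         : 'SFTEP',  # which pipeline step(s) to run
    'num_char'     : 1,        # number of characters
    'num_states'   : 2,        # number of states per character
    'tree_width'   : 500,      # width of the phylo-state tensor
    'num_epochs'   : 20,       # number of training epochs
    'prop_test'    : 0.05,     # proportion of examples used for testing
    'prop_val'     : 0.05,     # proportion used for validation
    'prop_cal'     : 0.20,     # proportion used for CPI calibration
    'cpi_coverage' : 0.95,     # target coverage for prediction intervals
}
```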
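The channel, kernel, stride, and dilation settings listed above shape the network built during the Train step. The following PyTorch sketch shows, under our own simplifying assumptions (plain convolutions only; the class and argument names are hypothetical), how such lists of sizes could define a CNN branch for the phylo-state tensor plus dense branches for auxiliary data and label outputs. It is not phyddle's actual model code:

```python
import torch
import torch.nn as nn

class PhyloStateNet(nn.Module):
    def __init__(self, num_rows, num_aux, num_labels,
                 phy_channels=(64, 96, 128), phy_kernels=(3, 5, 7),
                 aux_sizes=(128, 64), lbl_sizes=(128, 64)):
        super().__init__()
        # Convolutional branch: 1D convolutions scan across the tree-width
        # axis of the phylo-state tensor (rows = CPV+S entries, cols = taxa).
        layers, n_in = [], num_rows
        for n_out, k in zip(phy_channels, phy_kernels):
            layers += [nn.Conv1d(n_in, n_out, k, padding='same'), nn.ReLU()]
            n_in = n_out
        self.phy = nn.Sequential(*layers, nn.AdaptiveAvgPool1d(1), nn.Flatten())
        phy_out = n_in
        # Dense branch summarizes the auxiliary (summary statistic) input.
        layers, n_in = [], num_aux
        for n_out in aux_sizes:
            layers += [nn.Linear(n_in, n_out), nn.ReLU()]
            n_in = n_out
        self.aux = nn.Sequential(*layers)
        # Concatenated features feed dense layers that output label estimates.
        layers, n_in = [], phy_out + n_in
        for n_out in lbl_sizes:
            layers += [nn.Linear(n_in, n_out), nn.ReLU()]
            n_in = n_out
        self.head = nn.Sequential(*layers, nn.Linear(n_in, num_labels))

    def forward(self, phy, aux):
        return self.head(torch.cat([self.phy(phy), self.aux(aux)], dim=1))

# Toy usage: 7 CPV+S rows, tree width 500, 20 summary stats, 3 labels.
net = PhyloStateNet(num_rows=7, num_aux=20, num_labels=3)
out = net(torch.randn(8, 7, 500), torch.randn(8, 20))
print(out.shape)  # torch.Size([8, 3])
```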
References
EE Goldberg, LT Lancaster, RH Ree. 2011. Phylogenetic inference of reciprocal effects between geographic range evolution and diversification. Syst Biol 60:451-465. doi: https://doi.org/10.1093/sysbio/syr046
S Lambert, J Voznica, H Morlon. 2022. Deep learning from phylogenies for diversification analyses. bioRxiv 2022.09.27.509667. doi: https://doi.org/10.1101/2022.09.27.509667
MJ Landis, A Thompson. 2024. phyddle: software for phylogenetic model exploration with deep learning. bioRxiv 2024.08.06.606717. doi: https://doi.org/10.1101/2024.08.06.606717
Y Romano, E Patterson, E Candes. 2019. Conformalized quantile regression. Adv Neural Inf Process Syst 32.
A Thompson, B Liebeskind, EJ Scully, MJ Landis. 2023. Deep learning approaches to viral phylogeography are fast and as robust as likelihood methods to model misspecification. bioRxiv 2023.02.08.527714. doi: https://doi.org/10.1101/2023.02.08.527714
TG Vaughan, AJ Drummond. 2013. A stochastic simulator of birth–death master equations with application to phylodynamics. Mol Biol Evol 30:1480–1493. doi: https://doi.org/10.1093/molbev/mst057
J Voznica, A Zhukova, V Boskova, E Saulnier, F Lemoine, M Moslonka-Lefebvre, O Gascuel. 2022. Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks. Nat Commun 13:3896. doi: https://doi.org/10.1038/s41467-022-31511-0
About
Thanks for your interest in phyddle. The phyddle project emerged from a phylogenetic deep learning study led by Ammon Thompson (paper). The goal of phyddle is to provide users with a generalizable pipeline workflow for phylogenetic modeling and deep learning. This will hopefully make it easier for phylogenetic model enthusiasts and developers to explore and apply models that do not have tractable likelihood functions. It is also intended for use by methods developers who want to characterize how deep learning methods perform under different conditions for standard phylogenetic estimation tasks.
The phyddle project is developed by Michael Landis and Ammon Thompson.
Issues & Feedback
Please use Issues to report bugs or request features that require modifying the phyddle source code. Please contact Michael Landis for troubleshooting support with phyddle.