Configuration
Note
This section describes how to configure settings for a phyddle analysis. Visit Pipeline to learn more about how settings determine the behavior of a phyddle analysis. Visit Glossary to learn more about how phyddle defines different terms.
There are two ways to configure the settings of a phyddle analysis: through a config file or the command line. Command line settings outrank config file settings.
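The precedence rule can be pictured as a dictionary merge. The following is a minimal sketch, not phyddle's own code: the function name merge_settings and the use of None to mean "not given on the command line" are assumptions made for illustration.

```python
# Hypothetical illustration of setting precedence; not phyddle's actual code.
def merge_settings(file_args, cli_args):
    """Return settings where command-line values override config-file values."""
    merged = dict(file_args)          # start from the config-file settings
    for name, value in cli_args.items():
        if value is not None:         # assumption: None means "option not given"
            merged[name] = value      # command-line value wins
    return merged

file_args = {'step': 'SFTEP', 'num_epochs': 20, 'prefix': 'out'}
cli_args = {'step': 'TE', 'num_epochs': 50, 'prefix': None}

settings = merge_settings(file_args, cli_args)
print(settings)
```

Here step and num_epochs take their command-line values, while prefix keeps the value from the config file because it was not given on the command line.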
By file
The phyddle config file is a Python script that defines a dictionary of analysis arguments (args) that configure how phyddle pipeline steps behave. Because it’s a Python script, you can write code within the config file to specify your analysis, if you find that helpful. The example below groups settings into blocks based on which pipeline step first needs a given setting. However, any setting might be used by different pipeline steps, so we concatenate all settings into a single dictionary called args, which is then used by all pipeline steps. Settings configured by file can be adjusted through the command line, if desired.
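As a small illustration of writing code inside the config file, the fragment below computes a few settings with ordinary Python before placing them in args. The setting names are taken from the example config; the helper variable max_taxa and the choice to derive tree_width from it are assumptions for illustration only.

```python
# Sketch of a config.py fragment that computes settings with plain Python.
# Setting names come from the example config; the values are illustrative.
max_taxa = 500  # hypothetical helper variable, not a phyddle setting

args = {
    'step'         : 'SF',          # run Simulate and Format only
    'min_num_taxa' : 10,
    'max_num_taxa' : max_taxa,
    'tree_width'   : max_taxa,      # assumption: keep tensor width equal to max taxa
    'param_est'    : {              # build repeated entries with a comprehension
        f'log10_birth_{i}' : 'num' for i in (1, 2)
    },
}

print(args['param_est'])
```

Deriving related settings from one variable keeps them consistent when you change the value later.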
Note
By default, phyddle assumes you want to use the config file called config.py. Use a different config file by calling, e.g.
phyddle --cfg my_other_config.py
Note
phyddle maintains a number of example config files for different models and simulation methods. These are organized as project subdirectories within the ./workspace directory. For example, ./workspace/bisse_r/config.py simulates under a BiSSE model with the R simulation script ./workspace/bisse_r/sim_bisse.R.
#==============================================================================#
# Config: Example phyddle config file #
# Authors: Michael Landis and Ammon Thompson #
# Date: 230804 #
# Description: Simple BiSSE model #
#==============================================================================#
args = {
#-------------------------------#
# Basic #
#-------------------------------#
'step' : 'SFTEP', # Pipeline step(s) defined with
# (S)imulate, (F)ormat, (T)rain,
# (E)stimate, (P)lot, or (A)ll
'verbose' : 'T', # Verbose output to screen?
#-------------------------------#
# Analysis #
#-------------------------------#
'use_parallel' : 'T', # Use parallelization? (recommended)
'num_proc' : -2, # Number of cores for multiprocessing
# (-N for all but N)
'use_cuda' : 'T', # Use CUDA parallelization?
# (recommended; requires Nvidia GPU)
#-------------------------------#
# Workspace #
#-------------------------------#
'dir' : './', # Base directory for all step directories
'prefix' : 'out', # Prefix for all output unless step prefix given
#-------------------------------#
# Simulate #
#-------------------------------#
'sim_command' : f'Rscript ./sim_bisse.R', # Simulation command to run single
# job (see documentation)
'sim_logging' : 'verbose', # Simulation logging style
'start_idx' : 0, # Start index for simulated training replicates
'end_idx' : 1000, # End index for simulated training replicates
'sim_batch_size' : 10, # Number of replicates per simulation command
#-------------------------------#
# Format #
#-------------------------------#
'encode_all_sim' : 'T', # Encode all simulated replicates into tensor?
'num_char' : 1, # Number of characters
'num_states' : 2, # Number of states per character
'min_num_taxa' : 10, # Minimum number of taxa allowed when formatting
'max_num_taxa' : 500, # Maximum number of taxa allowed when formatting
'downsample_taxa' : 'uniform', # Downsampling strategy taxon count
'tree_width' : 500, # Width of phylo-state tensor
'tree_encode' : 'extant', # Encoding strategy for tree
'brlen_encode' : 'height_brlen', # Encoding strategy for branch lengths
'char_encode' : 'integer', # Encoding strategy for character data
'param_est' : { # Unknown model parameters to estimate
'log10_birth_1' : 'num',
'log10_birth_2' : 'num',
'log10_death' : 'num',
'log10_state_rate' : 'num',
'model_type' : 'cat',
'root_state' : 'cat'
},
'param_data' : { # Known model parameters to treat as aux. data
'sample_frac' : 'num'
},
'char_format' : 'csv', # File format for character data
'tensor_format' : 'hdf5', # File format for training example tensors
#-------------------------------#
# Train #
#-------------------------------#
'num_epochs' : 20, # Number of training epochs
'trn_batch_size' : 2048, # Training batch sizes
'prop_test' : 0.05, # Proportion of data used as test examples
# (to assess trained network performance)
'prop_val' : 0.05, # Proportion of data used as validation examples
# (to diagnose network overtraining)
'prop_cal' : 0.2, # Proportion of data used as calibration examples
# (to calibrate CPIs)
'cpi_coverage' : 0.95, # Expected coverage percent for calibrated
# prediction intervals (CPIs)
'cpi_asymmetric' : 'T', # Use asymmetric (True) or symmetric (False)
# adjustments for CPIs?
'loss' : 'mae', # Loss function for optimization
'optimizer' : 'adam', # Method used for optimizing neural network
'phy_channel_plain' : [64, 96, 128], # Output channel sizes for plain convolutional
# layers for phylogenetic state input
'phy_channel_stride' : [64, 96], # Output channel sizes for stride convolutional
# layers for phylogenetic state input
'phy_channel_dilate' : [32, 64], # Output channel sizes for dilate convolutional
# layers for phylogenetic state input
'aux_channel' : [128, 64, 32], # Output channel sizes for dense layers for
# auxiliary data input
'lbl_channel' : [128, 64, 32], # Output channel sizes for dense layers for
# label outputs
'phy_kernel_plain' : [3, 5, 7], # Kernel sizes for plain convolutional layers
# for phylogenetic state input
'phy_kernel_stride' : [7, 9], # Kernel sizes for stride convolutional layers
# for phylogenetic state input
'phy_kernel_dilate' : [3, 5], # Kernel sizes for dilate convolutional layers
# for phylogenetic state input
'phy_stride_stride' : [3, 6], # Stride sizes for stride convolutional layers
# for phylogenetic state input
'phy_dilate_dilate' : [3, 5], # Dilation sizes for dilate convolutional layers
# for phylogenetic state input
#-------------------------------#
# Estimate #
#-------------------------------#
# not currently used
#-------------------------------#
# Plot #
#-------------------------------#
'plot_train_color' : 'blue', # Plotting color for training data elements
'plot_label_color' : 'orange', # Plotting color for training label elements
'plot_test_color' : 'purple', # Plotting color for test data elements
'plot_val_color' : 'red', # Plotting color for validation data elements
'plot_aux_color' : 'green', # Plotting color for auxiliary data elements
'plot_emp_color' : 'black', # Plotting color for empirical elements
'plot_num_scatter' : 50, # Number of examples in scatter plot
'plot_min_emp' : 5, # Minimum number of empirical datasets to plot densities
'plot_num_emp' : 10 # Number of empirical results to plot
}
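The num_proc comment above says that a negative value -N means "all but N" cores. A minimal sketch of that arithmetic follows; the function name resolve_num_proc, the treatment of 0 as "all cores", and the floor of one process are assumptions rather than documented phyddle behavior.

```python
import os

def resolve_num_proc(num_proc, total_cores=None):
    """Translate a num_proc setting into a concrete process count (sketch)."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1   # os.cpu_count() may return None
    if num_proc <= 0:
        # -N means "all cores but N"; assumption: never go below one process
        return max(1, total_cores + num_proc)
    return num_proc                          # positive values are used as given

# On a hypothetical 8-core machine:
print(resolve_num_proc(-2, total_cores=8))
print(resolve_num_proc(4, total_cores=8))
```

With num_proc set to -2, as in the example config, this leaves two cores free for other work.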
Via command line
Settings applied through a config file can be overwritten by setting options when running phyddle from the command line. The names of settings are the same for the command line options and in the config file. Using command line options makes it easy to adjust the behavior of pipeline steps without needing to edit the config file. List all settings that can be adjusted with the command line using the --help option:
usage: phyddle [-h] [-c] [-s] [-v] [--make_cfg ] [--save_proj ] [--load_proj ]
[--clean_proj ] [--save_num_sim] [--save_train_fmt]
[--output_precision] [--use_parallel] [--use_cuda] [--num_proc]
[--no_emp] [--no_sim] [--dir] [--sim_dir] [--emp_dir]
[--fmt_dir] [--trn_dir] [--est_dir] [--plt_dir] [--log_dir]
[--prefix] [--sim_prefix] [--emp_prefix] [--fmt_prefix]
[--trn_prefix] [--est_prefix] [--plt_prefix] [--sim_command]
[--sim_logging {clean,compress,verbose}] [--start_idx]
[--end_idx] [--sim_more] [--sim_batch_size] [--encode_all_sim]
[--num_char] [--num_states] [--min_num_taxa] [--max_num_taxa]
[--downsample_taxa {uniform}] [--tree_width]
[--tree_encode {extant,serial}]
[--brlen_encode {height_only,height_brlen}]
[--char_encode {one_hot,integer,numeric}] [--param_est]
[--param_data] [--char_format {csv,nexus}]
[--tensor_format {csv,hdf5}] [--save_phyenc_csv] [--num_epochs]
[--num_early_stop] [--trn_batch_size] [--prop_test]
[--prop_val] [--prop_cal] [--cpi_coverage] [--cpi_asymmetric]
[--loss_numerical {mse,mae}] [--optimizer {adam}]
[--log_offset] [--phy_channel_plain] [--phy_channel_stride]
[--phy_channel_dilate] [--aux_channel] [--lbl_channel]
[--phy_kernel_plain] [--phy_kernel_stride]
[--phy_kernel_dilate] [--phy_stride_stride]
[--phy_dilate_dilate] [--plot_train_color] [--plot_test_color]
[--plot_val_color] [--plot_label_color] [--plot_aux_color]
[--plot_emp_color] [--plot_num_scatter] [--plot_min_emp]
[--plot_num_emp] [--plot_pca_noise]
Software to fiddle around with deep learning for phylogenetic models
options:
-h, --help show this help message and exit
-c, --cfg Config file name
-s, --step Pipeline step(s) defined with (S)imulate, (F)ormat,
(T)rain, (E)stimate, (P)lot, or (A)ll
-v, --verbose Verbose output to screen?
--make_cfg Write default config file
--save_proj Save and zip a project for sharing
--load_proj Unzip a shared project
--clean_proj Remove step directories for a project
--save_num_sim Number of simulated examples to save with --save_proj
--save_train_fmt Save formatted training examples with --save_proj?
(not recommended)
--output_precision Number of digits (precision) for numbers in output
files
--use_parallel Use parallelization? (recommended)
--use_cuda Use CUDA parallelization? (recommended; requires
Nvidia GPU)
--num_proc Number of cores for multiprocessing (-N for all but N)
--no_emp Disable Format/Estimate steps for empirical data?
--no_sim Disable Format/Estimate steps for simulated data?
--dir Parent directory for all step directories unless step
directory given
--sim_dir Directory for raw simulated data
--emp_dir Directory for raw empirical data
--fmt_dir Directory for tensor-formatted data
--trn_dir Directory for trained networks and training output
--est_dir Directory for new datasets and estimates
--plt_dir Directory for plotted results
--log_dir Directory for logs of analysis metadata
--prefix Prefix for all output unless step prefix given
--sim_prefix Prefix for raw simulated data
--emp_prefix Prefix for raw empirical data
--fmt_prefix Prefix for tensor-formatted data
--trn_prefix Prefix for trained networks and training output
--est_prefix Prefix for estimate results
--plt_prefix Prefix for plotted results
--sim_command Simulation command to run single job (see documentation)
--sim_logging {clean,compress,verbose}
Simulation logging style
--start_idx Start replicate index for simulated training dataset
--end_idx End replicate index for simulated training dataset
--sim_more Add more simulations with auto-generated indices
--sim_batch_size Number of replicates per simulation command
--encode_all_sim Encode all simulated replicates into tensor?
--num_char Number of characters
--num_states Number of states per character
--min_num_taxa Minimum number of taxa allowed when formatting
--max_num_taxa Maximum number of taxa allowed when formatting
--downsample_taxa {uniform}
Downsampling strategy taxon count
--tree_width Width of phylo-state tensor
--tree_encode {extant,serial}
Encoding strategy for tree
--brlen_encode {height_only,height_brlen}
Encoding strategy for branch lengths
--char_encode {one_hot,integer,numeric}
Encoding strategy for character data
--param_est Model parameters and variables to estimate
--param_data Model parameters and variables treated as data
--char_format {csv,nexus}
File format for character data
--tensor_format {csv,hdf5}
File format for training example tensors
--save_phyenc_csv Save encoded phylogenetic tensor encoding to csv?
--num_epochs Number of training epochs
--num_early_stop Number of consecutive validation loss gains before
early stopping
--trn_batch_size Training batch sizes
--prop_test Proportion of data used as test examples (assess
trained network performance)
--prop_val Proportion of data used as validation examples
(diagnose network overtraining)
--prop_cal Proportion of data used as calibration examples
(calibrate CPIs)
--cpi_coverage Expected coverage percent for calibrated prediction
intervals (CPIs)
--cpi_asymmetric Use asymmetric (True) or symmetric (False) adjustments
for CPIs?
--loss_numerical {mse,mae}
Loss function for real value estimates
--optimizer {adam} Method used for optimizing neural network
--log_offset Offset size c when taking ln(x+c) for zero-valued
variables
--phy_channel_plain Output channel sizes for plain convolutional layers
for phylogenetic state input
--phy_channel_stride
Output channel sizes for stride convolutional layers
for phylogenetic state input
--phy_channel_dilate
Output channel sizes for dilate convolutional layers
for phylogenetic state input
--aux_channel Output channel sizes for dense layers for auxiliary
data input
--lbl_channel Output channel sizes for dense layers for label
outputs
--phy_kernel_plain Kernel sizes for plain convolutional layers for
phylogenetic state input
--phy_kernel_stride Kernel sizes for stride convolutional layers for
phylogenetic state input
--phy_kernel_dilate Kernel sizes for dilate convolutional layers for
phylogenetic state input
--phy_stride_stride Stride sizes for stride convolutional layers for
phylogenetic state input
--phy_dilate_dilate Dilation sizes for dilate convolutional layers for
phylogenetic state input
--plot_train_color Plotting color for training data elements
--plot_test_color Plotting color for test data elements
--plot_val_color Plotting color for validation data elements
--plot_label_color Plotting color for label elements
--plot_aux_color Plotting color for auxiliary data elements
--plot_emp_color Plotting color for empirical elements
--plot_num_scatter Number of examples in scatter plot
--plot_min_emp Minimum number of empirical datasets to plot densities
--plot_num_emp Number of empirical results to plot
--plot_pca_noise Scale of Gaussian noise to add to PCA plot
Table summary
This section summarizes available settings in phyddle. The Setting column is the exact name of the string that appears in the configuration file and command-line argument list. The Step(s) column identifies all steps that use the setting: [S]imulate, [F]ormat, [T]rain, [E]stimate, and [P]lot. The Type column is the Python variable type expected for the setting. The Description column gives a brief description of what the setting does. Visit Pipeline to learn more about how phyddle settings impact different pipeline analysis steps.
Setting | Step(s) | Type | Description |
---|---|---|---|
cfg | ––––– | str | Config file name |
step | SFTEP | str | Pipeline step(s) defined with (S)imulate, (F)ormat, (T)rain, (E)stimate, (P)lot, or (A)ll |
verbose | SFTEP | str | Verbose output to screen? |
make_cfg | ––––– | str | Write default config file |
save_proj | ––––– | str | Save and zip a project for sharing |
load_proj | ––––– | str | Unzip a shared project |
clean_proj | ––––– | str | Remove step directories for a project |
save_num_sim | ––––– | int | Number of simulated examples to save with --save_proj |
save_train_fmt | ––––– | str | Save formatted training examples with --save_proj? (not recommended) |
output_precision | SFTEP | int | Number of digits (precision) for numbers in output files |
use_parallel | SF––– | str | Use parallelization? (recommended) |
use_cuda | ––TE– | str | Use CUDA parallelization? (recommended; requires Nvidia GPU) |
num_proc | SFT–– | int | Number of cores for multiprocessing (-N for all but N) |
no_emp | ––––– | –– | Disable Format/Estimate steps for empirical data? |
no_sim | ––––– | –– | Disable Format/Estimate steps for simulated data? |
dir | SFTEP | str | Parent directory for all step directories unless step directory given |
sim_dir | SF––– | str | Directory for raw simulated data |
emp_dir | SF––– | str | Directory for raw empirical data |
fmt_dir | –FTEP | str | Directory for tensor-formatted data |
trn_dir | –FTEP | str | Directory for trained networks and training output |
est_dir | ––TEP | str | Directory for new datasets and estimates |
plt_dir | ––––P | str | Directory for plotted results |
log_dir | SFTEP | str | Directory for logs of analysis metadata |
prefix | SFTEP | str | Prefix for all output unless step prefix given |
sim_prefix | SF––– | str | Prefix for raw simulated data |
emp_prefix | SF––– | str | Prefix for raw empirical data |
fmt_prefix | –FTEP | str | Prefix for tensor-formatted data |
trn_prefix | –FTEP | str | Prefix for trained networks and training output |
est_prefix | ––TEP | str | Prefix for estimate results |
plt_prefix | ––––P | str | Prefix for plotted results |
sim_command | S–––– | str | Simulation command to run single job (see documentation) |
sim_logging | S–––– | str | Simulation logging style |
start_idx | SF––– | int | Start replicate index for simulated training dataset |
end_idx | SF––– | int | End replicate index for simulated training dataset |
sim_more | S–––– | int | Add more simulations with auto-generated indices |
sim_batch_size | S–––– | int | Number of replicates per simulation command |
encode_all_sim | –F––– | str | Encode all simulated replicates into tensor? |
num_char | –FTE– | int | Number of characters |
num_states | –FTE– | int | Number of states per character |
min_num_taxa | –F––– | int | Minimum number of taxa allowed when formatting |
max_num_taxa | –F––– | int | Maximum number of taxa allowed when formatting |
downsample_taxa | –FTE– | str | Downsampling strategy taxon count |
tree_width | –FTEP | int | Width of phylo-state tensor |
tree_encode | –FTE– | str | Encoding strategy for tree |
brlen_encode | –FTE– | str | Encoding strategy for branch lengths |
char_encode | –FTE– | str | Encoding strategy for character data |
param_est | –FTE– | dict | Model parameters and variables to estimate |
param_data | –FTE– | dict | Model parameters and variables treated as data |
char_format | –FTE– | str | File format for character data |
tensor_format | –FTEP | str | File format for training example tensors |
save_phyenc_csv | –F––– | str | Save encoded phylogenetic tensor encoding to csv? |
num_epochs | ––TEP | int | Number of training epochs |
num_early_stop | ––TEP | int | Number of consecutive validation loss gains before early stopping |
trn_batch_size | ––TEP | int | Training batch sizes |
prop_test | –FT–– | float | Proportion of data used as test examples (assess trained network performance) |
prop_val | ––T–– | float | Proportion of data used as validation examples (diagnose network overtraining) |
prop_cal | ––T–– | float | Proportion of data used as calibration examples (calibrate CPIs) |
cpi_coverage | ––T–– | float | Expected coverage percent for calibrated prediction intervals (CPIs) |
cpi_asymmetric | ––T–– | str | Use asymmetric (True) or symmetric (False) adjustments for CPIs? |
loss_numerical | ––T–– | str | Loss function for real value estimates |
optimizer | ––T–– | str | Method used for optimizing neural network |
log_offset | –FTEP | float | Offset size c when taking ln(x+c) for zero-valued variables |
phy_channel_plain | ––T–– | int[] | Output channel sizes for plain convolutional layers for phylogenetic state input |
phy_channel_stride | ––T–– | int[] | Output channel sizes for stride convolutional layers for phylogenetic state input |
phy_channel_dilate | ––T–– | int[] | Output channel sizes for dilate convolutional layers for phylogenetic state input |
aux_channel | ––T–– | int[] | Output channel sizes for dense layers for auxiliary data input |
lbl_channel | ––T–– | int[] | Output channel sizes for dense layers for label outputs |
phy_kernel_plain | ––T–– | int[] | Kernel sizes for plain convolutional layers for phylogenetic state input |
phy_kernel_stride | ––T–– | int[] | Kernel sizes for stride convolutional layers for phylogenetic state input |
phy_kernel_dilate | ––T–– | int[] | Kernel sizes for dilate convolutional layers for phylogenetic state input |
phy_stride_stride | ––T–– | int[] | Stride sizes for stride convolutional layers for phylogenetic state input |
phy_dilate_dilate | ––T–– | int[] | Dilation sizes for dilate convolutional layers for phylogenetic state input |
plot_train_color | ––––P | str | Plotting color for training data elements |
plot_test_color | ––––P | str | Plotting color for test data elements |
plot_val_color | ––––P | str | Plotting color for validation data elements |
plot_label_color | ––––P | str | Plotting color for label elements |
plot_aux_color | ––––P | str | Plotting color for auxiliary data elements |
plot_emp_color | ––––P | str | Plotting color for empirical elements |
plot_num_scatter | ––––P | int | Number of examples in scatter plot |
plot_min_emp | ––––P | int | Minimum number of empirical datasets to plot densities |
plot_num_emp | ––––P | int | Number of empirical results to plot |
plot_pca_noise | ––––P | float | Scale of Gaussian noise to add to PCA plot |
Details
This section provides detailed descriptions for several settings that are not intuitive to specify, but very powerful when used correctly.
Step
The step setting controls which steps should be applied. Each pipeline step is represented by a capital letter: S for Simulate, F for Format, T for Train, E for Estimate, P for Plot, and A for all steps.
For example, the following two commands are equivalent:
phyddle --step A
phyddle -s SFTEP
whereas calling
phyddle -s SF
commands phyddle to perform the Simulate and Format steps, but not the Train, Estimate, or Plot steps.
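The step letters can be expanded into an ordered list of steps as sketched below. This is an illustration only: the function name expand_steps, the validation of unknown letters, and the reordering of letters into pipeline order are assumptions, not phyddle's documented behavior.

```python
PIPELINE_ORDER = 'SFTEP'  # Simulate, Format, Train, Estimate, Plot

def expand_steps(step):
    """Expand a step string such as 'A' or 'SF' into a list of step letters."""
    if 'A' in step:
        return list(PIPELINE_ORDER)           # 'A' stands for all steps
    unknown = set(step) - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f'Unknown step letter(s): {sorted(unknown)}')
    # assumption: steps run in pipeline order regardless of how they were typed
    return [s for s in PIPELINE_ORDER if s in step]

print(expand_steps('A'))
print(expand_steps('SF'))
```

Under these assumptions, 'A' and 'SFTEP' expand to the same list, which mirrors the equivalence of the two commands above.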
Step directories
A standard phyddle analysis assumes all work is stored within a single project directory. Work from each step, however, is stored into different subdirectories.
Customizing the input and output directories among steps allows users to quickly explore alternative pipeline designs while leaving previous pipeline results in place.
The project directory can be set using dir. During analysis, phyddle will create subdirectories for each step using default names, as needed. For example, if dir is set to the local directory ./, then a full phyddle analysis would use the following directories:
./simulate # default sim_dir
./empirical # default emp_dir
./format # default fmt_dir
./train # default trn_dir
./estimate # default est_dir
./plot # default plt_dir
./log # default log_dir
Individual step directories can be overridden with custom directory locations. For example, setting dir to ./ but setting emp_dir to /Users/mlandis/datasets/viburnum and plt_dir to /Users/mlandis/projects/viburnum/results would cause phyddle to use the following directories:
./simulate # default sim_dir
/Users/mlandis/datasets/viburnum # custom emp_dir
./format # default fmt_dir
./train # default trn_dir
./estimate # default est_dir
/Users/mlandis/projects/viburnum/results # custom plt_dir
./log # default log_dir
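The fallback from a step directory to the project directory can be sketched as follows. The default subdirectory names come from the listings above; the function name resolve_step_dirs and the simple string joining are assumptions for illustration and may differ from how phyddle builds paths internally.

```python
# Default subdirectory names, as listed above.
DEFAULT_SUBDIRS = {
    'sim_dir': 'simulate', 'emp_dir': 'empirical', 'fmt_dir': 'format',
    'trn_dir': 'train', 'est_dir': 'estimate', 'plt_dir': 'plot',
    'log_dir': 'log',
}

def resolve_step_dirs(settings):
    """Return the directory used by each step (hypothetical helper)."""
    base = settings.get('dir', './').rstrip('/')
    resolved = {}
    for name, subdir in DEFAULT_SUBDIRS.items():
        custom = settings.get(name)
        # a custom step directory wins; otherwise fall back to <dir>/<default name>
        resolved[name] = custom if custom else f'{base}/{subdir}'
    return resolved

dirs = resolve_step_dirs({
    'dir': './',
    'emp_dir': '/Users/mlandis/datasets/viburnum',
    'plt_dir': '/Users/mlandis/projects/viburnum/results',
})
for name, path in dirs.items():
    print(name, path)
```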
Step prefixes
Standard phyddle analyses assume that the files generated by each pipeline step begin with the filename prefix 'out'. The filename prefix for all pipeline steps can be changed using the prefix setting. Changing the filename prefix allows you to generate alternative pipeline filesets without overwriting previous results.
As with the pipeline directory settings (above), prefixes for individual pipeline steps can be overridden with custom prefixes. This allows you to compare pipeline performance using different settings, while saving previous work. For example,
# Load config.py and run only the Train and Estimate steps.
# Train and Estimate output gets the prefix 'new', while Format
# input is read with the prefix 'out'. Train for 50 epochs using
# batches of 4096 samples.
phyddle -c config.py \
        -s TE \
        --prefix new \
        --fmt_prefix out \
        --num_epochs 50 \
        --trn_batch_size 4096
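To make the effect of the command above concrete, the sketch below works out which prefix each step would use when --prefix new and --fmt_prefix out are given. The helper resolve_prefixes is hypothetical and only illustrates the fallback rule described in this section.

```python
STEP_PREFIXES = ['sim_prefix', 'emp_prefix', 'fmt_prefix',
                 'trn_prefix', 'est_prefix', 'plt_prefix']

def resolve_prefixes(settings):
    """Return the filename prefix used by each step (hypothetical helper)."""
    shared = settings.get('prefix', 'out')    # 'out' is the documented default
    # a step-specific prefix wins; otherwise the shared prefix is used
    return {name: settings.get(name) or shared for name in STEP_PREFIXES}

prefixes = resolve_prefixes({'prefix': 'new', 'fmt_prefix': 'out'})
for name, value in prefixes.items():
    print(name, value)
```

Only fmt_prefix keeps 'out', so Train reads the existing formatted data while all new Train and Estimate files are written with the prefix 'new'.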
no_sim and no_emp
By default, the Format and Estimate steps run in a greedy manner, against the simulated datasets identified by dir (or sim_dir) and prefix (or sim_prefix), and against the empirical datasets identified by dir (or emp_dir) and prefix (or emp_prefix), should those datasets exist.
Setting --no_sim during a command-line run will instruct phyddle to skip the Format and Estimate steps for the simulated datasets (i.e. the train and test datasets).
Setting --no_emp during a command-line run will instruct phyddle to skip the Format and Estimate steps for the empirical datasets.
The --no_sim flag is particularly useful when you only need to format new empirical datasets, but do not need to reformat existing simulated (i.e. training/test) datasets. The flag helps eliminate redundant formatting tasks during pipeline development.
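The selection logic described in this section can be summarized in a few lines. This is a minimal sketch under stated assumptions: the function name datasets_to_process and the boolean arguments sim_exists and emp_exists are hypothetical stand-ins for phyddle checking whether the datasets are present.

```python
def datasets_to_process(no_sim=False, no_emp=False,
                        sim_exists=True, emp_exists=True):
    """List the dataset types that Format and Estimate would handle (sketch)."""
    targets = []
    if sim_exists and not no_sim:     # greedy: use simulated data if present
        targets.append('simulated')
    if emp_exists and not no_emp:     # greedy: use empirical data if present
        targets.append('empirical')
    return targets

print(datasets_to_process())               # both dataset types
print(datasets_to_process(no_sim=True))    # empirical only
print(datasets_to_process(no_emp=True))    # simulated only
```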