Configuration

Note

This section describes how to configure settings for a phyddle analysis. Visit Pipeline to learn more about how settings determine the behavior of a phyddle analysis. Visit Glossary to learn more about how phyddle defines different terms.

There are two ways to configure the settings of a phyddle analysis: through a config file or the command line. Command line settings outrank config file settings.
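The precedence rule can be pictured as a dictionary merge. The sketch below is only an illustration of that rule, not phyddle's internal code; the function name `merge_settings` and the convention that an unset command line option is `None` are assumptions made for the example.

```python
# Hypothetical illustration of setting precedence; not phyddle's own code.
def merge_settings(file_args, cli_args):
    """Return settings where command line values override config file values."""
    merged = dict(file_args)
    for key, value in cli_args.items():
        # Assumption: options not given on the command line are None
        if value is not None:
            merged[key] = value
    return merged


file_args = {'step': 'SFTEP', 'num_epochs': 20, 'prefix': 'out'}
cli_args = {'step': 'TE', 'num_epochs': 50, 'prefix': None}

settings = merge_settings(file_args, cli_args)
print(settings)
```

Here `step` and `num_epochs` take their command line values, while `prefix` keeps the value from the config file because it was not given on the command line.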

By file

The phyddle config file is a Python script that defines a dictionary of analysis arguments (args) that configure how phyddle pipeline steps behave. Because it is a Python script, you can write code within the config file to specify your analysis, if you find that helpful. The example below groups settings into blocks based on which pipeline step first needs a given setting. However, any setting might be used by several pipeline steps, so all settings are stored in a single dictionary called args, which is then used by all pipeline steps. Settings configured by file can be adjusted through the command line, if desired.

Note

By default, phyddle assumes you want to use the config file called config.py. Use a different config file by calling, e.g. phyddle --cfg my_other_config.py

Note

phyddle maintains a number of example config files for different models and simulation methods. These are organized as project subdirectories within the ./workspace directory. For example, ./workspace/bisse_r/config.py simulates under a BiSSE model with the R simulation script ./workspace/bisse_r/sim_bisse.R.

#==============================================================================#
# Config:       Example phyddle config file                                    #
# Authors:      Michael Landis and Ammon Thompson                              #
# Date:         230804                                                         #
# Description:  Simple BiSSE model                                             #
#==============================================================================#

args = {
    #-------------------------------#
    # Basic                         #
    #-------------------------------#
    'step'               : 'SFTEP',        # Pipeline step(s) defined with
                                           #   (S)imulate, (F)ormat, (T)rain,
                                           #   (E)stimate, (P)lot, or (A)ll
    'verbose'            : 'T',            # Verbose output to screen?

    #-------------------------------#
    # Analysis                      #
    #-------------------------------#
    'use_parallel'       : 'T',            # Use parallelization? (recommended)
    'num_proc'           : -2,             # Number of cores for multiprocessing
                                           #   (-N for all but N)
    'use_cuda'           : 'T',            # Use CUDA parallelization?
                                           #   (recommended; requires Nvidia GPU)

    #-------------------------------#
    # Workspace                     #
    #-------------------------------#

    'dir'                : './',           # Base directory for all step directories
    'prefix'             : 'out',          # Prefix for all output unless step prefix given

    #-------------------------------#
    # Simulate                      #
    #-------------------------------#
    'sim_command'        : 'Rscript ./sim_bisse.R',   # Simulation command to run single
                                                      #   job (see documentation)
    'sim_logging'        : 'verbose',                 # Simulation logging style
    'start_idx'          : 0,                         # Start index for simulated training replicates
    'end_idx'            : 1000,                      # End index for simulated training replicates
    'sim_batch_size'     : 10,                        # Number of replicates per simulation command

    #-------------------------------#
    # Format                        #
    #-------------------------------#
    'encode_all_sim'     : 'T',                  # Encode all simulated replicates into tensor?
    'num_char'           : 1,                    # Number of characters
    'num_states'         : 2,                    # Number of states per character
    'min_num_taxa'       : 10,                   # Minimum number of taxa allowed when formatting
    'max_num_taxa'       : 500,                  # Maximum number of taxa allowed when formatting
    'downsample_taxa'    : 'uniform',            # Downsampling strategy taxon count
    'tree_width'         : 500,                  # Width of phylo-state tensor
    'tree_encode'        : 'extant',             # Encoding strategy for tree
    'brlen_encode'       : 'height_brlen',       # Encoding strategy for branch lengths
    'char_encode'        : 'integer',            # Encoding strategy for character data
    'param_est'          : {                     # Unknown model parameters to estimate
        'log10_birth_1'      : 'num',
        'log10_birth_2'      : 'num',
        'log10_death'        : 'num',
        'log10_state_rate'   : 'num',
        'model_type'         : 'cat',
        'root_state'         : 'cat'
    },
    'param_data'         : {                     # Known model parameters to treat as aux. data
        'sample_frac'        : 'num'
    },
    'char_format'        : 'csv',                # File format for character data
    'tensor_format'      : 'hdf5',               # File format for training example tensors

    #-------------------------------#
    # Train                         #
    #-------------------------------#
    'num_epochs'         : 20,                   # Number of training epochs
    'trn_batch_size'     : 2048,                 # Training batch sizes
    'prop_test'          : 0.05,                 # Proportion of data used as test examples
                                                 #     (to assess trained network performance)
    'prop_val'           : 0.05,                 # Proportion of data used as validation examples
                                                 #     (to diagnose network overtraining)
    'prop_cal'           : 0.2,                  # Proportion of data used as calibration examples
                                                 #     (to calibrate CPIs)
    'cpi_coverage'       : 0.95,                 # Expected coverage percent for calibrated
                                                 #     prediction intervals (CPIs)
    'cpi_asymmetric'     : 'T',                  # Use asymmetric (True) or symmetric (False)
                                                 #     adjustments for CPIs?
    'loss_numerical'     : 'mae',                # Loss function for real value estimates
    'optimizer'          : 'adam',               # Method used for optimizing neural network
    'phy_channel_plain'  : [64, 96, 128],        # Output channel sizes for plain convolutional
                                                 #     layers for phylogenetic state input
    'phy_channel_stride' : [64, 96],             # Output channel sizes for stride convolutional
                                                 #     layers for phylogenetic state input
    'phy_channel_dilate' : [32, 64],             # Output channel sizes for dilate convolutional
                                                 #     layers for phylogenetic state input
    'aux_channel'        : [128, 64, 32],        # Output channel sizes for dense layers for
                                                 #     auxiliary data input
    'lbl_channel'        : [128, 64, 32],        # Output channel sizes for dense layers for
                                                 #     label outputs
    'phy_kernel_plain'   : [3, 5, 7],            # Kernel sizes for plain convolutional layers
                                                 #     for phylogenetic state input
    'phy_kernel_stride'  : [7, 9],               # Kernel sizes for stride convolutional layers
                                                 #     for phylogenetic state input
    'phy_kernel_dilate'  : [3, 5],               # Kernel sizes for dilate convolutional layers
                                                 #     for phylogenetic state input
    'phy_stride_stride'  : [3, 6],               # Stride sizes for stride convolutional layers
                                                 #     for phylogenetic state input
    'phy_dilate_dilate'  : [3, 5],               # Dilation sizes for dilate convolutional layers
                                                 #     for phylogenetic state input

    #-------------------------------#
    # Estimate                      #
    #-------------------------------#
    # not currently used

    #-------------------------------#
    # Plot                          #
    #-------------------------------#
    'plot_train_color'   : 'blue',               # Plotting color for training data elements
    'plot_label_color'   : 'orange',             # Plotting color for training label elements
    'plot_test_color'    : 'purple',             # Plotting color for test data elements
    'plot_val_color'     : 'red',                # Plotting color for validation data elements
    'plot_aux_color'     : 'green',              # Plotting color for auxiliary data elements
    'plot_emp_color'     : 'black',              # Plotting color for empirical elements
    'plot_num_scatter'   : 50,                   # Number of examples in scatter plot
    'plot_min_emp'       : 5,                    # Minimum number of empirical datasets to plot densities
    'plot_num_emp'       : 10                    # Number of empirical results to plot
}
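Because the config file is ordinary Python, settings can be computed rather than typed out by hand. The following is a minimal sketch of that idea under stated assumptions: the helper variables `num_reps` and `max_taxa` are illustrative names, not phyddle settings, and the choice to keep `tree_width` equal to `max_num_taxa` simply mirrors the example values above rather than a documented requirement.

```python
# Sketch of a config file that derives settings with plain Python code.
# num_reps and max_taxa are hypothetical helper variables, not phyddle settings.
num_reps = 1000
max_taxa = 500

args = {
    'step'           : 'SFTEP',
    'start_idx'      : 0,
    'end_idx'        : num_reps,          # simulate replicates 0..num_reps
    'sim_batch_size' : num_reps // 100,   # derive batch size from replicate count
    'max_num_taxa'   : max_taxa,
    'tree_width'     : max_taxa,          # mirrors the example values above
    # Build repetitive entries with a dict comprehension
    'param_est'      : {f'log10_birth_{i}': 'num' for i in (1, 2)},
}
```

Only the final `args` dictionary matters to the pipeline; the helper variables exist to keep related settings consistent with one another.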

Via command line

Settings applied through a config file can be overridden by setting options when running phyddle from the command line. Setting names are the same on the command line and in the config file. Using command line options makes it easy to adjust the behavior of pipeline steps without needing to edit the config file. Use the --help option to list all settings that can be adjusted from the command line:

    usage: phyddle [-h] [-c] [-s] [-v] [--make_cfg ] [--save_proj ] [--load_proj ]
           [--clean_proj ] [--save_num_sim] [--save_train_fmt]
           [--output_precision] [--use_parallel] [--use_cuda] [--num_proc]
           [--no_emp] [--no_sim] [--dir] [--sim_dir] [--emp_dir]
           [--fmt_dir] [--trn_dir] [--est_dir] [--plt_dir] [--log_dir]
           [--prefix] [--sim_prefix] [--emp_prefix] [--fmt_prefix]
           [--trn_prefix] [--est_prefix] [--plt_prefix] [--sim_command]
           [--sim_logging {clean,compress,verbose}] [--start_idx]
           [--end_idx] [--sim_more] [--sim_batch_size] [--encode_all_sim]
           [--num_char] [--num_states] [--min_num_taxa] [--max_num_taxa]
           [--downsample_taxa {uniform}] [--tree_width]
           [--tree_encode {extant,serial}]
           [--brlen_encode {height_only,height_brlen}]
           [--char_encode {one_hot,integer,numeric}] [--param_est]
           [--param_data] [--char_format {csv,nexus}]
           [--tensor_format {csv,hdf5}] [--save_phyenc_csv] [--num_epochs]
           [--num_early_stop] [--trn_batch_size] [--prop_test]
           [--prop_val] [--prop_cal] [--cpi_coverage] [--cpi_asymmetric]
           [--loss_numerical {mse,mae}] [--optimizer {adam}]
           [--log_offset] [--phy_channel_plain] [--phy_channel_stride]
           [--phy_channel_dilate] [--aux_channel] [--lbl_channel]
           [--phy_kernel_plain] [--phy_kernel_stride]
           [--phy_kernel_dilate] [--phy_stride_stride]
           [--phy_dilate_dilate] [--plot_train_color] [--plot_test_color]
           [--plot_val_color] [--plot_label_color] [--plot_aux_color]
           [--plot_emp_color] [--plot_num_scatter] [--plot_min_emp]
           [--plot_num_emp] [--plot_pca_noise]

Software to fiddle around with deep learning for phylogenetic models

options:
  -h, --help            show this help message and exit
  -c, --cfg             Config file name
  -s, --step            Pipeline step(s) defined with (S)imulate, (F)ormat,
                        (T)rain, (E)stimate, (P)lot, or (A)ll
  -v, --verbose         Verbose output to screen?
  --make_cfg            Write default config file
  --save_proj           Save and zip a project for sharing
  --load_proj           Unzip a shared project
  --clean_proj          Remove step directories for a project
  --save_num_sim        Number of simulated examples to save with --save_proj
  --save_train_fmt      Save formatted training examples with --save_proj?
                        (not recommended)
  --output_precision    Number of digits (precision) for numbers in output
                        files
  --use_parallel        Use parallelization? (recommended)
  --use_cuda            Use CUDA parallelization? (recommended; requires
                        Nvidia GPU)
  --num_proc            Number of cores for multiprocessing (-N for all but N)
  --no_emp              Disable Format/Estimate steps for empirical data?
  --no_sim              Disable Format/Estimate steps for simulated data?
  --dir                 Parent directory for all step directories unless step
                        directory given
  --sim_dir             Directory for raw simulated data
  --emp_dir             Directory for raw empirical data
  --fmt_dir             Directory for tensor-formatted data
  --trn_dir             Directory for trained networks and training output
  --est_dir             Directory for new datasets and estimates
  --plt_dir             Directory for plotted results
  --log_dir             Directory for logs of analysis metadata
  --prefix              Prefix for all output unless step prefix given
  --sim_prefix          Prefix for raw simulated data
  --emp_prefix          Prefix for raw empirical data
  --fmt_prefix          Prefix for tensor-formatted data
  --trn_prefix          Prefix for trained networks and training output
  --est_prefix          Prefix for estimate results
  --plt_prefix          Prefix for plotted results
  --sim_command         Simulation command to run single job (see documentation)
  --sim_logging {clean,compress,verbose}
                        Simulation logging style
  --start_idx           Start replicate index for simulated training dataset
  --end_idx             End replicate index for simulated training dataset
  --sim_more            Add more simulations with auto-generated indices
  --sim_batch_size      Number of replicates per simulation command
  --encode_all_sim      Encode all simulated replicates into tensor?
  --num_char            Number of characters
  --num_states          Number of states per character
  --min_num_taxa        Minimum number of taxa allowed when formatting
  --max_num_taxa        Maximum number of taxa allowed when formatting
  --downsample_taxa {uniform}
                        Downsampling strategy taxon count
  --tree_width          Width of phylo-state tensor
  --tree_encode {extant,serial}
                        Encoding strategy for tree
  --brlen_encode {height_only,height_brlen}
                        Encoding strategy for branch lengths
  --char_encode {one_hot,integer,numeric}
                        Encoding strategy for character data
  --param_est           Model parameters and variables to estimate
  --param_data          Model parameters and variables treated as data
  --char_format {csv,nexus}
                        File format for character data
  --tensor_format {csv,hdf5}
                        File format for training example tensors
  --save_phyenc_csv     Save encoded phylogenetic tensor encoding to csv?
  --num_epochs          Number of training epochs
  --num_early_stop      Number of consecutive validation loss gains before
                        early stopping
  --trn_batch_size      Training batch sizes
  --prop_test           Proportion of data used as test examples (assess
                        trained network performance)
  --prop_val            Proportion of data used as validation examples
                        (diagnose network overtraining)
  --prop_cal            Proportion of data used as calibration examples
                        (calibrate CPIs)
  --cpi_coverage        Expected coverage percent for calibrated prediction
                        intervals (CPIs)
  --cpi_asymmetric      Use asymmetric (True) or symmetric (False) adjustments
                        for CPIs?
  --loss_numerical {mse,mae}
                        Loss function for real value estimates
  --optimizer {adam}    Method used for optimizing neural network
  --log_offset          Offset size c when taking ln(x+c) for zero-valued
                        variables
  --phy_channel_plain   Output channel sizes for plain convolutional layers
                        for phylogenetic state input
  --phy_channel_stride
                        Output channel sizes for stride convolutional layers
                        for phylogenetic state input
  --phy_channel_dilate
                        Output channel sizes for dilate convolutional layers
                        for phylogenetic state input
  --aux_channel         Output channel sizes for dense layers for auxiliary
                        data input
  --lbl_channel         Output channel sizes for dense layers for label
                        outputs
  --phy_kernel_plain    Kernel sizes for plain convolutional layers for
                        phylogenetic state input
  --phy_kernel_stride   Kernel sizes for stride convolutional layers for
                        phylogenetic state input
  --phy_kernel_dilate   Kernel sizes for dilate convolutional layers for
                        phylogenetic state input
  --phy_stride_stride   Stride sizes for stride convolutional layers for
                        phylogenetic state input
  --phy_dilate_dilate   Dilation sizes for dilate convolutional layers for
                        phylogenetic state input
  --plot_train_color    Plotting color for training data elements
  --plot_test_color     Plotting color for test data elements
  --plot_val_color      Plotting color for validation data elements
  --plot_label_color    Plotting color for label elements
  --plot_aux_color      Plotting color for auxiliary data elements
  --plot_emp_color      Plotting color for empirical elements
  --plot_num_scatter    Number of examples in scatter plot
  --plot_min_emp        Minimum number of empirical datasets to plot densities
  --plot_num_emp        Number of empirical results to plot
  --plot_pca_noise      Scale of Gaussian noise to add to PCA plot

Table summary

This section summarizes available settings in phyddle. The Setting column is the exact name of the string that appears in the configuration file and command-line argument list. The Step(s) column identifies all steps that use the setting: [S]imulate, [F]ormat, [T]rain, [E]stimate, and [P]lot. The Type column is the Python variable type expected for the setting. The Description column gives a brief description of what the setting does. Visit Pipeline to learn more about how phyddle settings impact different pipeline analysis steps.

phyddle settings
Setting	Step(s)	Type	Description
`cfg`	–––––	str	Config file name
`step`	SFTEP	str	Pipeline step(s) defined with (S)imulate, (F)ormat, (T)rain, (E)stimate, (P)lot, or (A)ll
`verbose`	SFTEP	str	Verbose output to screen?
`make_cfg`	–––––	str	Write default config file
`save_proj`	–––––	str	Save and zip a project for sharing
`load_proj`	–––––	str	Unzip a shared project
`clean_proj`	–––––	str	Remove step directories for a project
`save_num_sim`	–––––	int	Number of simulated examples to save with --save_proj
`save_train_fmt`	–––––	str	Save formatted training examples with --save_proj? (not recommended)
`output_precision`	SFTEP	int	Number of digits (precision) for numbers in output files
`use_parallel`	SF–––	str	Use parallelization? (recommended)
`use_cuda`	––TE–	str	Use CUDA parallelization? (recommended; requires Nvidia GPU)
`num_proc`	SFT––	int	Number of cores for multiprocessing (-N for all but N)
`no_emp`	–––––	––	Disable Format/Estimate steps for empirical data?
`no_sim`	–––––	––	Disable Format/Estimate steps for simulated data?
`dir`	SFTEP	str	Parent directory for all step directories unless step directory given
`sim_dir`	SF–––	str	Directory for raw simulated data
`emp_dir`	SF–––	str	Directory for raw empirical data
`fmt_dir`	–FTEP	str	Directory for tensor-formatted data
`trn_dir`	–FTEP	str	Directory for trained networks and training output
`est_dir`	––TEP	str	Directory for new datasets and estimates
`plt_dir`	––––P	str	Directory for plotted results
`log_dir`	SFTEP	str	Directory for logs of analysis metadata
`prefix`	SFTEP	str	Prefix for all output unless step prefix given
`sim_prefix`	SF–––	str	Prefix for raw simulated data
`emp_prefix`	SF–––	str	Prefix for raw empirical data
`fmt_prefix`	–FTEP	str	Prefix for tensor-formatted data
`trn_prefix`	–FTEP	str	Prefix for trained networks and training output
`est_prefix`	––TEP	str	Prefix for estimate results
`plt_prefix`	––––P	str	Prefix for plotted results
`sim_command`	S––––	str	Simulation command to run single job (see documentation)
`sim_logging`	S––––	str	Simulation logging style
`start_idx`	SF–––	int	Start replicate index for simulated training dataset
`end_idx`	SF–––	int	End replicate index for simulated training dataset
`sim_more`	S––––	int	Add more simulations with auto-generated indices
`sim_batch_size`	S––––	int	Number of replicates per simulation command
`encode_all_sim`	–F–––	str	Encode all simulated replicates into tensor?
`num_char`	–FTE–	int	Number of characters
`num_states`	–FTE–	int	Number of states per character
`min_num_taxa`	–F–––	int	Minimum number of taxa allowed when formatting
`max_num_taxa`	–F–––	int	Maximum number of taxa allowed when formatting
`downsample_taxa`	–FTE–	str	Downsampling strategy taxon count
`tree_width`	–FTEP	int	Width of phylo-state tensor
`tree_encode`	–FTE–	str	Encoding strategy for tree
`brlen_encode`	–FTE–	str	Encoding strategy for branch lengths
`char_encode`	–FTE–	str	Encoding strategy for character data
`param_est`	–FTE–	dict	Model parameters and variables to estimate
`param_data`	–FTE–	dict	Model parameters and variables treated as data
`char_format`	–FTE–	str	File format for character data
`tensor_format`	–FTEP	str	File format for training example tensors
`save_phyenc_csv`	–F–––	str	Save encoded phylogenetic tensor encoding to csv?
`num_epochs`	––TEP	int	Number of training epochs
`num_early_stop`	––TEP	int	Number of consecutive validation loss gains before early stopping
`trn_batch_size`	––TEP	int	Training batch sizes
`prop_test`	–FT––	float	Proportion of data used as test examples (assess trained network performance)
`prop_val`	––T––	float	Proportion of data used as validation examples (diagnose network overtraining)
`prop_cal`	––T––	float	Proportion of data used as calibration examples (calibrate CPIs)
`cpi_coverage`	––T––	float	Expected coverage percent for calibrated prediction intervals (CPIs)
`cpi_asymmetric`	––T––	str	Use asymmetric (True) or symmetric (False) adjustments for CPIs?
`loss_numerical`	––T––	str	Loss function for real value estimates
`optimizer`	––T––	str	Method used for optimizing neural network
`log_offset`	–FTEP	float	Offset size c when taking ln(x+c) for zero-valued variables
`phy_channel_plain`	––T––	int[]	Output channel sizes for plain convolutional layers for phylogenetic state input
`phy_channel_stride`	––T––	int[]	Output channel sizes for stride convolutional layers for phylogenetic state input
`phy_channel_dilate`	––T––	int[]	Output channel sizes for dilate convolutional layers for phylogenetic state input
`aux_channel`	––T––	int[]	Output channel sizes for dense layers for auxiliary data input
`lbl_channel`	––T––	int[]	Output channel sizes for dense layers for label outputs
`phy_kernel_plain`	––T––	int[]	Kernel sizes for plain convolutional layers for phylogenetic state input
`phy_kernel_stride`	––T––	int[]	Kernel sizes for stride convolutional layers for phylogenetic state input
`phy_kernel_dilate`	––T––	int[]	Kernel sizes for dilate convolutional layers for phylogenetic state input
`phy_stride_stride`	––T––	int[]	Stride sizes for stride convolutional layers for phylogenetic state input
`phy_dilate_dilate`	––T––	int[]	Dilation sizes for dilate convolutional layers for phylogenetic state input
`plot_train_color`	––––P	str	Plotting color for training data elements
`plot_test_color`	––––P	str	Plotting color for test data elements
`plot_val_color`	––––P	str	Plotting color for validation data elements
`plot_label_color`	––––P	str	Plotting color for label elements
`plot_aux_color`	––––P	str	Plotting color for auxiliary data elements
`plot_emp_color`	––––P	str	Plotting color for empirical elements
`plot_num_scatter`	––––P	int	Number of examples in scatter plot
`plot_min_emp`	––––P	int	Minimum number of empirical datasets to plot densities
`plot_num_emp`	––––P	int	Number of empirical results to plot
`plot_pca_noise`	––––P	float	Scale of Gaussian noise to add to PCA plot

Details

This section provides detailed descriptions for several settings that are not intuitive to specify, but very powerful when used correctly.

Step

The step setting controls which steps should be applied. Each pipeline step is represented by a capital letter: S for Simulate, F for Format, T for Train, E for Estimate, P for Plot, and A for all steps.

For example, the following two commands are equivalent:

phyddle --step A
phyddle -s SFTEP

whereas calling

phyddle -s SF

commands phyddle to perform the Simulate and Format steps, but not the Train, Estimate, or Plot steps.
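The mapping from a step string to the steps that run can be sketched as follows. This is an illustration only, not phyddle's parser: the function name `expand_steps` is hypothetical, and the sketch assumes that steps always run in pipeline order regardless of the order in which the letters are typed.

```python
# Hypothetical sketch of how a step string maps to pipeline steps.
ALL_STEPS = 'SFTEP'


def expand_steps(step):
    """Return the ordered list of step letters implied by a step string."""
    if 'A' in step:
        return list(ALL_STEPS)
    unknown = [s for s in step if s not in ALL_STEPS]
    if unknown:
        raise ValueError(f'unknown step letter(s): {unknown}')
    # Assumption: steps run in pipeline order, whatever order was typed
    return [s for s in ALL_STEPS if s in step]


all_steps = expand_steps('A')
two_steps = expand_steps('SF')
print(all_steps, two_steps)
```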

Step directories

A standard phyddle analysis assumes all work is stored within a single project directory. Work from each step, however, is stored in its own subdirectory.

Customizing the input and output directories among steps allows users to quickly explore alternative pipeline designs while leaving previous pipeline results in place.

The project directory can be set using dir. During analysis, phyddle will create subdirectories for each step using default names, as needed. For example, if dir is set to the local directory ./, then a full phyddle analysis would use the following directories for the analysis:

./simulate        # default sim_dir
./empirical       # default emp_dir
./format          # default fmt_dir
./train           # default trn_dir
./estimate        # default est_dir
./plot            # default plt_dir
./log             # default log_dir

Individual step directories can be overridden with custom directory locations. For example, setting dir to ./ but setting emp_dir to /Users/mlandis/datasets/viburnum and plt_dir to /Users/mlandis/projects/viburnum/results would cause phyddle to use the following directories:

./simulate                                # default sim_dir
/Users/mlandis/datasets/viburnum          # custom emp_dir
./format                                  # default fmt_dir
./train                                   # default trn_dir
./estimate                                # default est_dir
/Users/mlandis/projects/viburnum/results  # custom plt_dir
./log                                     # default log_dir
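The fallback from a step directory to a default under the project directory can be sketched in a few lines. The function `resolve_dirs` below is a hypothetical helper written for this illustration, not part of phyddle; the default subdirectory names are taken from the listing above.

```python
import os

# Default subdirectory names, as listed above
DEFAULT_SUBDIRS = {
    'sim_dir': 'simulate',
    'emp_dir': 'empirical',
    'fmt_dir': 'format',
    'trn_dir': 'train',
    'est_dir': 'estimate',
    'plt_dir': 'plot',
    'log_dir': 'log',
}


def resolve_dirs(settings):
    """Use a custom step directory if given, else a default under 'dir'."""
    base = settings.get('dir', './')
    return {
        key: settings.get(key) or os.path.join(base, subdir)
        for key, subdir in DEFAULT_SUBDIRS.items()
    }


dirs = resolve_dirs({
    'dir': './',
    'emp_dir': '/Users/mlandis/datasets/viburnum',
    'plt_dir': '/Users/mlandis/projects/viburnum/results',
})
for name, path in dirs.items():
    print(name, path)
```

Only `emp_dir` and `plt_dir` take their custom locations; every other step directory falls back to its default name under `dir`.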

Step prefixes

Standard phyddle analyses assume that the files generated by each pipeline step begin with the filename prefix 'out'.

The filename prefix for all pipeline steps can be changed using the prefix setting. Changing the filename prefix allows you to generate alternative pipeline filesets without overwriting previous results.

As with the pipeline directory settings (above), prefixes for individual pipeline steps can be overridden with custom prefixes. This allows you to compare pipeline performance using different settings, while saving previous work. For example,

# Load config, run the Train and Estimate steps, write T & E output with
# prefix 'new', read Format input with prefix 'out', train for 50 epochs,
# and use batch sizes of 4096 samples
phyddle -c config.py \
        -s TE \
        --prefix new \
        --fmt_prefix out \
        --num_epochs 50 \
        --trn_batch_size 4096
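Prefix resolution follows the same fallback pattern as step directories. The sketch below is an assumption-laden illustration rather than phyddle's code: `resolve_prefixes` is a hypothetical helper, and the step prefix names are those listed in the settings table.

```python
# Step prefix settings, as named in the settings table
STEP_PREFIXES = ['sim_prefix', 'emp_prefix', 'fmt_prefix',
                 'trn_prefix', 'est_prefix', 'plt_prefix']


def resolve_prefixes(settings):
    """Use a step prefix if given, else fall back to the general prefix."""
    base = settings.get('prefix', 'out')
    return {key: settings.get(key) or base for key in STEP_PREFIXES}


# Same overrides as the command above: --prefix new --fmt_prefix out
prefixes = resolve_prefixes({'prefix': 'new', 'fmt_prefix': 'out'})
print(prefixes['fmt_prefix'], prefixes['trn_prefix'], prefixes['est_prefix'])
```

With these overrides the Train and Estimate steps read formatted input with the prefix 'out' and write their own output with the prefix 'new', so earlier results are left in place.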

`no_sim` and `no_emp`

By default the Format and Estimate steps run in a greedy manner, against the simulated datasets identified by dir (or sim_dir) and prefix (or sim_prefix), and against the empirical datasets identified by dir (or emp_dir) and prefix (or emp_prefix), should those datasets exist.

Setting --no_sim during a command-line run will instruct phyddle to skip the Format and Estimate steps for the simulated datasets (i.e. the train and test datasets).

Setting --no_emp during a command-line run will instruct phyddle to skip the Format and Estimate steps for the empirical datasets.

The --no_sim flag is particularly useful when you only need to format new empirical datasets, but do not need to reformat existing simulated (i.e. training/test) datasets. The flag helps eliminate redundant formatting tasks during pipeline development.
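The interaction between the two flags and the available datasets can be summarized with a small sketch. The function `datasets_to_process` is hypothetical and only restates the behavior described above; it is not phyddle's implementation.

```python
# Hypothetical summary of which datasets Format/Estimate would process.
def datasets_to_process(sim_exists, emp_exists, no_sim=False, no_emp=False):
    """Return the dataset kinds that are present and not disabled."""
    targets = []
    if sim_exists and not no_sim:
        targets.append('simulated')
    if emp_exists and not no_emp:
        targets.append('empirical')
    return targets


default_run = datasets_to_process(sim_exists=True, emp_exists=True)
emp_only = datasets_to_process(sim_exists=True, emp_exists=True, no_sim=True)
print(default_run, emp_only)
```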
