Workspace
This section describes how phyddle organizes files and directories in its workspace. Visit Formats to learn more about file formats. Visit Configuration to learn more about managing directories and projects within a workspace.
Note
We strongly recommend that beginning users follow this general workspace filesystem design. Advanced users should find it is easy to customize locations for config files, simulation scripts, and output directories.
We recommend using the default directory structure for new projects to simplify project management. By default, dir is set to ./ and prefix is set to out. Output for each step is then stored in a subdirectory named after the step, in files beginning with the appropriate prefix. For example, results from Simulate are stored in the simulate subdirectory of the dir directory. If the sim_dir setting is provided, Simulate results are stored in that exact directory instead. For example, if dir is set to the local directory ./, then Simulate results are saved to ./simulate. If sim_dir is set to ../new_project/new_simulations, then Simulate results are stored there regardless of the dir setting.

Similarly, if prefix is set to out, then all simulated datasets begin with the prefix out. If the setting sim_prefix is set to sim, then files generated by Simulate begin with the prefix sim.
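For illustration, the directory and prefix settings described above might appear in a project's config.py as follows. This is a hedged sketch: the setting names (dir, prefix, sim_dir, sim_prefix) come from the text above, but the enclosing args dictionary is an assumption about the config file layout, so consult your generated config.py and the Configuration page for the exact form.

```python
# Hypothetical fragment of a project's config.py.
# Setting names are from the documentation above; the enclosing
# `args` dict is an assumption about the config file structure.
args = {
    'dir'        : './',    # base directory for all pipeline steps
    'prefix'     : 'out',   # default filename prefix for all steps
    # Optional step-specific overrides:
    'sim_dir'    : '../new_project/new_simulations',  # Simulate output goes here, ignoring 'dir'
    'sim_prefix' : 'sim',   # Simulate files begin with 'sim' instead of 'out'
}
```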
Briefly, the workspace directory of a typical phyddle project contains two important files:
* config.py : specifies default settings for phyddle analyses in this project
* sim_one.R (or a similar name) : defines a valid simulation script
and seven subdirectories for the pipeline analysis:
* simulate : contains raw data generated by simulation
* format : contains data formatted into tensors for training networks
* train : contains trained networks and diagnostics
* estimate : contains new test datasets and their estimates
* plot : contains figures of training and validation procedures
* empirical : contains raw data for your empirical analysis
* log : contains runtime logs for a phyddle project
This section assumes all steps use the example project bundled with phyddle, which was generated using the command phyddle --make_cfg and run with:
phyddle -c ./workspace/example/config.py --end_idx 25000
This corresponds to a 3-region equal-rates GeoSSE model. All directories have
the complete file set, except ./simulate, which contains only 20 original examples.
A standard configuration for a project named example
would store pipeline
work into these directories:
./simulate # output of Simulate step
./empirical # your empirical dataset
./format # output of Format step
./train # output of Train step
./estimate # output of Estimate step
./plot # output of Plot step
./log # logs for phyddle analyses
Below, we give an overview of the standard files and formats corresponding to each pipeline directory. First, we describe commands that help with workspace management.
You can easily save and share your project workspace with the following command:
cd workspace/example # current directory is root of
# example project directory
phyddle --save_proj example_lite.tar.gz # save project workspace, but
# skip simulated training data
# (faster, smaller)
The resulting tarball will contain the config file, the simulation script, and all workspace directories for the pipeline steps. Note that the raw and formatted simulated training example datasets tend to be very large and require substantial time and storage to archive, so they are not fully saved by default. To save all workspace project data, add the following options:
phyddle --save_proj example_full.tar.gz \ # save full project, and
--save_train_fmt T \ # include simulated training data
--save_num_sim 1000000 # (slower, larger)
If you share the project with a collaborator or save it on a server, you can load the project for use with the command:
mkdir -p ~/new_workspace/new_project # create new project directory
cd ~/new_workspace/new_project # enter new project directory
phyddle --load_proj example_lite.tar.gz # load project in directory
phyddle -s S --sim_more 10000 # (e.g.) simulate 10,000 training examples
Lastly, you can quickly remove all existing workspace directories, while preserving the config file and simulation scripts, with the following command:
cd workspace/example # enter directory to clean
phyddle --clean_proj # remove all local workspace directories
These are powerful commands, so be careful when using them. They can remove or overwrite files that you want to keep. Master these commands in a safe test directory before applying them to important workspace projects.
simulate
The Simulate step generates raw data under the simulating model; this raw data cannot yet be fed directly to the neural network for training. A typical simulation produces the following files:
./sim.0.tre # tree file
./sim.0.dat.csv # data file
./sim.0.labels.csv # data-generating params
Each tree file contains a simple Newick string. Each data file contains state
data either in Nexus format (.dat.nex) or simple comma-separated value format
(.dat.csv) depending on the setting for char_format
.
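The per-replicate files are small and easy to inspect by hand or in a few lines of Python. The sketch below uses toy inline data standing in for sim.0.tre and sim.0.labels.csv (the Newick string and parameter names here are hypothetical, for illustration only):

```python
import csv
import io

# A toy Newick string standing in for the contents of sim.0.tre
# (hypothetical data, for illustration only).
newick = "((A:1.0,B:1.0):0.5,C:1.5);"
# A quick taxon count for a Newick string: leaves = commas + 1.
n_taxa = newick.count(",") + 1
print(n_taxa)  # 3

# Toy contents standing in for sim.0.labels.csv: one row of
# data-generating parameters under a header (hypothetical values).
labels_csv = "birth_rate,death_rate\n0.5,0.1\n"
row = next(csv.DictReader(io.StringIO(labels_csv)))
print(row["birth_rate"])  # 0.5
```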
format
Applying Format to a directory of simulated datasets will output
tensors containing the entire set of training examples, stored to, e.g.
./format
. If the tensor_format
setting is 'csv'
(Comma-Separated Value, or CSV format), the formatted files are:
./out.empirical.phy_data.csv
./out.empirical.aux_data.csv
./out.empirical.labels.csv
./out.test.phy_data.csv
./out.test.aux_data.csv
./out.test.labels.csv
./out.train.phy_data.csv
./out.train.aux_data.csv
./out.train.labels.csv
where the phy_data.csv files contain one flattened Compact Phylogenetic Vector + States (CPV+S) entry per row, the aux_data.csv files contain one vector of auxiliary data (summary statistics and known parameters) per row, and the labels.csv files contain one vector of labels (estimated parameters) per row. Each row across the three CSV files corresponds to a single, matched simulated training example. All files are stored in standard comma-separated value format, making them easy to read with standard CSV-reading functions.
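Because rows are matched across the three CSV files, reading one training example means pairing the i-th row of each file. A minimal sketch with Python's csv module, using toy inline data in place of the real out.train.* files (hypothetical values; real files have many more columns and rows):

```python
import csv
import io

# Toy stand-ins for out.train.phy_data.csv, out.train.aux_data.csv,
# and out.train.labels.csv (hypothetical values).
phy_data = "0.1,0.2,0.3\n0.4,0.5,0.6\n"
aux_data = "10,2.5\n12,3.1\n"
labels   = "0.5,0.1\n0.7,0.2\n"

examples = []
for phy_row, aux_row, lab_row in zip(
        csv.reader(io.StringIO(phy_data)),
        csv.reader(io.StringIO(aux_data)),
        csv.reader(io.StringIO(labels))):
    # Row i in each file describes the same simulated example.
    examples.append({
        "phy": [float(x) for x in phy_row],
        "aux": [float(x) for x in aux_row],
        "lab": [float(x) for x in lab_row],
    })

print(len(examples))       # 2
print(examples[0]["lab"])  # [0.5, 0.1]
```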
If the tensor_format
setting is 'hdf5'
, the resulting files are:
./out.test.hdf5
./out.train.hdf5
./out.empirical.hdf5
where each HDF5 file contains all phylogenetic-state (CPV+S) data, auxiliary data, and label data as three internal datasets that share the same example ordering. HDF5 format is not as easily readable as CSV format. However, phyddle uses gzip to automatically (de)compress records, which often produces files over twenty times smaller than the equivalent uncompressed CSV tensors.
train
Training a network creates the following files in the workspace/example/train
directory:
./out.cpi_adjustments.csv
./out.train_aux_data_norm.csv
./out.train_est.labels.csv
./out.train_history.csv
./out.train_label_est_nocalib.csv
./out.train_label_norm.csv
./out.train_true.labels.csv
./out.trained_model.pkl
Descriptions of the files are as follows, with the prefix omitted for brevity:
* trained_model.pkl
: a saved file containing the trained PyTorch model
* train_label_norm.csv
and train_aux_data_norm.csv
: the location-scale values from the training dataset to (de)normalize the labels and auxiliary data from any dataset
* train_true.labels.csv
: the true values of labels for the training and test datasets, where columns correspond to estimated labels (e.g. model parameters)
* train_est.labels.csv
: the trained network estimates of labels for the training and test datasets, with calibrated prediction intervals, where columns correspond to point estimates and estimates for lower CPI and upper CPI bounds for each named label (e.g. model parameter)
* train_label_est_nocalib.csv
: the trained network estimates of labels for the training and test datasets, with uncalibrated prediction intervals
* train_history.csv
: the metrics across training epochs monitored during network training
* cpi_adjustments.csv
: calibrated prediction interval adjustments, where columns correspond to parameters, the first row contains lower bound adjustments, and the second row contains upper bound adjustments
estimate
The Estimate step loads the empirical and simulated test datasets generated by the Format step, then makes new predictions using the network trained during the Train step. Estimation produces the following files, provided the formatted input datasets can be opened from the filesystem:
./out.empirical_est.labels.csv # output: estimated labels for empirical data
./out.test_est.labels.csv # output: estimated labels for test data
./out.test_true.labels.csv # output: true labels for test data
The out.empirical_est.labels.csv
and out.test_est.labels.csv
files
report the point estimates and lower and upper calibrated prediction
intervals (CPIs) for all parameters targeted by the param_est
setting.
Estimates for parameters appear across columns, where columns are grouped
first by label (e.g. parameter) and then statistic (e.g. value, lower-bound,
upper-bound). For example:
$ cat out.empirical_est.labels.csv
w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper,d_0_1_value,d_0_1_lower,d_0_1_upper,b_0_1_value,b_0_1_lower,b_0_1_upper
0.2867125345651129,0.1937433853918723,0.45733220552078013,0.02445545359384659,0.002880695707341881,0.10404499205878459,0.4502031713887769,0.1966340488593367,0.5147956690178682,0.06199703190510973,0.0015074254823161301,0.27544015163806645
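The column layout above can be unpacked programmatically. The small sketch below groups the header from the example output by label and statistic; the header string is copied from the example above, but the parsing logic is an illustration, not part of phyddle:

```python
import csv
import io

# Header copied from the example out.empirical_est.labels.csv above.
header = ("w_0_value,w_0_lower,w_0_upper,"
          "e_0_value,e_0_lower,e_0_upper,"
          "d_0_1_value,d_0_1_lower,d_0_1_upper,"
          "b_0_1_value,b_0_1_lower,b_0_1_upper")

columns = next(csv.reader(io.StringIO(header)))

# Split each column name into (label, statistic), e.g.
# 'd_0_1_lower' -> ('d_0_1', 'lower'). The statistic is the last
# underscore-separated token: value, lower, or upper.
grouped = {}
for name in columns:
    label, stat = name.rsplit("_", 1)
    grouped.setdefault(label, []).append(stat)

print(sorted(grouped))   # ['b_0_1', 'd_0_1', 'e_0', 'w_0']
print(grouped["w_0"])    # ['value', 'lower', 'upper']
```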
The test_est.labels.csv and test_true.labels.csv files contain estimated and true label values for the simulated test dataset that was left aside during training. It is crucial that estimation accuracy against the test dataset is not used to inform the training process. If you view the test results and use them to modify Train settings, you should first randomly re-sample the training and test datasets in the Format step. This helps prevent overfitting and ensures the test dataset remains truly independent of the training procedure.
plot
The Plot step generates visualizations for results previously generated by Format, Train, and (when available) Estimate.
./est_CPI.pdf # results from Estimate step
./density_labels.pdf # label densities from Simulate/Format steps
./density_aux_data.pdf # aux. data densities from Simulate/Format steps
./pca_contour_labels.pdf # label PCA of Simulate/Format steps
./pca_contour_aux_data.pdf # aux. data PCA of Simulate/Format steps
./estimate_test_{label}.pdf # estimation accuracy on test dataset
./estimate_train_{label}.pdf # estimation accuracy on train dataset
./history.pdf # training history for entire network
./network_architecture.pdf # neural network architecture
./summary.pdf # compiled report with all figures
./summary.csv # compiled text file with numerical results
empirical
The empirical
directory is used to store raw data for empirical analyses.
The network from Train is only trained to make accurate predictions
for datasets with the same format as the simulate
directory. That means
empirical datasets must have the same file types and formats as entries in
the simulate
directory. One difference is that empirical labels.csv
files will only contain entries for “known” parameters, as specified by
param_data
in the configuration; they will not contain the “unknown”
parameters to be estimated, specified by param_est
.
./empirical/viburnum.0.tre # tree file
./empirical/viburnum.0.dat.csv # data file
./empirical/viburnum.0.labels.csv # data-generating params
log
The log
directory contains logs for each phyddle analysis. Log files
are named according to the date and time of the analysis, and contain
runtime information that may be useful for debugging or reproducing results.
Visit Pipeline to learn more about the files.