Workspace

This section describes how phyddle organizes files and directories in its workspace. Visit Formats to learn more about file formats. Visit Configuration to learn more about managing directories and projects within a workspace.

Note

We strongly recommend that beginning users follow this general workspace filesystem design. Advanced users should find it easy to customize locations for config files, simulation scripts, and output directories.

We recommend using the default directory structure for new projects to simplify project management. By default, dir is set to ./ and prefix is set to out. Output for each step is then stored in a subdirectory named after the step, with filenames beginning with the appropriate prefix.

For example, results from Simulate are stored in the simulate subdirectory of the dir directory. If the sim_dir setting is provided, Simulate results are stored in that exact directory instead. For example, if dir is set to the local directory ./, then Simulate results are saved to ./simulate. If sim_dir is set to ../new_project/new_simulations, then Simulate results are stored there regardless of the dir setting.

Similarly, if prefix is set to out, then all simulated datasets begin with the prefix out. If the setting sim_prefix is set to sim, then files generated by Simulate begin with the prefix sim.
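
For example, these locations can be set in the project config. Below is a minimal sketch, assuming the dictionary-style config.py used by the bundled example; the sim_dir path here is hypothetical:

import os  # not required by phyddle; shown only to emphasize this is plain Python

# config.py -- hypothetical excerpt
args = {
    'dir'        : './',                              # base workspace directory
    'prefix'     : 'out',                             # default filename prefix for all steps
    'sim_dir'    : '../new_project/new_simulations',  # overrides <dir>/simulate for Simulate only
    'sim_prefix' : 'sim',                             # overrides 'out' for Simulate files only
}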

Briefly, the workspace directory of a typical phyddle project contains two important files

  • config.py that specifies default settings for phyddle analyses in this project

  • sim_one.R (or a similar name) that defines a valid simulation script

and seven subdirectories for the pipeline analysis

  • simulate contains raw data generated by simulation

  • format contains data formatted into tensors for training networks

  • train contains trained networks and diagnostics

  • estimate contains new test datasets and their estimates

  • plot contains figures of training and validation procedures

  • empirical contains raw data for your empirical analysis

  • log contains runtime logs for a phyddle project

This section will assume all steps use the example project bundled with phyddle, whose default config was created with phyddle --make_cfg and whose results were generated using the command:

phyddle -c ./workspace/example/config.py --end_idx 25000

This corresponds to a 3-region equal-rates GeoSSE model. All directories contain the complete file set, except that ./simulate contains only 20 original examples.

A standard configuration for a project named example would store pipeline work into these directories:

./simulate       # output of Simulate step
./empirical      # your empirical dataset
./format         # output of Format step
./train          # output of Train step
./estimate       # output of Estimate step
./plot           # output of Plot step
./log            # logs for phyddle analyses

Below, we give an overview of the standard files and formats corresponding to each pipeline directory. First, we describe commands that help with workspace management.

You can easily save and share your project workspace with the following command:

cd workspace/example                         # current directory is root of
                                             #     example project directory

phyddle --save_proj example_lite.tar.gz      # save project workspace, but
                                             #     skip simulated training data
                                             #     (faster, smaller)

The resulting tarball will contain the config file, the simulation script, and all workspace directories for pipeline steps. Note that the raw and formatted simulated training example datasets tend to be very large and require substantial time and storage to archive, so they are not fully saved by default. To fully save all workspace project data, add the following options:

phyddle --save_proj example_full.tar.gz \    # save full project, and
        --save_train_fmt T \                 #     include simulated training data
        --save_num_sim 1000000               #     (slower, larger)

If you share the project with a collaborator or save it on a server, you can load the project for use with the command:

mkdir -p ~/new_workspace/new_project         # create new project directory
cd ~/new_workspace/new_project               # enter new project directory
phyddle --load_proj example_lite.tar.gz      # load project in directory
phyddle -s S --sim_more 10000                # (e.g.) simulate 10,000 training examples

Lastly, you can quickly remove all existing workspace directories, while preserving the config file and simulation scripts, with the following command:

cd workspace/example                         # enter directory to clean
phyddle --clean_proj                         # remove all local workspace directories

These are powerful commands, so be careful when using them. They can remove or overwrite files that you want to keep. Master these commands in a safe test directory before applying them to important workspace projects.

simulate

The Simulate step generates raw data under a simulating model; these raw data cannot yet be fed to the neural network for training. A typical simulation will produce the following files:

./sim.0.tre              # tree file
./sim.0.dat.csv          # data file
./sim.0.labels.csv       # data-generating params

Each tree file contains a simple Newick string. Each data file contains state data either in Nexus format (.dat.nex) or simple comma-separated value format (.dat.csv), depending on the setting for char_format.
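
As an illustration, a single raw example can be inspected with standard Python tools. This is a minimal sketch, assuming char_format is 'csv' and the sim prefix used above, with files under ./simulate:

import pandas as pd

idx = 0  # which simulated replicate to inspect

# Newick tree string for replicate idx
with open(f'./simulate/sim.{idx}.tre') as f:
    newick = f.read().strip()

# character data and data-generating parameters are plain CSV
dat    = pd.read_csv(f'./simulate/sim.{idx}.dat.csv')
labels = pd.read_csv(f'./simulate/sim.{idx}.labels.csv')

print(newick[:60])  # start of the Newick string
print(dat.head())
print(labels)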

format

Applying Format to a directory of simulated datasets will output tensors containing the entire set of training examples, stored to, e.g., ./format. If the tensor_format setting is 'csv' (comma-separated value format), the formatted files are:

./out.empirical.phy_data.csv
./out.empirical.aux_data.csv
./out.empirical.labels.csv
./out.test.phy_data.csv
./out.test.aux_data.csv
./out.test.labels.csv
./out.train.phy_data.csv
./out.train.aux_data.csv
./out.train.labels.csv

where the phy_data.csv files contain one flattened Compact Phylogenetic Vector + States (CPV+S) entry per row, the aux_data.csv files contain one vector of auxiliary data (summary statistics and known parameters) per row, and the labels.csv files contain one vector of labels (estimated parameters) per row. Each row of each CSV file corresponds to a single, matched simulated training example. All files are stored in standard comma-separated value format, making them easy to read with standard CSV-reading functions.
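
For example, the row-matched tensors can be loaded with pandas. This is a sketch; whether each file includes a header row should be checked against your own output:

import pandas as pd

# each row i across the three files describes the same training example
phy = pd.read_csv('./format/out.train.phy_data.csv')  # flattened CPV+S entries
aux = pd.read_csv('./format/out.train.aux_data.csv')  # summary stats + known params
lab = pd.read_csv('./format/out.train.labels.csv')    # training targets

# rows are matched across files, so the row counts must agree
assert len(phy) == len(aux) == len(lab)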

If the tensor_format setting is 'hdf5', the resulting files are:

./out.test.hdf5
./out.train.hdf5
./out.empirical.hdf5

where each HDF5 file contains all phylogenetic-state (CPV+S) data, auxiliary data, and label data. Individual simulated training examples share the same ordering across the three internal datasets stored in the file. HDF5 format is not as easily readable as CSV format. However, phyddle uses gzip to automatically (de)compress records, which often leads to files that are over twenty times smaller than equivalent uncompressed CSV-formatted tensors.
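
The HDF5 tensors can be browsed with h5py. This sketch assumes no particular internal dataset names; it simply lists whatever datasets the file contains:

import h5py

with h5py.File('./format/out.train.hdf5', 'r') as f:
    # print every internal dataset's name, shape, and dtype
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)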

train

Training a network creates the following files in the workspace/example/train directory:

./out.cpi_adjustments.csv
./out.train_aux_data_norm.csv
./out.train_est.labels.csv
./out.train_history.csv
./out.train_label_est_nocalib.csv
./out.train_label_norm.csv
./out.train_true.labels.csv
./out.trained_model.pkl

Descriptions of the files are as follows, with the prefix omitted for brevity:

  • trained_model.pkl: a saved file containing the trained PyTorch model

  • train_label_norm.csv and train_aux_data_norm.csv: the location-scale values from the training dataset used to (de)normalize the labels and auxiliary data of any dataset

  • train_true.labels.csv: the true label values for the training and test datasets, where columns correspond to estimated labels (e.g. model parameters)

  • train_est.labels.csv: the trained network's label estimates for the training and test datasets, with calibrated prediction intervals, where columns correspond to the point estimate and the lower and upper CPI bounds for each named label (e.g. model parameter)

  • train_label_est_nocalib.csv: the trained network's label estimates for the training and test datasets, with uncalibrated prediction intervals

  • train_history.csv: the metrics monitored across training epochs

  • cpi_adjustments.csv: calibrated prediction interval adjustments, where columns correspond to parameters, the first row contains lower-bound adjustments, and the second row contains upper-bound adjustments
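
As a small worked example, the CPI adjustments can be read back following the row layout described above. This sketch uses pandas and assumes the file includes a header row of parameter names:

import pandas as pd

# columns correspond to parameters; row 0 holds lower-bound adjustments,
# row 1 holds upper-bound adjustments
cpi = pd.read_csv('./train/out.cpi_adjustments.csv')
lower_adj = cpi.iloc[0]
upper_adj = cpi.iloc[1]
print(lower_adj)
print(upper_adj)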

estimate

The Estimate step will load empirical and simulated test datasets generated by the Format step, and then make new predictions using the network trained during the Train step. Estimation will produce the following files, provided the formatted input datasets can be found on the filesystem:

./out.empirical_est.labels.csv  # output: estimated labels for empirical data
./out.test_est.labels.csv       # output: estimated labels for test data
./out.test_true.labels.csv      # output: true labels for test data

The out.empirical_est.labels.csv and out.test_est.labels.csv files report the point estimates and the lower and upper calibrated prediction intervals (CPIs) for all parameters targeted by the param_est setting. Estimates appear across columns, grouped first by label (e.g. parameter) and then by statistic (i.e. value, lower bound, upper bound). For example:

$ cat out.empirical_est.labels.csv
w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper,d_0_1_value,d_0_1_lower,d_0_1_upper,b_0_1_value,b_0_1_lower,b_0_1_upper
0.2867125345651129,0.1937433853918723,0.45733220552078013,0.02445545359384659,0.002880695707341881,0.10404499205878459,0.4502031713887769,0.1966340488593367,0.5147956690178682,0.06199703190510973,0.0015074254823161301,0.27544015163806645
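
Because column names follow a <label>_<statistic> pattern, the flat CSV is easy to reshape into one row per parameter. A minimal sketch using pandas:

import pandas as pd

est = pd.read_csv('./estimate/out.empirical_est.labels.csv')

# columns look like 'w_0_value', 'w_0_lower', 'w_0_upper', ...;
# split on the final underscore to recover (label, statistic) pairs
tidy = est.T.reset_index()
tidy.columns = ['column', 'estimate']
tidy[['label', 'stat']] = tidy['column'].str.rsplit('_', n=1, expand=True)
print(tidy.pivot(index='label', columns='stat', values='estimate'))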

The test_est.labels.csv and test_true.labels.csv files contain estimated and true label values for the simulated test dataset that was set aside during training. It is crucial that estimation accuracy against the test dataset not be used to inform the training process. If you view the test results and use them to modify Train settings, you should first randomly re-sample the training and test datasets in the Format step. This helps prevent overfitting and ensures that the test dataset remains truly independent of the training procedure.

plot

The Plot step generates visualizations for results previously generated by Format, Train, and (when available) Estimate:

./est_CPI.pdf                       # results from Estimate step
./density_labels.pdf                # label densities from Simulate/Format steps
./density_aux_data.pdf              # aux. data densities from Simulate/Format steps
./pca_contour_labels.pdf            # label PCA of Simulate/Format steps
./pca_contour_aux_data.pdf          # aux. data PCA of Simulate/Format steps
./estimate_test_{label}.pdf         # estimation accuracy on test dataset
./estimate_train_{label}.pdf        # estimation accuracy on train dataset
./history.pdf                       # training history for entire network
./network_architecture.pdf          # neural network architecture
./summary.pdf                       # compiled report with all figures
./summary.csv                       # compiled text file with numerical results

empirical

The empirical directory is used to store raw data for empirical analyses. The network from Train is only trained to make accurate predictions for datasets formatted like the training examples in the simulate directory. That means empirical datasets must have the same file types and formats as entries in the simulate directory. One difference is that empirical labels.csv files will only contain entries for “known” parameters, as specified by param_data in the configuration; they will not contain the “unknown” parameters to be estimated, as specified by param_est.

./empirical/viburnum.0.tre         # tree file
./empirical/viburnum.0.dat.csv     # data file
./empirical/viburnum.0.labels.csv  # data-generating params
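
A quick way to confirm an empirical dataset follows this rule is to check which label columns are present. A sketch; only the param_data entries should appear:

import pandas as pd

# empirical labels hold only the "known" parameters listed in param_data;
# the "unknown" parameters targeted by param_est must be absent
known = pd.read_csv('./empirical/viburnum.0.labels.csv')
print(known.columns.tolist())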

log

The log directory contains logs for each phyddle analysis. Log files are named according to the date and time of the analysis, and contain runtime information that may be useful for debugging or reproducing results.

Visit Pipeline to learn more about the files.