Formats

Note

(incomplete) Important: This section assumes the project name is ‘example’, while actual projects will likely use different names. Visit Glossary to learn more about how phyddle defines different terms.

This page describes different internal datatype formats and file formats used by phyddle.

Input datasets

phyddle can make phylogenetic model predictions against input datasets with previously trained networks. Valid phyddle input datasets contain a set of files with a shared filename prefix. For example, a dataset with the prefix out.3 would contain a tree file out.3.tre, a character matrix file out.3.dat.nex, and (when applicable) a ‘known parameters’ file out.3.labels.csv. Simulated training datasets and real biological datasets follow the same format.
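The naming convention above can be sketched in a few lines. The helper name dataset_files is hypothetical and is not part of phyddle; it only illustrates how the three filenames are derived from a shared prefix.

```python
from pathlib import Path


def dataset_files(directory, prefix, char_format="nex"):
    """Return the paths a dataset with this prefix is expected to contain.

    Hypothetical helper for illustration only; not a phyddle function.
    """
    base = Path(directory)
    return {
        "tree": base / f"{prefix}.tre",
        "characters": base / f"{prefix}.dat.{char_format}",
        # The labels file is only present when known parameters apply.
        "labels": base / f"{prefix}.labels.csv",
    }


files = dataset_files("./simulate", "out.3")
for role, path in files.items():
    print(role, path.name)
```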

Trees are encoded as raw data in simple Newick format. Trees are assumed to be rooted, bifurcating, and time-calibrated, and may be ultrametric or non-ultrametric. Ultrametric trees should only be analyzed using treetype == ‘extant’. Non-ultrametric trees, such as those containing serially sampled viruses or fossils, should be analyzed using treetype == ‘serial’. Here is an example of an extant tree with N=8 taxa.

$ cat ./simulate/out.0.tre
((((1:0.35994691486501296,2:0.35994691486501296):1.389952711060852,(3:1.5810568349100933,(4:0.5830569936279364,5:0.5830569936279364):0.9979998412821569):0.1688427910157717):5.655066077200624,6:7.404965703126489):0.3108578683347094,(7:0.7564319839861859,8:0.7564319839861859):6.959391587475013):2.2841764285388018;
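Whether a tree is ultrametric can be checked by comparing root-to-tip distances. The following is a minimal sketch using only the standard library; the parser handles only simple Newick strings like the one above (no quoted names or comments), and the function names are illustrative rather than phyddle's own.

```python
import math


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip()  # assumed to end with ";"
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def root_to_tip(node, depth=0.0):
    """Map each tip name to its distance from the root (root stem ignored)."""
    if not node["children"]:
        return {node["name"]: depth}
    distances = {}
    for child in node["children"]:
        distances.update(root_to_tip(child, depth + child["length"]))
    return distances


def is_ultrametric(tree, rel_tol=1e-6):
    """True when all tips are (nearly) the same distance from the root."""
    depths = list(root_to_tip(tree).values())
    return all(math.isclose(d, depths[0], rel_tol=rel_tol) for d in depths)


extant_newick = (
    "((((1:0.35994691486501296,2:0.35994691486501296):1.389952711060852,"
    "(3:1.5810568349100933,(4:0.5830569936279364,5:0.5830569936279364)"
    ":0.9979998412821569):0.1688427910157717):5.655066077200624,"
    "6:7.404965703126489):0.3108578683347094,"
    "(7:0.7564319839861859,8:0.7564319839861859):6.959391587475013)"
    ":2.2841764285388018;"
)
serial_newick = "((A:1,B:2):1,C:1);"  # toy tree with tips at different depths

print(is_ultrametric(parse_newick(extant_newick)))
print(is_ultrametric(parse_newick(serial_newick)))
```

A relative tolerance is used because branch lengths written as decimal text rarely sum to exactly equal floating-point values.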

Character data may be encoded in Nexus format (char_format = ‘nex’). Here is an example of a matrix with N=8 taxa and M=3 binary characters.

$ cat ./simulate/out.0.dat.nex
#NEXUS
Begin DATA;
Dimensions NTAX=8 NCHAR=3
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
    1  001
    2  010
    3  100
    4  100
    5  001
    6  001
    7  100
    8  010
;
END;
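For illustration, a matrix block in this simple layout can be read with a few lines of standard-library Python. This is a sketch that assumes one taxon per line, integer state symbols, and no missing or gap characters; it is not a general Nexus parser and not phyddle's own reader.

```python
nexus_text = """#NEXUS
Begin DATA;
Dimensions NTAX=8 NCHAR=3
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
    1  001
    2  010
    3  100
    4  100
    5  001
    6  001
    7  100
    8  010
;
END;
"""


def read_simple_nexus_matrix(text):
    """Return {taxon: [states]} from the Matrix block of a simple Nexus file."""
    matrix = {}
    in_matrix = False
    for line in text.splitlines():
        token = line.strip()
        if token.lower() == "matrix":
            in_matrix = True
        elif in_matrix and token == ";":
            break  # end of the Matrix block
        elif in_matrix and token:
            taxon, states = token.split()
            matrix[taxon] = [int(s) for s in states]
    return matrix


nexus_matrix = read_simple_nexus_matrix(nexus_text)
print(nexus_matrix["1"])
```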

Character data may also be encoded in csv format (char_format = ‘csv’). For example:

$ cat ./simulate/out.0.dat.csv
1,0,0,1
2,0,1,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,1,0,0
8,0,1,0
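As the example shows, each csv row holds a taxon name followed by one state per character. A minimal sketch of reading that layout with Python's csv module follows; the function name is illustrative, not phyddle's.

```python
import csv
import io

csv_text = """1,0,0,1
2,0,1,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,1,0,0
8,0,1,0
"""


def read_char_csv(handle):
    """Return {taxon: [states]}; the first field of each row is the taxon name."""
    matrix = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        taxon, *states = row
        matrix[taxon] = [int(s) for s in states]
    return matrix


csv_matrix = read_char_csv(io.StringIO(csv_text))
print(csv_matrix["1"])
```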

Some models will accept “known” data-generating parameters as input. For example, if not all taxa were included in the phylogeny, a model might accept a sampling fraction label as input. Any labels that are marked under the param_data setting will be encoded into the auxiliary data tensor during formatting. Example:

$ cat ./simulate/out.0.labels.csv
birth_1,birth_2,death,state_rate,sample_frac
0.5728,0.9082,0.1155,0.0372,0.1114

where the setting param_data == ['sample_frac'] would ensure that only the sample_frac entry is included in auxiliary data.
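The selection can be pictured as a simple filter over the labels file. The sketch below assumes a single header row and a single value row, as in the example; known_params is a hypothetical name and does not reflect how phyddle implements this step internally.

```python
import csv
import io

labels_text = """birth_1,birth_2,death,state_rate,sample_frac
0.5728,0.9082,0.1155,0.0372,0.1114
"""


def known_params(handle, param_data):
    """Keep only the label columns named in param_data."""
    row = next(csv.DictReader(handle))
    return {name: float(row[name]) for name in param_data}


aux_params = known_params(io.StringIO(labels_text), ["sample_frac"])
print(aux_params)
```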

Tensor formats

Phylogenetic data (e.g. from a Newick file) and character matrix data (e.g. from a Nexus file) are encoded into compact phylogenetic-state tensors. Internally, phyddle uses dendropy.Tree to represent phylogenies, pandas.DataFrame to represent character matrices (verify), and numpy.ndarray to store phylogenetic-state tensors.

There are two types of phylogenetic-state tensors in phyddle: the compact bijective ladderized vector + states (CBLV+S) and the compact diversity vector + states (CDV+S). CBLV+S is used for trees that contain serially sampled (non-ultrametric) taxa, whereas CDV+S is used for trees that contain only extant (ultrametric) taxa. The tree_width of the encoding defines the maximum number of taxa the phylogenetic-state tensor may contain. The tree_encode setting determines whether the tree is a 'serial' tree encoded with CBLV+S or an 'extant' tree encoded with CDV+S. The brlen_encode and char_encode settings alter how information is stored in the phylogenetic-state tensor.

CBLV+S

This is an example for the CBLV+S encoding of 5 taxa with 2 characters. This is the Newick string:

(((A:2,B:1):1,(C:3,D:2):3):1,E:2);

This is the Nexus file:

#NEXUS
Begin DATA;
Dimensions NTAX=5 NCHAR=2
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
    A  01
    B  11
    C  10
    D  10
    E  01
;
END;

These data can be encoded in different ways, based on the char_encode setting. When char_encode == 'integer', the encoding treats each character as a row in the resulting data matrix and assigns the appropriate integer-valued state to that character for each taxon. Alternatively, when char_encode == 'one_hot', the encoding treats every distinct character-state combination as its own row in the resulting data matrix, marking a cell as 1 when the taxon has that character-state and 0 otherwise. One-hot encoding is applied individually to each homologous character (fewer distinct combinations), not to the entire character set (more distinct combinations).
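The difference between the two encodings can be illustrated with the five-taxon matrix above. This is a sketch only: the function names are hypothetical, taxa are kept in file order, and the row order of the one-hot matrix (character first, then state) is an assumption rather than phyddle's documented layout.

```python
chars = {"A": "01", "B": "11", "C": "10", "D": "10", "E": "01"}
taxa = list(chars)  # file order: A, B, C, D, E


def encode_integer(chars, taxa):
    """One row per character; each cell holds the taxon's integer state."""
    num_chars = len(chars[taxa[0]])
    return [[int(chars[t][c]) for t in taxa] for c in range(num_chars)]


def encode_one_hot(chars, taxa, num_states=2):
    """One row per (character, state) pair; 1 if the taxon has that state."""
    num_chars = len(chars[taxa[0]])
    rows = []
    for c in range(num_chars):
        for s in range(num_states):
            rows.append([1 if int(chars[t][c]) == s else 0 for t in taxa])
    return rows


integer_rows = encode_integer(chars, taxa)
one_hot_rows = encode_one_hot(chars, taxa)
print(len(integer_rows), "rows for integer encoding")
print(len(one_hot_rows), "rows for one-hot encoding")
```

With two binary characters, integer encoding yields 2 rows and one-hot encoding yields 2 x 2 = 4 rows. Encoding the whole character set jointly would also need 2^2 = 4 rows here, but that count grows exponentially with the number of characters, whereas per-character one-hot encoding grows linearly.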

Ladderizing clades by maximum root-to-tip distance orders the taxa C, D, A, B, then E, which correspond to the first five entries of the CBLV+S tensor. When brlen_encode is set to 'height_only' the un-rescaled CBLV+S file would look like this:

# NOTE: The CBLV+S tensor is shown in this orientation for readability.
#       phyddle stores the tensor as the transpose of this in memory,
#       meaning rows correspond to taxa, and columns correspond to branch
#       length information.

# C,D,A,B,E,-,-,-,-,-
  7,2,3,1,2,0,0,0,0,0  # tip-to-node distance
  0,4,1,2,0,0,0,0,0,0  # node-to-root distance
  1,1,0,1,0,0,0,0,0,0  # character 1
  0,0,1,1,1,0,0,0,0,0  # character 2

and like this when brlen_encode is set to 'height_brlen':

# C,D,A,B,E,-,-,-,-,-
  7,2,3,1,2,0,0,0,0,0  # tip-to-node distance
  0,4,1,2,0,0,0,0,0,0  # node-to-root distance
  3,2,2,1,2,0,0,0,0,0  # tip edge length
  0,3,1,1,0,0,0,0,0,0  # node edge length
  1,1,0,1,0,0,0,0,0,0  # character 1
  0,0,1,1,1,0,0,0,0,0  # character 2
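The 'height_brlen' table above can be reproduced with a short in-order traversal of the ladderized tree. The following is a sketch under stated assumptions, not phyddle's implementation: the tree is written by hand as nested (name, branch_length, children) tuples instead of being parsed from Newick, and each internal node is paired with the tip visited immediately after it.

```python
# (((A:2,B:1):1,(C:3,D:2):3):1,E:2); as (name, branch_length, children)
tree = ("", 0, [
    ("", 1, [
        ("", 1, [("A", 2, []), ("B", 1, [])]),
        ("", 3, [("C", 3, []), ("D", 2, [])]),
    ]),
    ("E", 2, []),
])
chars = {"A": "01", "B": "11", "C": "10", "D": "10", "E": "01"}


def max_depth(node):
    """Longest distance from the parent of `node` to any tip below it."""
    _, length, children = node
    return length + max((max_depth(c) for c in children), default=0)


def cblv_s(tree, chars, tree_width=10):
    """Build the un-rescaled tensor, shown with rows as variables."""
    events = []  # in-order sequence of tips and internal nodes

    def visit(node, depth):
        name, length, children = node
        if not children:
            events.append(("tip", name, depth, length))
            return
        # Ladderize: the child leading to the most distant tip goes first.
        left, right = sorted(children, key=max_depth, reverse=True)
        visit(left, depth + left[1])
        events.append(("node", name, depth, length))
        visit(right, depth + right[1])

    visit(tree, 0)
    num_chars = len(next(iter(chars.values())))
    rows = [[0] * tree_width for _ in range(4 + num_chars)]
    col, node_depth, node_length = 0, 0, 0
    for kind, name, depth, length in events:
        if kind == "node":
            node_depth, node_length = depth, length
            continue
        rows[0][col] = depth - node_depth  # tip-to-node distance
        rows[1][col] = node_depth          # node-to-root distance
        rows[2][col] = length              # tip edge length
        rows[3][col] = node_length         # node edge length
        for k, state in enumerate(chars[name]):
            rows[4 + k][col] = int(state)
        col += 1
    return rows


tensor = cblv_s(tree, chars)
for row in tensor:
    print(",".join(str(x) for x in row))
```

Dropping the two edge-length rows gives the 'height_only' layout.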

By default, all branch length entries are rescaled to lie between 0 and 1, in proportion to tree height (formatted to ease reading):

#    C,   D,   A,   B,   E,   -,   -,   -,   -,   -
  1.00,0.29,0.43,0.14,0.29,   0,   0,   0,   0,   0  # tip-to-node distance
  0.00,0.57,0.14,0.29,0.00,   0,   0,   0,   0,   0  # node-to-root distance
  0.43,0.29,0.29,0.14,0.29,   0,   0,   0,   0,   0  # tip edge length
  0.00,0.43,0.14,0.14,0.00,   0,   0,   0,   0,   0  # node edge length
     1,   1,   0,   1,   0,   0,   0,   0,   0,   0  # character 1
     0,   0,   1,   1,   1,   0,   0,   0,   0,   0  # character 2
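The rescaling is a division of each branch-length row by the tree height, which is 7 in this example. A minimal sketch follows; the function name is hypothetical, and the rounding to two decimals is for display only.

```python
unscaled = [
    [7, 2, 3, 1, 2, 0, 0, 0, 0, 0],  # tip-to-node distance
    [0, 4, 1, 2, 0, 0, 0, 0, 0, 0],  # node-to-root distance
    [3, 2, 2, 1, 2, 0, 0, 0, 0, 0],  # tip edge length
    [0, 3, 1, 1, 0, 0, 0, 0, 0, 0],  # node edge length
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],  # character 1
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # character 2
]


def rescale_rows(tensor, num_brlen_rows, height):
    """Divide the branch-length rows by tree height; leave state rows alone."""
    scaled = []
    for i, row in enumerate(tensor):
        if i < num_brlen_rows:
            scaled.append([round(x / height, 2) for x in row])
        else:
            scaled.append(list(row))
    return scaled


scaled = rescale_rows(unscaled, num_brlen_rows=4, height=7)
for row in scaled:
    print(row[:5])
```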

CDV+S

CDV+S is used to encode phylogenetic-state information for trees of only extant taxa. CDV+S has a similar structure to CBLV+S but differs in two principal ways. First, CDV+S ladderizes the tree by total subclade diversity rather than by the maximum root-to-tip distance within each clade, which in turn determines which columns are associated with which tip nodes. Second, because CDV+S is used for extant-only trees, it does not need to report the redundant information about tip-to-node distances, as the tip-to-root distances are equal among all tips (by definition). This means that CDV+S does not contain a row with tip-to-node distances (the first row of CBLV+S).

For example, the following Newick string for an ultrametric tree

(((A:5,B:5):1,(C:3,D:3):3):1,E:7);

and associating the same character data as above with taxa A through E yields the following CDV+S tensor:

# NOTE: The CDV+S tensor is shown in this orientation for readability.
#       phyddle stores the tensor as the transpose of this in memory,
#       meaning rows correspond to taxa, and columns correspond to branch
#       length information.

# C,D,A,B,E,-,-,-,-,-
  0,4,1,2,0,0,0,0,0,0  # node-to-root distance
  3,3,5,5,7,0,0,0,0,0  # tip edge length
  0,3,1,1,0,0,0,0,0,0  # node edge length
  1,1,0,1,0,0,0,0,0,0  # character 1
  0,0,1,1,1,0,0,0,0,0  # character 2

Auxiliary data

The auxiliary data tensor contains a panel of summary statistics extracted from the input phylogeny and character data matrix for a given dataset. Currently, phyddle generates the following summary statistics:

tree_length       # sum of branch lengths
num_taxa          # number of terminal taxa in tree/data
root_age          # longest root-to-tip distance
brlen_mean        # mean of branch lengths
brlen_var         # variance of branch lengths
brlen_skew        # skewness of branch lengths
age_mean          # mean of internal node ages
age_var           # variance of internal node ages
age_skew          # skewness of internal node ages
B1                # B1 tree measure (Dendropy)
N_bar             # N_bar tree measure (Dendropy)
colless           # Colless tree measure (Dendropy)
treeness          # treeness measure (Dendropy)
f_dat_0           # frequency of taxa with character in state 0
f_dat_1           # frequency of taxa with character in state 1
...
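A few of these statistics are simple enough to compute by hand. The sketch below uses the five-taxon CBLV+S example tree, with branch lengths and root-to-tip distances entered directly. It is illustrative only: whether phyddle uses the population or sample variance is not stated here, so the population variance below is an assumption, and the state frequencies are computed for a single character.

```python
import statistics

# Branch lengths of (((A:2,B:1):1,(C:3,D:2):3):1,E:2);
branch_lengths = [2, 1, 1, 3, 2, 3, 1, 2]
# Root-to-tip distance for each taxon in the same tree.
tip_depths = {"A": 4, "B": 3, "C": 7, "D": 6, "E": 2}
# States of character 1 for each taxon.
states = {"A": 0, "B": 1, "C": 1, "D": 1, "E": 0}

num_taxa = len(tip_depths)
stats = {
    "tree_length": sum(branch_lengths),
    "num_taxa": num_taxa,
    "root_age": max(tip_depths.values()),
    "brlen_mean": statistics.fmean(branch_lengths),
    # Assumption: population variance; phyddle's choice may differ.
    "brlen_var": statistics.pvariance(branch_lengths),
    "f_dat_0": sum(1 for s in states.values() if s == 0) / num_taxa,
    "f_dat_1": sum(1 for s in states.values() if s == 1) / num_taxa,
}
for name, value in stats.items():
    print(name, value)
```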

The auxiliary data tensor also contains any parameter values that shape the data-generating process but can be treated as “known” rather than needing to be estimated. For example, epidemiologists may assume they know the rate of infection recovery (gamma) based on public health or clinical data. Parameters may be treated as data by providing the labels for those parameters in the param_data entry of the config file. For example, setting 'param_data' : [ 'sample_frac' ] informs phyddle that the sampling fraction is known and should be included in the auxiliary data.