Formats
Note
(incomplete) Important: This section assumes the project name is ‘example’, while actual projects will likely use different names. Visit Glossary to learn more about how phyddle defines different terms.
This page describes the internal data formats and file formats used by phyddle.
Input datasets
phyddle can make phylogenetic model predictions against input datasets using
previously trained networks. A valid phyddle input dataset is a set of files
that share a filename prefix. For example, a dataset with the prefix out.3
would contain a tree file out.3.tre, a character matrix file out.3.dat.nex,
and (when applicable) a ‘known parameters’ file out.3.labels.csv. Simulated
training datasets and real biological datasets follow the same format.
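The shared-prefix convention can be sketched as follows. This is a minimal illustration, not part of phyddle: dataset_files is a hypothetical helper, and the directory ./simulate is taken from the examples on this page.

```python
from pathlib import Path


def dataset_files(directory, prefix, char_format="nex"):
    """Return the file paths expected for one dataset (hypothetical helper)."""
    base = Path(directory)
    return {
        "tree": base / f"{prefix}.tre",
        "characters": base / f"{prefix}.dat.{char_format}",
        # The 'known parameters' file is only present when applicable.
        "labels": base / f"{prefix}.labels.csv",
    }


files = dataset_files("./simulate", "out.3")
for role, path in files.items():
    status = "found" if path.exists() else "missing"
    print(f"{role}: {path} ({status})")
```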
Trees are encoded as raw data in simple Newick format. Trees are assumed to be rooted, bifurcating, and time-calibrated, and may be ultrametric or non-ultrametric. Ultrametric trees should only be analyzed using treetype == ‘extant’. Non-ultrametric trees, such as those containing serially sampled viruses or fossils, should be analyzed using treetype == ‘serial’. Here is an example of an extant tree with N=8 taxa.
$ cat ./simulate/out.0.tre
((((1:0.35994691486501296,2:0.35994691486501296):1.389952711060852,(3:1.5810568349100933,(4:0.5830569936279364,5:0.5830569936279364):0.9979998412821569):0.1688427910157717):5.655066077200624,6:7.404965703126489):0.3108578683347094,(7:0.7564319839861859,8:0.7564319839861859):6.959391587475013):2.2841764285388018;
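Whether a tree is ultrametric can be checked by comparing root-to-tip distances. The sketch below uses a small hand-written parser that only handles simple Newick strings like the one above. The functions tip_depths and is_ultrametric are hypothetical names and are not part of phyddle, which uses DendroPy internally.

```python
def tip_depths(newick):
    """Return {tip name: root-to-tip distance}; the root's own branch is ignored."""
    text = newick.strip().rstrip(";")
    pos = 0

    def read_token():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos]

    def parse_node():
        # Returns (tips, length): tips maps tip name -> distance to this node,
        # and length is the branch leading to this node.
        nonlocal pos
        tips = {}
        if text[pos] == "(":
            while text[pos] != ")":
                pos += 1  # consume "(" or ","
                child_tips, child_length = parse_node()
                for tip, dist in child_tips.items():
                    tips[tip] = dist + child_length
            pos += 1  # consume ")"
            read_token()  # skip an optional internal node label
        else:
            tips[read_token()] = 0.0
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            length = float(read_token())
        return tips, length

    tips, _root_branch = parse_node()
    return tips


def is_ultrametric(newick, rel_tol=1e-6):
    """True when all tips are (nearly) the same distance from the root."""
    depths = list(tip_depths(newick).values())
    return max(depths) - min(depths) <= rel_tol * max(depths)


extant_tree = (
    "((((1:0.35994691486501296,2:0.35994691486501296):1.389952711060852,"
    "(3:1.5810568349100933,(4:0.5830569936279364,5:0.5830569936279364)"
    ":0.9979998412821569):0.1688427910157717):5.655066077200624,"
    "6:7.404965703126489):0.3108578683347094,"
    "(7:0.7564319839861859,8:0.7564319839861859):6.959391587475013)"
    ":2.2841764285388018;"
)
print("extant" if is_ultrametric(extant_tree) else "serial")
```

The relative tolerance absorbs floating-point rounding in the summed branch lengths; its value here is an arbitrary choice.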
Character data may be encoded in Nexus format (char_format = ‘nex’). Here is an example of a matrix with N=8 taxa and M=3 binary characters.
$ cat ./simulate/out.0.dat.nex
#NEXUS
Begin DATA;
Dimensions NTAX=8 NCHAR=3
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
1 001
2 010
3 100
4 100
5 001
6 001
7 100
8 010
;
END;
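As a rough illustration of how such a file maps taxa to character states, the sketch below reads the Matrix block of the example. It assumes the simple one-taxon-per-line layout shown above and is not a general Nexus parser; read_nexus_matrix is a hypothetical helper, not a phyddle function.

```python
nexus_text = """#NEXUS
Begin DATA;
Dimensions NTAX=8 NCHAR=3
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
1 001
2 010
3 100
4 100
5 001
6 001
7 100
8 010
;
END;
"""


def read_nexus_matrix(text):
    """Return {taxon: state string} from the Matrix block of a simple Nexus file."""
    states = {}
    in_matrix = False
    for line in text.splitlines():
        line = line.strip()
        if line.lower() == "matrix":
            in_matrix = True
        elif in_matrix and line == ";":
            break  # end of the Matrix block
        elif in_matrix and line:
            taxon, chars = line.split()
            states[taxon] = chars
    return states


matrix = read_nexus_matrix(nexus_text)
print(matrix["1"])  # 001
```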
Character data may also be encoded in CSV format (char_format = ‘csv’). Each row gives the taxon name followed by its character states. For example:
$ cat ./simulate/out.0.dat.csv
1,0,0,1
2,0,1,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,1,0,0
8,0,1,0
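A file in this layout can be read with Python's standard csv module. The sketch assumes, as in the example above, that the first column is the taxon name and the remaining columns are integer states.

```python
import csv
import io

csv_text = """1,0,0,1
2,0,1,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,1,0,0
8,0,1,0
"""

# Map each taxon name to its list of integer character states.
char_matrix = {
    row[0]: [int(state) for state in row[1:]]
    for row in csv.reader(io.StringIO(csv_text))
}
print(char_matrix["2"])  # [0, 1, 0]
```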
Some models will accept “known” data-generating parameters as input. For example,
if not all taxa were included in the phylogeny, a model might accept a sampling
fraction label as input. Any labels listed under the param_data setting are
encoded into the auxiliary data tensor during formatting. Example:
$ cat ./simulate/out.0.labels.csv
birth_1,birth_2,death,state_rate,sample_frac
0.5728,0.9082,0.1155,0.0372,0.1114
where the setting param_data == ['sample_frac'] would ensure that only the
sample_frac entry is included in the auxiliary data.
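The selection can be sketched as below. The variable names are illustrative only, and this is not phyddle's formatting code. The sketch simply keeps the labels named in param_data and sets the rest aside.

```python
import csv
import io

labels_csv = """birth_1,birth_2,death,state_rate,sample_frac
0.5728,0.9082,0.1155,0.0372,0.1114
"""
param_data = ["sample_frac"]

# The labels file holds a header row and a single row of values.
row = next(csv.DictReader(io.StringIO(labels_csv)))

# Labels named in param_data are treated as known and passed on as data.
known = {name: float(row[name]) for name in param_data}
remaining = {name: float(value) for name, value in row.items() if name not in param_data}

print(known)  # {'sample_frac': 0.1114}
```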
Tensor formats
Phylogenetic data (e.g. from a Newick file) and character matrix data (e.g.
from a Nexus file) are encoded into compact phylogenetic-state tensors.
Internally, phyddle uses dendropy.Tree to represent phylogenies,
pandas.DataFrame to represent character matrices (verify), and numpy.ndarray
to store phylogenetic-state tensors.
There are two types of phylogenetic-state tensors in phyddle: the compact
bijective ladderized vector + states (CBLV+S) and the compact diversity vector +
states (CDV+S). CBLV+S is used for trees that contain serially sampled
(non-ultrametric) taxa, whereas CDV+S is used for trees that contain only extant
(ultrametric) taxa. The tree_width of the encoding defines the maximum number
of taxa the phylogenetic-state tensor may contain. The tree_encode setting
determines whether the tree is a 'serial' tree encoded with CBLV+S or an
'extant' tree encoded with CDV+S. The brlen_encode and char_encode settings
alter how information is stored in the phylogenetic-state tensor.
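The relationship between these settings can be summarized in a short sketch. The dictionary layout and the check_settings helper are assumptions made for illustration. The allowed values are only those named on this page, and phyddle may accept others.

```python
# Hypothetical settings dictionary; the setting names come from this page,
# but the layout is an assumption rather than phyddle's config format.
settings = {
    "tree_width": 10,
    "tree_encode": "serial",
    "brlen_encode": "height_brlen",
    "char_encode": "integer",
}

# Values named on this page for each setting.
ALLOWED = {
    "tree_encode": {"serial", "extant"},
    "brlen_encode": {"height_only", "height_brlen"},
    "char_encode": {"integer", "one_hot"},
}
TENSOR_TYPE = {"serial": "CBLV+S", "extant": "CDV+S"}


def check_settings(settings):
    """Validate the settings and return the tensor type they imply."""
    for name, allowed in ALLOWED.items():
        if settings[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    if settings["tree_width"] < 1:
        raise ValueError("tree_width must be a positive number of taxa")
    return TENSOR_TYPE[settings["tree_encode"]]


print(check_settings(settings))  # CBLV+S
```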
CBLV+S
This example shows the CBLV+S encoding of 5 taxa with 2 characters. This is the Newick string:
(((A:2,B:1):1,(C:3,D:2):3):1,E:2);
This is the Nexus file:
#NEXUS
Begin DATA;
Dimensions NTAX=5 NCHAR=2
Format MISSING=? GAP=- DATATYPE=STANDARD SYMBOLS="01";
Matrix
A 01
B 11
C 10
D 10
E 01
;
END;
These data can be encoded in different ways, based on the char_encode
setting. When char_encode == 'integer', the encoding treats each character
as a row in the resulting data matrix and assigns the appropriate
integer-valued state to that character for each taxon. Alternatively, when
char_encode == 'one_hot', the encoding treats every distinct state-character
combination as its own row in the resulting data matrix, and marks each cell
as 1 when the species has that character state and 0 otherwise. One-hot
encoding is applied individually to each homologous character (fewer distinct
combinations), not jointly across the entire character set (more distinct
combinations).
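The two encodings can be compared on the example matrix above. This is a sketch only: the row order used for the one-hot case (all states of character 1, then all states of character 2) is an assumption, and taxa are listed in alphabetical rather than ladderized order.

```python
taxa = ["A", "B", "C", "D", "E"]
states = {"A": "01", "B": "11", "C": "10", "D": "10", "E": "01"}
num_chars = 2
num_states = 2

# Integer encoding: one row per character, holding each taxon's state.
integer_rows = [[int(states[t][c]) for t in taxa] for c in range(num_chars)]

# One-hot encoding: one row per (character, state) pair, marking which taxa
# have that state for that character.
one_hot_rows = [
    [1 if int(states[t][c]) == s else 0 for t in taxa]
    for c in range(num_chars)
    for s in range(num_states)
]

for row in integer_rows:
    print(row)
for row in one_hot_rows:
    print(row)
```

Because each character is expanded separately, the one-hot encoding needs one row per state per character (for example, 6 rows for 3 binary characters) rather than one row per joint combination of states (8 for 3 binary characters).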
Ladderizing clades by maximum root-to-tip distance orders the taxa C, D, A,
B, then E, which correspond to the first five entries of the CBLV+S tensor.
When brlen_encode is set to 'height_only', the un-rescaled CBLV+S file
would look like this:
# NOTE: The CBLV+S tensor is shown in this orientation for readability.
# phyddle stores the tensor as the transpose of this in memory,
# meaning rows correspond to taxa, and columns correspond to branch
# length information.
# C,D,A,B,E,-,-,-,-,-
7,2,3,1,2,0,0,0,0,0 # tip-to-node distance
0,4,1,2,0,0,0,0,0,0 # node-to-root distance
1,1,0,1,0,0,0,0,0,0 # character 1
0,0,1,1,1,0,0,0,0,0 # character 2
and like this when brlen_encode is set to 'height_brlen':
# C,D,A,B,E,-,-,-,-,-
7,2,3,1,2,0,0,0,0,0 # tip-to-node distance
0,4,1,2,0,0,0,0,0,0 # node-to-root distance
3,2,2,1,2,0,0,0,0,0 # tip edge length
0,3,1,1,0,0,0,0,0,0 # node edge length
1,1,0,1,0,0,0,0,0,0 # character 1
0,0,1,1,1,0,0,0,0,0 # character 2
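The two distance rows shared by both tables can be reproduced with a short traversal. The sketch below is not phyddle's implementation. It assumes a bifurcating tree written by hand as nested (children, branch_length) pairs, ladderizes each clade by its deepest tip, and walks the tree in order, recording each tip's distance to the most recently visited internal node.

```python
# (((A:2,B:1):1,(C:3,D:2):3):1,E:2); as nested (content, branch_length) pairs.
# A tip stores its name as content; an internal node stores its two children.
AB = ([("A", 2), ("B", 1)], 1)
CD = ([("C", 3), ("D", 2)], 3)
ABCD = ([AB, CD], 1)
tree = ([ABCD, ("E", 2)], 0)


def max_depth(node):
    """Longest distance from the top of this node's branch to any tip below."""
    content, length = node
    if isinstance(content, str):
        return length
    return length + max(max_depth(child) for child in content)


def cblv_rows(root):
    """Return (taxa, tip-to-node distances, node-to-root distances)."""
    taxa, tip_row, node_row = [], [], []
    last_node_depth = 0  # root distance of the last internal node visited

    def visit(node, parent_depth):
        nonlocal last_node_depth
        content, length = node
        depth = parent_depth + length
        if isinstance(content, str):
            taxa.append(content)
            tip_row.append(depth - last_node_depth)
            node_row.append(last_node_depth)
            return
        # Ladderize: visit the child with the deepest tip first.
        first, second = sorted(content, key=max_depth, reverse=True)
        visit(first, depth)
        last_node_depth = depth
        visit(second, depth)

    visit(root, 0)
    return taxa, tip_row, node_row


tree_width = 10
taxa, tip_row, node_row = cblv_rows(tree)
padding = [0] * (tree_width - len(taxa))  # unused columns are zero-filled
print(taxa)                 # ['C', 'D', 'A', 'B', 'E']
print(tip_row + padding)    # [7, 2, 3, 1, 2, 0, 0, 0, 0, 0]
print(node_row + padding)   # [0, 4, 1, 2, 0, 0, 0, 0, 0, 0]
```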
By default, all branch length entries are rescaled to lie between 0 and 1 by expressing them as proportions of tree height (values are formatted here for readability):
# C, D, A, B, E, -, -, -, -, -
1.00,0.29,0.43,0.14,0.29, 0, 0, 0, 0, 0 # tip-to-node distance
0.00,0.57,0.14,0.29,0.00, 0, 0, 0, 0, 0 # node-to-root distance
0.43,0.29,0.29,0.14,0.29, 0, 0, 0, 0, 0 # tip edge length
0.00,0.43,0.14,0.14,0.00, 0, 0, 0, 0, 0 # node edge length
1, 1, 0, 1, 0, 0, 0, 0, 0, 0 # character 1
0, 0, 1, 1, 1, 0, 0, 0, 0, 0 # character 2
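The rescaling amounts to dividing each branch length entry by the tree height, which for this tree is 7 (the root-to-tip distance of taxon C). A minimal sketch follows, with rounding to two decimals used only to match the display above:

```python
tree_width = 10
branch_rows = {
    "tip-to-node distance": [7, 2, 3, 1, 2],
    "node-to-root distance": [0, 4, 1, 2, 0],
    "tip edge length": [3, 2, 2, 1, 2],
    "node edge length": [0, 3, 1, 1, 0],
}

# The first tip-to-node entry is the longest root-to-tip distance.
tree_height = max(branch_rows["tip-to-node distance"])


def rescale(values, height, width):
    """Divide by tree height, round for display, and zero-pad to the tensor width."""
    scaled = [round(v / height, 2) for v in values]
    return scaled + [0] * (width - len(scaled))


rescaled = {name: rescale(values, tree_height, tree_width) for name, values in branch_rows.items()}
for name, values in rescaled.items():
    print(values, "#", name)
```

The character rows hold states rather than branch lengths, so they are left unchanged.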
CDV+S
CDV+S is used to encode phylogenetic-state information for trees of only extant taxa. CDV+S has a similar structure to CBLV+S, but differs in two principal ways. First, CDV+S ladderizes the tree by total subclade diversity rather than by the maximum root-to-tip distance within each clade, which in turn determines which columns are associated with which tip nodes. Second, because CDV+S is used for extant-only trees, it does not need to report tip-to-node distances: they are redundant because the tip-to-root distances are equal among all tips (by definition). This means that CDV+S does not contain a row with tip-to-node distances (the first row of CBLV+S).
For example, the following Newick string for an ultrametric tree
(((A:5,B:5):1,(C:3,D:3):3):1,E:7);
and associating the same character data as above with taxa A through E yields the following CDV+S tensor:
# NOTE: The CDV+S tensor is shown in this orientation for readability.
# phyddle stores the tensor as the transpose of this in memory,
# meaning rows correspond to taxa, and columns correspond to branch
# length information.
# C,D,A,B,E,-,-,-,-,-
0,4,1,2,0,0,0,0,0,0 # node-to-root distance
3,2,2,1,2,0,0,0,0,0 # tip edge length
0,3,1,1,0,0,0,0,0,0 # node edge length
1,1,0,1,0,0,0,0,0,0 # character 1
0,0,1,1,1,0,0,0,0,0 # character 2
Auxiliary data
The auxiliary data tensor contains a panel of summary statistics extracted from the input phylogeny and character data matrix for a given dataset. Currently, phyddle generates the following summary statistics:
tree_length # sum of branch lengths
num_taxa # number of terminal taxa in tree/data
root_age # longest root-to-tip distance
brlen_mean # mean of branch lengths
brlen_var # variance of branch lengths
brlen_skew # skewness of branch lengths
age_mean # mean of internal node ages
age_var # variance of internal node ages
age_skew # skewness of internal node ages
B1 # B1 tree measure (Dendropy)
N_bar # N_bar tree measure (Dendropy)
colless # Colless tree measure (Dendropy)
treeness # treeness measure (Dendropy)
f_dat_0 # frequency of taxa with character in state 0
f_dat_1 # frequency of taxa with character in state 1
...
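A few of the simpler statistics can be computed by hand for the 5-taxon CBLV+S example tree, (((A:2,B:1):1,(C:3,D:2):3):1,E:2);. This sketch follows the one-line descriptions in the list above. Details such as whether a root branch is counted are assumptions, and phyddle's own calculations may differ.

```python
from statistics import mean

# Branch lengths for A, B, (A,B), C, D, (C,D), ((A,B),(C,D)), and E.
branch_lengths = [2, 1, 1, 3, 2, 3, 1, 2]
# Root-to-tip distance for each taxon.
tip_depths = {"A": 4, "B": 3, "C": 7, "D": 6, "E": 2}
# State of character 1 for each taxon.
char_states = {"A": 0, "B": 1, "C": 1, "D": 1, "E": 0}

num_taxa = len(tip_depths)
summary = {
    "tree_length": sum(branch_lengths),
    "num_taxa": num_taxa,
    "root_age": max(tip_depths.values()),
    "brlen_mean": mean(branch_lengths),
    "f_dat_0": sum(1 for s in char_states.values() if s == 0) / num_taxa,
    "f_dat_1": sum(1 for s in char_states.values() if s == 1) / num_taxa,
}
for name, value in summary.items():
    print(name, value)
```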
The auxiliary data tensor also contains any parameter values that shape the
data-generating process but can be treated as “known” rather than needing to
be estimated. For example, epidemiologists may assume they know the rate of
infection recovery (gamma) based on public health or clinical data. Parameters
may be treated as data by providing the labels for those parameters in the
param_data entry of the config file. For example, setting
'param_data' : [ 'sample_frac' ] informs phyddle that the sampling fraction is
known. In a phylogenetic SIR analysis, the labels for the recovery rate and the
susceptible population sizes for location 0 could be listed in the same way.