This page contains all info about the PhyloAcc program including its inputs, options, and outputs.
With Anaconda already setup, Phyloacc can be installed with the command conda install phyloacc
.
For more detailed installation instructions, see the install page.
Note that PhyloAcc is currently only compatible on Linux and OSX operating systems, though can be run on Windows with the Windows Subsystem for Linux (WSL).
You will need the following data to perform an analysis with PhyloAcc
- A set of alignments for a set of species from the regions you wish to estimate substitution rates for (e.g. CNEEs).
- A phylogenetic tree in Newick format from the same set of species with branch lengths estimated in terms of relative number of substitutions corresponding to neutral/background substitution rates.
- A transition rate matrix for bases in the neutral/background model.
-
For running the gene tree model, a phylogenetic tree with the same topology as the one provided in (2), but with branch lengths in
coalescent units. This can be obtained from species tree inference methods like
MP-EST or
ASTRAL.
PhyloAcc has the capability to to estimate this directly from the input alignments with the
--theta
option, by building locus trees using IQ-TREE and then using those as input to ASTRAL. This will be done for at most the 5000 longest alignments that are longer than 100bp and have at least 20% informative sites. These requirements are to ensure there is phylogenetic signal to infer this tree given that some inputs to PhyloAcc may be conserved and have few variable sites. However, you must be sure the alignments you are estimating rates for with PhyloAcc are also suitable for tree inference if you use this option. Also note that using the--theta
option may significantly add to the runtime of the workflow.
(2) and (3) can be obtained running phyloFit on alignments of likely neutrally evolving sites (e.g. 4-fold degenerate sites in genes) and will be given in a single .mod file output from that program.
PhyloAcc version 2 now facilitates parallelization on computing clusters by using Python to batch loci and Snakemake to submit those batches to the cluster. This results in a three-step process to run PhyloAcc:
-
Process and batch input alignments with the PhyloAcc interface (
phyloacc.py
). -
Submit the batches as jobs on the SLURM cluster (
snakemake [generated snakefile]
). -
Gather outputs from batches into single files (
phyloacc-post.py
).
Below we outline several ways to batch loci ("Setting up batches") as well as examples of how to submit the snakefile and gather the outputs.
NOTE: As of v2.3.1, you can now specify options to the phyloacc.py script in two ways:
- By passing the options in the command line (e.g.
phyloacc.py -d [alignment directory] -m [mod file] -o [output directory] ...
), as outlined here. - By specifying the option and value in a config file (e.g.
phyloacc.py --config [config file]
). See below for more info.
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-r gt \
-l [file with Newick formatted species tree with branch lengths in coalescent units] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-r adaptive \
-l [file with Newick formatted species tree with branch lengths in coalescent units] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
Starting with v2.3.1, options for phyloacc.py
can be specified either in the command line, as above, or in a configuration file in YAML format and by using the --config [config file]
syntax.
We provide a template config file, which lists all the options for phyloacc.py with no values specified: Config file template.
In this file, options are specified in key: value pairs, with the key being the option name to the left of the colon and the desired value being to the right of the colon.
For example, one could run the program with the following command:
phyloacc.py -a alignment.fa -b loci.bed -m model.mod -t "species1;species2;species3"
OR one could specify the same options in a config file named phyloacc-cfg.yaml like so:
alignment: alignment.fa
bed: loci.bed
mod: model.mod
targets: "species1;species2;species3"
And then run the program with the following command:
phyloacc.py --config phyloacc-cfg.yaml
These are identical ways to run the same inputs.
IMPORTANT: Options given via the command line take precedence over those givin in the config file.
For instance, if I use the same config file above, but run the command:
phyloacc.py --config phyloacc-cfg.yaml -a other_alignment.fa
Then the alignment.fa
input in the config file will be ignored and other_alignment.fa
will be used instead.
If an option is not specified in either the command line or the config file, a default value will be used. See the Options section for more details.
Boolean options in the config file
For boolean options in the config file (theta_flag
, dollo_flag
, overwrite_flag
, labeltree
,
summarize_flag
, filter_alns
, options_flag
, append_log_flag
, info_flag
,
version_flag
, quiet_flag
),
you can specify them in the config file as either true
or false
. In other words:
phyloacc.py --config phyloacc-cfg.yaml --overwrite
is equivalent to setting
overwrite_flag: true
in the config file and running
phyloacc.py --config phyloacc-cfg.yaml
In practice, for many of these boolean options, using the command line option may be easier than changing the config file each time.
snakemake -p -s \
[path to snakefile.smk] \
--configfile [path to config file] \
--profile [path to cluster profile] \
--dryrun
Note that all files for the snakemake command are automatically generated when running
phyloacc.py
and the exact command to run will be printed to the screen and
written to the log and summary files.
Always try to run the snakemake
command with the --dryrun
option to catch any errors before the jobs are submitted. After the dry run has completed successfully, remove
--dryrun
from the command to execute the workflow and start job submission to the
cluster.
phyloacc_post.py \
-i [path output directory specified when running phyloacc.py]
After running phyloacc_post.py
, outputs from each batch will be combined and a summary
HTML file will be created with some preliminary summaries of results. This file will be found in the main output
directory from the phyloacc.py
command (specified with the -o
option).
Raw output files are also available in the output directory specified with phyloacc_post.py
,
with the default directory name being results/
.
The raw files are tab delimited and described below.
Marginal log-likelihood for all models (integrating out parameters and latent states), Bayes factors, rates, and states for each locus.
Column header | Column description |
---|---|
phyloacc.id | The number assigned to this locus by PhyloAcc, with the format [batch number]-[locus number] |
original.id | The ID of the locus provided in the input (bed file or fasta file) |
best.fit.model | The model (M0, M1, or M2) that best fits the data for this locus given the specified Bayes Factor cutoffs |
marginal.likelihood.m0 | Marginal log-likelihood under the null model (M0) |
marginal.likelihood.m1 | Marginal log-likelihood under target model (M1) |
marginal.likelihood.m2 | Marginal log-likelihood under the unrestricted full model (M2) |
logbf1 | log Bayes factor between null (M0) and target (M1) models |
logbf2 | log Bayes factor between target (M1) and full (M2) models |
logbf3 | log Bayes factor between full (M2) and null (M0) models |
conserved.rate.m0 | The posterior median of the conserved substitution rate under M0 |
accel.rate.m0 | The posterior median of the accelerated substitution rate under M0 |
conserved.rate.m1 | The posterior median of the conserved substitution rate under M1 |
accel.rate.m1 | The posterior median of the accelerated substitution rate under M1 |
conserved.rate.m2 | The posterior median of the conserved substitution rate under M2 |
accel.rate.m2 | The posterior median of the accelerated substitution rate under M2 |
num.accel.m1 | The number of lineages inferred to be accelerated under M1 |
num.accel.m2 | The number of lineages inferred to be accelerated under M2 |
conserved.lineages.m1 | A comma separated list of the conserved lineages under M1 |
accel.lineages.m1 | A comma separated list of the accelerated lineages under M1 |
conserved.lineages.m2 | A comma separated list of the conserved lineages under M2 |
accel.lineages.m1 | A comma separated list of the accelerated lineages under M1 |
Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated).
Each row corresponds to an input element and each column a branch in the tree. If an element is filtered because of too many alignment gaps all the columns will be zero.
Posterior median of conserved rate, accelerated rate, probability of gain and loss conservation (and \(\beta = P(Z=1\rightarrow Z=2)\)), and posterior probability of being in each latent state on each branch for each element.
Column header | Column description |
---|---|
Locus ID | [batch number]-[locus number] |
n_rate | Posterior median of accelerated substitution rate |
c_rate | Posterior median of conserved substitution rate |
g_rate | Posterior median of \( \alpha \) |
l_rate | Posterior median of \( \beta \) |
l2_rate | Posterior median of \( \beta_2 = P(Z = 0 \rightarrow Z = 2) \), which is 0 in current implementation |
From the 7th column and on, there are four columns for each branch in the tree: *_0 indicates whether it's "missing"; *_1, *_2 and *_3 are the posterior probability in the background, conserved and accelerated state respectively. The algorithm will prune "missing" branches within outgroup and set the latent states of them to -1 so that the three posterior probabilities are all zero. Column names indicate the branch and the order of the branch is the same as that in prefix_elem_Z.txt.
Below are the options available for the phyloacc.py
script. The REQUIRED options are:
- Sequences: one of (
-a
and-b
) or-d
- Tree and model:
-m
- Target species:
-t
- Cluster partition:
-part
All other parameters are optional or have default values. Options can be specified in the command line or in a config file. While they are optional,
it is encouraged to optimize the workflow by specifying the options that are relevant to your analysis. Specifically, a named output directory (-o
) will
help you organize your files. And the batching and cluster options should be set to match the resources available to you.
Also note that while a cluster partition is required to be specified with -part
, if you do not wish to run the resulting batches on a cluster you can specify any string
since no checks are done to ensure the partition exists. Then you can run the batches via snakemake without specifying the --profile
option, or run the batches individually
with the config files generated in [your output directory]/phyloacc-job-files/cfgs/
. Though due to the run-time of the model, it his highly recommended to run the batches on a cluster
with dedicated compute nodes.
Command line option | Config file key | Description | Default value |
---|---|---|---|
-a [FASTA FILE]
|
aln_file: [FASTA FILE]
|
An alignment file with all loci concatenated. -b must also be specified.
Expected as FASTA format for now.
|
One of (-a and -b ) or -d is REQUIRED.
|
-b [BED FILE]
|
bed_file: [BED FILE]
|
A bed file with coordinates for the loci in the concatenated alignment file.
-a must also be specified.
|
One of (-a and -b ) or -d is REQUIRED.
|
-i [TEXT FILE]
|
id_file: [TEXT FILE]
|
A text file with locus names, one per line, corresponding to regions in the input bed file. If provided, PhyloAcc will only be run on these loci. |
Optional. -a and -b must also be specified.
|
-d [DIRECTORY]
|
aln_dir: [DIRECTORY]
|
A directory containing individual alignment files for each locus. Expected as FASTA format for now. |
One of (-a and -b ) or -d is REQUIRED.
|
Command line option | Config file key | Description | Default value |
---|---|---|---|
-m [MOD FILE]
|
mod_file: [MOD FILE]
|
A file with a background transition rate matrix and phylogenetic tree with branch lengths as output from phyloFit. | REQUIRED. |
-t "[STRING]"
|
targets: [STRING]
|
Tip labels in the input tree to be used as target species. Enter multiple labels separated by semi-colons (;). | REQUIRED. |
-c "[STRING]"
|
conserved: [STRING]
|
Tip labels in the input tree to be used as conserved species. Enter multiple labels separated by semi-colons (;). |
Optional. Any species not specified in -t or -g will be set as conserved.
|
-g "[STRING]"
|
outgroup: [STRING]
|
Tip labels in the input tree to be used as outgroup species. Enter multiple labels separated by semi-colons (;). | Optional. |
-l [NEWICK FILE]
|
coal_tree: [NEWICK FILE]
|
A file containing a rooted, Newick formatted tree with the same topology as the species tree in the mod file (-m ),
but with branch lengths in coalescent units.
|
When the gene tree model is used, one of -l or --theta must be set.
|
--theta
|
theta_flag: [true/false]
|
Set this to add gene tree estimation with IQ-tree and species estimation with ASTRAL for estimation
of the theta prior. Note that a species tree with branch lengths in units of substitutions per site
is still required with -m . Also note that this may add substantial runtime to the pipeline.
|
When the gene tree model is used, one of -l or --theta must be set.
|
Command line option | Config file key | Description | Default value |
---|---|---|---|
-r [st/gt/adaptive]
|
run_mode: [st/gt/adaptive]
|
Determines which version of PhyloAcc will be used. st: use the species tree model for all loci, gt: use the gene tree model for all loci, adaptive: use the gene tree model on loci with many branches with low sCF and species tree model on all other loci. | st |
--dollo
|
dollo_flag: [true/false]
|
Set this to use the Dollo assumption in the original model, which assumes that once a lineage has been inferred to be in an accelerated state, it and its descendants cannot change state. By default, this assumption is no longer used (false). | false |
Command line option | Config file key | Description | Default value |
---|---|---|---|
--config [YAML FILE]
|
NA
|
The path to a YAML formatted configuration file that specifies the program options (see above). Note that options specified on the command line take precedence of those specified in the config file. | Optional |
-n [INT]
|
num_procs: [INT]
|
The number of processes that this script should use. | 1 |
Command line option | Config file key | Description | Default value |
---|---|---|---|
-o [DIRECTORY]
|
out_dir: [DIRECTORY]
|
Desired output directory. This will be created for you if it doesn't exist. | phyloacc-[date]-[time] |
--overwrite
|
overwrite_flag: [true/false]
|
Set this to overwrite existing files. | Optional. |
--labeltree
|
labeltree: [true/false]
|
Simply reads the tree from the input mod file (-m ), labels the internal nodes, and exits.
|
Optional. |
--summarize
|
summarize_flag: [true/false]
|
Only generate the input summary plots and page. Do not write or overwrite batch job files. | Optional. |
Command line option | Config file key | Description | Default value |
---|---|---|---|
--filter
|
filter_alns: [true/false]
|
By default, any locus with at least 1 informative site is reatained for PhyloAcc. Set this to filter out loci that have at least 50% of sites that are 50% or more gap charcaters OR that have 50% of sequences that are made up of 50% or more gap charcaters. | Optional. |
Command line option | Config file key | Description | Default value |
---|---|---|---|
-scf [FLOAT]
|
scf_branch_cutoff: [FLOAT]
|
The value of sCF to consider as low for any given branch or locus. Must be between 0 and 1. | 0.5 |
-s [FLOAT]
|
scf_prop: [FLOAT]
|
A value between 0 and 1. If provided, this proportion of branches must have sCF below -scf
to be considered for the gene tree model. Otherwise, branch sCF values will be averaged for each locus.
|
Optional. |
Command line option | Config file key | Description | Default value |
---|---|---|---|
-mcmc [INT]
|
mcmc: [INT]
|
The total number of steps in the Markov chain. | 1000 |
-burnin [INT]
|
burnin: [INT]
|
The number of steps to be discarded in the Markov chain as burnin. | 500 |
-chain [INT]
|
chain: [INT]
|
The number of MCMC chains to run. | 1 |
-thin [INT]
|
thin: [INT]
|
For the gene tree model, the number of MCMC steps between gene tree sampling.
The total number of MCMC steps specified with -mcmc will be scaled by this as mcmc*thin.
|
1 |
Command line option | Config file key | Description | Default value |
---|---|---|---|
-batch [INT]
|
batch_size: [INT]
|
The number of loci to run per batch. | 50 |
-p [INT]
|
procs_per_batch [INT]
|
The number of processes to use for each batch of PhyloAcc. | 1 |
-j [INT]
|
num_jobs: [INT]
|
The number of jobs (batches) to run in parallel. | 1 |
-part "[STRING]"
|
cluster_part: [STRING]
|
The SLURM partition or list of partitions (separated by commas) on which to run PhyloAcc jobs. | REQUIRED. |
-nodes [INT]
|
cluster_nodes: [INT]
|
The number of nodes on the specified partition to submit jobs to. | 1 |
-mem [INT]
|
cluster_mem: [INT]
|
The max memory for each job in MB. | 4 |
-time [INT]
|
cluster_time [INT]
|
The time in minutes to give each job. | 1 |
--local
|
local_flag [true/false]
|
Set this to generate a local snakemake workflow rather than one that will be submitted through SLURM with a profile. For testing purposes. | Optional. |
Command line option | Config file key | Description | Default value |
---|---|---|---|
-st-path [STRING]
|
phyloacc_st_path [STRING]
|
The path to the PhyloAcc-ST binary. |
PhyloAcc-ST
|
-gt-path [STRING]
|
phyloacc_gt_path [STRING]
|
The path to the PhyloAcc-GT binary if -r gt or -r adaptive are set.
|
PhyloAcc-GT
|
-iqtree-path "[STRING]"
|
iqtree_path: [STRING]
|
When --theta is set, gene trees will be inferred from some loci with IQ-TREE.
You can provide the path to your iqtree executable with this option
|
iqtree
|
-coal-path "[STRING]"
|
coal_cmd: [STRING]
|
When --theta is set, branch lengths on your species tree will be estimated in coalescent units with an external program. Currently
ASTRAL With this option you can provide the command to execute your astral.jar file,
including any java options. For example, "java -Xmx8g -jar astral.jar " would be a valid command to specify, provided you had a jar file called astral.jar .
|
java -jar astral.jar
|
Command line option | Config file key | Description | Default value |
---|---|---|---|
-phyloacc "[STRING]"
|
phyloacc_opts: [STRING]
|
A catch-all option for other PhyloAcc parameters. Enter as a semi-colon delimited list of options: 'OPT1 value;OPT2 value' | Optional. |
--options
|
options_flag: [true/false]
|
Print the full list of PhyloAcc options that can be specified with -phyloacc and exit.
|
Optional. |
Command line option | Config file key | Description | Default value |
---|---|---|---|
--testcmd
|
testcmd_flag: [true/false]
|
Also print out a command to run a single batch of PhyloAcc directly (without snakemake). For testing purposes. | Optional. |
--depcheck
|
depcheck: [true/false]
|
Run this to check that all dependencies are installed at the provided path. No other options necessary. | Optional. |
--appendlog
|
append_log_flag: [true/false]
|
Set this to keep the old log file even if --overwrite is specified.
New log information will instead be appended to the previous log file.
|
Optional. |
--info
|
info_flag: [true/false]
|
Print some meta information about the program and exit. No other options required. | Optional. |
--version
|
version_flag: [true/false]
|
Simply print the version and exit. Can also be called as -version , -v , or --v .
|
Optional. |
--quiet
|
quiet_flag: [true/false]
|
Set this flag to prevent PhyloAcc from reporting detailed information about each step. | Optional. |
-h
|
NA
|
Print a help menu and exit. Can also be called as --help .
|
Optional. |