This page contains all info about the PhyloAcc program including its inputs, options, and outputs.
With Anaconda already set up, PhyloAcc can be installed with the command: conda install phyloacc
For more detailed installation instructions, see the install page.
Note that PhyloAcc is currently only compatible with Linux and OSX operating systems.
You will need the following data to perform an analysis with PhyloAcc:
1. A set of alignments for a set of species from the regions you wish to estimate substitution rates for (e.g. CNEEs).
2. A phylogenetic tree in Newick format from the same set of species, with branch lengths estimated in terms of the relative number of substitutions, corresponding to neutral/background substitution rates.
3. A transition rate matrix for bases in the neutral/background model.
4. For running the gene tree model, a phylogenetic tree with the same topology as the one provided in (2), but with branch lengths in coalescent units. This can be obtained from species tree inference methods like MP-EST or ASTRAL. PhyloAcc can also estimate this tree directly from the input alignments with the --theta option, by building locus trees with IQ-TREE and then using those as input to ASTRAL. This is done for at most the 5000 longest alignments that are longer than 100bp and have at least 20% informative sites. These requirements ensure there is enough phylogenetic signal to infer the tree, given that some inputs to PhyloAcc may be conserved and have few variable sites. However, if you use this option, you must be sure that the alignments you are estimating rates for with PhyloAcc are also suitable for tree inference. Also note that using the --theta option may add significantly to the runtime of the workflow.
(2) and (3) can be obtained by running phyloFit on alignments of likely neutrally evolving sites (e.g. 4-fold degenerate sites in genes). Both are given in the single .mod file output by that program.
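The locus filter described for the --theta option can be sketched in Python as follows. This only illustrates the stated thresholds and is not PhyloAcc's own code: the function names are hypothetical, "informative" is assumed here to mean parsimony-informative (at least two different bases that each occur in at least two sequences), and the 20% is assumed to be a proportion of alignment columns.

```python
from collections import Counter

def informative_fraction(seqs):
    """Fraction of alignment columns that are parsimony-informative (assumed definition)."""
    length = len(seqs[0])
    informative = 0
    for col in zip(*seqs):
        counts = Counter(base.upper() for base in col if base.upper() in "ACGT")
        # Informative if at least two bases each occur in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative / length if length else 0.0

def select_loci_for_theta(alignments, max_loci=5000, min_len=100, min_informative=0.2):
    """alignments: dict of locus name -> list of aligned sequences of equal length."""
    eligible = [
        name for name, seqs in alignments.items()
        if len(seqs[0]) > min_len and informative_fraction(seqs) >= min_informative
    ]
    # Keep the longest eligible alignments, up to max_loci.
    eligible.sort(key=lambda name: len(alignments[name][0]), reverse=True)
    return eligible[:max_loci]
```

Running a check like this on your own alignments before using --theta can give a rough idea of how many loci would be available for tree inference.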
PhyloAcc version 2 facilitates parallelization on computing clusters by using Python to batch loci and Snakemake to submit those batches to the cluster. This results in a three-step process to run PhyloAcc:
1. Process and batch input alignments with the PhyloAcc interface (phyloacc.py).
2. Submit the batches as jobs on the SLURM cluster (snakemake [generated snakefile]).
3. Gather outputs from batches into single files (phyloacc_post.py).
Below we outline several ways to batch loci ("Setting up batches") as well as examples of how to submit the snakefile and gather the outputs.
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-r gt \
-l [file with Newick formatted species tree with branch lengths in coalescent units] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
phyloacc.py \
-d [directory containing multiple FASTA formatted nucleotide alignments] \
-m [mod file from phyloFit with input tree and neutral rate matrix] \
-r adaptive \
-l [file with Newick formatted species tree with branch lengths in coalescent units] \
-o [desired output directory] \
-t "[semi-colon separated list of target branches in the species tree]" \
-j [number of jobs/batches to split the input alignments into] \
-p [processes to use per job/batch of alignments] \
-batch [number of alignments per job/batch] \
-part "[comma separated list of SLURM partitions to submit batches to as jobs]"
snakemake -p -s \
[path to snakefile.smk] \
--configfile [path to config file] \
--profile [path to cluster profile] \
--dryrun
Note that all files for the snakemake command are automatically generated when running phyloacc.py, and the exact command to run will be printed to the screen and written to the log and summary files.
Always try to run the snakemake command with the --dryrun option to catch any errors before the jobs are submitted. After the dry run has completed successfully, remove --dryrun from the command to execute the workflow and start job submission to the cluster.
phyloacc_post.py \
-i [path to the output directory specified when running phyloacc.py]
After running phyloacc_post.py, outputs from each batch will be combined and a summary HTML file will be created with some preliminary summaries of results. This file will be found in the main output directory from the phyloacc.py command (specified with the -o option).
Raw output files are also available in the output directory specified with phyloacc_post.py, with the default directory name being results/.
The raw files are tab delimited and described below.
Marginal log-likelihood for all models (integrating out parameters and latent states), Bayes factors, rates, and states for each locus.
Column header | Column description |
---|---|
phyloacc.id | The number assigned to this locus by PhyloAcc, with the format [batch number]-[locus number] |
original.id | The ID of the locus provided in the input (bed file or fasta file) |
best.fit.model | The model (M0, M1, or M2) that best fits the data for this locus given the specified Bayes Factor cutoffs |
marginal.likelihood.m0 | Marginal log-likelihood under the null model (M0) |
marginal.likelihood.m1 | Marginal log-likelihood under target model (M1) |
marginal.likelihood.m2 | Marginal log-likelihood under the unrestricted full model (M2) |
logbf1 | log Bayes factor between null (M0) and target (M1) models |
logbf2 | log Bayes factor between target (M1) and full (M2) models |
logbf3 | log Bayes factor between full (M2) and null (M0) models |
conserved.rate.m0 | The posterior median of the conserved substitution rate under M0 |
accel.rate.m0 | The posterior median of the accelerated substitution rate under M0 |
conserved.rate.m1 | The posterior median of the conserved substitution rate under M1 |
accel.rate.m1 | The posterior median of the accelerated substitution rate under M1 |
conserved.rate.m2 | The posterior median of the conserved substitution rate under M2 |
accel.rate.m2 | The posterior median of the accelerated substitution rate under M2 |
num.accel.m1 | The number of lineages inferred to be accelerated under M1 |
num.accel.m2 | The number of lineages inferred to be accelerated under M2 |
conserved.lineages.m1 | A comma separated list of the conserved lineages under M1 |
accel.lineages.m1 | A comma separated list of the accelerated lineages under M1 |
conserved.lineages.m2 | A comma separated list of the conserved lineages under M2 |
accel.lineages.m2 | A comma separated list of the accelerated lineages under M2 |
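Because the raw files are tab delimited, the table above can be read with Python's standard csv module. The following is a minimal sketch that assumes the file begins with a header row containing the column names listed above. The function name is hypothetical, and how an empty lineage list is written in the file is an assumption.

```python
import csv

def loci_best_fit(path, model="M1"):
    """Return {phyloacc.id: accelerated lineages under M1} for loci whose best fit is `model`."""
    selected = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["best.fit.model"] != model:
                continue
            # Lineages are stored as a comma separated list; drop empty entries.
            lineages = [x for x in row["accel.lineages.m1"].split(",") if x]
            selected[row["phyloacc.id"]] = lineages
    return selected
```

Calling loci_best_fit with the path to this file would give the loci that best fit the target model, together with the lineages inferred to be accelerated at each one.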
Maximum log-likelihood configurations of the latent state Z under the null, accelerated, and full models, with Z = -1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), or 2 (accelerated).
Each row corresponds to an input element and each column to a branch in the tree. If an element is filtered because of too many alignment gaps, all of its columns will be zero.
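The state coding described above can be decoded with a small lookup. This is an illustrative sketch with hypothetical function names. Note that a row of all zeros is only flagged as possibly filtered, since from the values alone it cannot be told apart from an element that is in the background state on every branch.

```python
STATE_NAMES = {-1: "missing", 0: "background", 1: "conserved", 2: "accelerated"}

def decode_states(branches, z_values):
    """Map each branch name to the name of its latent state."""
    return {branch: STATE_NAMES[int(z)] for branch, z in zip(branches, z_values)}

def possibly_filtered(z_values):
    """True if every column is zero, the pattern described for filtered elements."""
    return all(int(z) == 0 for z in z_values)
```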
Posterior median of conserved rate, accelerated rate, probabilities of gain and loss of conservation (\(\alpha = P(Z=0\rightarrow Z=1)\) and \(\beta = P(Z=1\rightarrow Z=2)\)), and posterior probability of being in each latent state on each branch for each element.
Column header | Column description |
---|---|
Locus ID | [batch number]-[locus number] |
n_rate | Posterior median of accelerated substitution rate |
c_rate | Posterior median of conserved substitution rate |
g_rate | Posterior median of \( \alpha \) |
l_rate | Posterior median of \( \beta \) |
l2_rate | Posterior median of \( \beta_2 = P(Z = 0 \rightarrow Z = 2) \), which is 0 in current implementation |
From the 7th column onward, there are four columns for each branch in the tree: *_0 indicates whether the branch is "missing"; *_1, *_2, and *_3 are the posterior probabilities of the background, conserved, and accelerated states, respectively. The algorithm prunes "missing" branches within the outgroup and sets their latent states to -1, so that all three posterior probabilities are zero. Column names indicate the branch, and the order of the branches is the same as in prefix_elem_Z.txt.
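The per-branch column layout can be unpacked as in the sketch below. It assumes that column names have the form [branch]_[0-3] and that the header and row are already split on tabs. The function names are hypothetical.

```python
def group_branch_columns(header, row, first_branch_col=6):
    """Group the per-branch columns of one row by branch name.

    The first six columns hold the locus ID and rate parameters, so the
    branch columns start at index 6 (the 7th column).
    """
    labels = ("missing", "background", "conserved", "accelerated")
    branches = {}
    for name, value in zip(header[first_branch_col:], row[first_branch_col:]):
        # Split on the last underscore so branch names may contain underscores.
        branch, _, suffix = name.rpartition("_")
        branches.setdefault(branch, {})[labels[int(suffix)]] = float(value)
    return branches

def most_probable_state(probs):
    """Return the state with the highest posterior probability, or None if all are zero."""
    states = {k: v for k, v in probs.items() if k != "missing"}
    if all(v == 0 for v in states.values()):
        return None
    return max(states, key=states.get)
```

Returning None when all three probabilities are zero mirrors the description of pruned "missing" branches.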
Option | Description | Default value |
---|---|---|
-a [FASTA FILE] | An alignment file with all loci concatenated. -b must also be specified. Expected as FASTA format for now. | One of -a/-b or -d is REQUIRED. |
-b [BED FILE] | A bed file with coordinates for the loci in the concatenated alignment file. -a must also be specified. | One of -a/-b or -d is REQUIRED. |
-i [TEXT FILE] | A text file with locus names, one per line, corresponding to regions in the input bed file. If provided, PhyloAcc will only be run on these loci. | Optional. -a and -b must also be specified. |
-d [DIRECTORY] | A directory containing individual alignment files for each locus. Expected as FASTA format for now. | One of -a/-b or -d is REQUIRED. |
-m [MOD FILE] | A file with a background transition rate matrix and phylogenetic tree with branch lengths as output from phyloFit. | REQUIRED. |
-o [DIRECTORY] | Desired output directory. This will be created for you if it doesn't exist. | phyloacc-[date]-[time] |
-t "[STRING]" | Tip labels in the input tree to be used as target species. Enter multiple labels separated by semi-colons (;). | REQUIRED. |
-c "[STRING]" | Tip labels in the input tree to be used as conserved species. Enter multiple labels separated by semi-colons (;). | Optional. Any species not specified in -t or -g will be inferred as conserved. |
-g "[STRING]" | Tip labels in the input tree to be used as outgroup species. Enter multiple labels separated by semi-colons (;). | Optional. |
-l [NEWICK FILE] | A file containing a rooted, Newick formatted tree with the same topology as the species tree in the mod file (-m), but with branch lengths in coalescent units. | When the gene tree model is used, one of -l or --theta must be set. |
--theta | Set this to add gene tree estimation with IQ-TREE and species tree estimation with ASTRAL for estimation of the theta prior. Note that a species tree with branch lengths in units of substitutions per site is still required with -m. Also note that this may add substantial runtime to the pipeline. | When the gene tree model is used, one of -l or --theta must be set. |
-r [STRING] | Determines which version of PhyloAcc will be used. gt: use the gene tree model for all loci; st: use the species tree model for all loci; adaptive: use the gene tree model on loci with many branches with low sCF and the species tree model on all other loci. | st |
-n [INT] | The number of processes that this script should use. | 1 |
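How the -t, -c, and -g label lists relate to each other can be illustrated with a short sketch. The function names are hypothetical, and the check for labels missing from the tree is an assumption about sensible validation rather than a description of PhyloAcc's own behavior.

```python
def parse_labels(value):
    """Split a semi-colon separated option string into a list of labels."""
    return [label.strip() for label in value.split(";") if label.strip()]

def partition_species(tips, targets, outgroup=""):
    """Return (targets, outgroup, conserved); conserved is every remaining tip."""
    target_set = set(parse_labels(targets))
    outgroup_set = set(parse_labels(outgroup))
    unknown = (target_set | outgroup_set) - set(tips)
    if unknown:
        raise ValueError(f"labels not found in tree: {sorted(unknown)}")
    # Any tip not named as a target or outgroup is treated as conserved.
    conserved = [t for t in tips if t not in target_set and t not in outgroup_set]
    return sorted(target_set), sorted(outgroup_set), conserved
```

For example, with tips a to e, targets "a;b", and outgroup "e", the remaining tips c and d would be treated as conserved.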
Option | Description | Default value |
---|---|---|
-burnin [INT] | The number of steps to be discarded in the Markov chain as burnin. | 500 |
-mcmc [INT] | The total number of steps in the Markov chain. | 1000 |
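As a generic illustration of how burnin interacts with the chain length, the sketch below discards the first burnin samples and summarizes the rest with a median. It is not PhyloAcc's sampler, and the function name is hypothetical. With the defaults above, 500 of the 1000 steps would be retained.

```python
from statistics import median

def posterior_median(chain, burnin=500):
    """Median of the samples kept after discarding the first `burnin` steps."""
    kept = chain[burnin:]
    if not kept:
        raise ValueError("burnin must be smaller than the chain length")
    return median(kept)
```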
Option | Description | Default value |
---|---|---|
-scf [FLOAT] | The value of sCF to consider as low for any given branch or locus. Must be between 0 and 1. | 0.5 |
-s [FLOAT] | A value between 0 and 1. If provided, this proportion of branches must have sCF below -scf to be considered for the gene tree model. Otherwise, branch sCF values will be averaged for each locus. | Optional. |
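The interaction between -scf and -s can be sketched as a single decision function. This is an illustration under assumptions: the function and parameter names are hypothetical, "below" is taken as a strict comparison, and the -s proportion is treated as a minimum.

```python
def use_gene_tree_model(branch_scf, scf_cutoff=0.5, min_low_prop=None):
    """Decide whether a locus qualifies for the gene tree model (illustrative)."""
    if not branch_scf:
        raise ValueError("no sCF values for this locus")
    if min_low_prop is None:
        # No -s given: compare the locus-wide average sCF to the cutoff.
        return sum(branch_scf) / len(branch_scf) < scf_cutoff
    # -s given: require at least this proportion of branches with low sCF.
    low = sum(1 for value in branch_scf if value < scf_cutoff)
    return low / len(branch_scf) >= min_low_prop
```

Under -r adaptive, loci for which such a test is true would be run with the gene tree model and all others with the species tree model.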
Option | Description | Default value |
---|---|---|
-p [INT] | The number of processes to use for each batch of PhyloAcc. | 1 |
-j [INT] | The number of jobs (batches) to run in parallel. | 1 |
-batch [INT] | The number of loci to run per batch. | 50 |
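The effect of -batch can be pictured with a small sketch that splits an ordered list of loci into batches and labels each locus as [batch number]-[locus number]. The function names are hypothetical, and the starting indices and the restart of locus numbering within each batch are assumptions.

```python
def make_batches(loci, batch_size=50):
    """Split an ordered list of loci into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [loci[i:i + batch_size] for i in range(0, len(loci), batch_size)]

def assign_ids(batches, first_batch=1, first_locus=0):
    """Label loci as [batch number]-[locus number]; starting indices are assumptions."""
    ids = {}
    for b, batch in enumerate(batches, start=first_batch):
        for n, locus in enumerate(batch, start=first_locus):
            ids[locus] = f"{b}-{n}"
    return ids
```

With the default batch size of 50, 120 loci would be split into three batches of 50, 50, and 20 loci.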
Option | Description | Default value |
---|---|---|
-part "[STRING]" | The SLURM partition or list of partitions (separated by commas) on which to run PhyloAcc jobs. | REQUIRED. |
-nodes [INT] | The number of nodes on the specified partition to submit jobs to. | 1 |
-mem [INT] | The max memory for each job in GB. | 4 |
-time [INT] | The time in hours to give each job. | 1 |
Option | Description | Default value |
---|---|---|
-st-path [STRING] | The path to the PhyloAcc-ST binary. | PhyloAcc-ST |
-gt-path [STRING] | The path to the PhyloAcc-GT binary if -r gt or -r adaptive is set. | PhyloAcc-GT |
-phyloacc "[STRING]" | A catch-all option for other PhyloAcc parameters. Enter as a semi-colon delimited list of options: 'OPT1 value;OPT2 value' | Optional. |
--options | Print the full list of PhyloAcc options that can be specified with -phyloacc and exit. | Optional. |
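The 'OPT1 value;OPT2 value' format accepted by -phyloacc is simple to parse. The sketch below shows one way to split it, with a hypothetical function name. OPT1 and OPT2 are the placeholders from the table, not real option names.

```python
def parse_phyloacc_opts(value):
    """Parse 'OPT1 value;OPT2 value' into a dict of option name -> value string."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        # The option name is everything before the first space.
        name, _, opt_value = item.partition(" ")
        options[name] = opt_value.strip()
    return options
```

Use --options to see which option names are actually valid.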
Option | Description | Default value |
---|---|---|
-iqtree-path "[STRING]" | When --theta is set, gene trees will be inferred from some loci with IQ-TREE. You can provide the path to your iqtree executable with this option. | iqtree |
-coal-path "[STRING]" | When --theta is set, branch lengths on your species tree will be estimated in coalescent units with an external program, currently ASTRAL. With this option you can provide the command to execute your astral.jar file, including any java options. For example, "java -Xmx8g -jar astral.jar" would be a valid command to specify, provided you had a jar file called astral.jar. | java -jar astral.jar |
--labeltree | Simply reads the tree from the input mod file (-m), labels the internal nodes, and exits. | Optional. |
--overwrite | Set this to overwrite existing files. | Optional. |
--appendlog | Set this to keep the old log file even if --overwrite is specified. New log information will instead be appended to the previous log file. | Optional. |
--summarize | Only generate the input summary plots and page. Do not write or overwrite batch job files. | Optional. |
--info | Print some meta information about the program and exit. No other options required. | Optional. |
--depcheck | Run this to check that all dependencies are installed at the provided path. No other options necessary. | Optional. |
--version | Simply print the version and exit. Can also be called as -version, -v, or --v. | Optional. |
--quiet | Set this flag to prevent PhyloAcc from reporting detailed information about each step. | Optional. |
-h | Print a help menu and exit. Can also be called as --help. | Optional. |