PhyloAcc README

This page contains all info about the PhyloAcc program including its inputs, options, and outputs.

Installation

With Anaconda already setup, Phyloacc can be installed with the command conda install phyloacc. For more detailed installation instructions, see the install page.

Note that PhyloAcc is currently only compatible on Linux and OSX operating systems, though can be run on Windows with the Windows Subsystem for Linux (WSL).

Inputs

You will need the following data to perform an analysis with PhyloAcc

  1. A set of alignments for a set of species from the regions you wish to estimate substitution rates for (e.g. CNEEs).
  2. A phylogenetic tree in Newick format from the same set of species with branch lengths estimated in terms of relative number of substitutions corresponding to neutral/background substitution rates.
  3. A transition rate matrix for bases in the neutral/background model.
  4. (2) and (3) can be obtained running phyloFit on alignments of likely neutrally evolving sites (e.g. 4-fold degenerate sites in genes) and will be given in a single .mod file output from that program.

  5. For running the gene tree model, a phylogenetic tree with the same topology as the one provided in (2), but with branch lengths in coalescent units. This can be obtained from species tree inference methods like MP-EST or ASTRAL. PhyloAcc has the capability to to estimate this directly from the input alignments with the --theta option, by building locus trees using IQ-TREE and then using those as input to ASTRAL. This will be done for at most the 5000 longest alignments that are longer than 100bp and have at least 20% informative sites. These requirements are to ensure there is phylogenetic signal to infer this tree given that some inputs to PhyloAcc may be conserved and have few variable sites. However, you must be sure the alignments you are estimating rates for with PhyloAcc are also suitable for tree inference if you use this option. Also note that using the --theta option may significantly add to the runtime of the workflow.
Usage

PhyloAcc version 2 now facilitates parallelization on computing clusters by using Python to batch loci and Snakemake to submit those batches to the cluster. This results in a three-step process to run PhyloAcc:

  1. Process and batch input alignments with the PhyloAcc interface (phyloacc.py).
  2. Submit the batches as jobs on the SLURM cluster (snakemake [generated snakefile]).
  3. Gather outputs from batches into single files (phyloacc-post.py).

Below we outline several ways to batch loci ("Setting up batches") as well as examples of how to submit the snakefile and gather the outputs.

NOTE: As of v2.3.1, you can now specify options to the phyloacc.py script in two ways:

  1. By passing the options in the command line (e.g. phyloacc.py -d [alignment directory] -m [mod file] -o [output directory] ...), as outlined here.
  2. By specifying the option and value in a config file (e.g. phyloacc.py --config [config file]). See below for more info.
1. Setting up batches of loci for the species tree model with a directory of alignments:
phyloacc.py \
    -d [directory containing multiple FASTA formatted nucleotide alignments] \
    -m [mod file from phyloFit with input tree and neutral rate matrix] \
    -o [desired output directory] \
    -t "[semi-colon separated list of target branches in the species tree]" \
    -j [number of jobs/batches to split the input alignments into] \
    -p [processes to use per job/batch of alignments] \
    -batch [number of alignments per job/batch] \
    -part "[comma separated list of SLURM partitions to submit batches to as jobs]"
2. Setting up batches of loci for the gene tree model with a directory of alignments and a provided tree with branch lengths in coalescent units:
phyloacc.py \
    -d [directory containing multiple FASTA formatted nucleotide alignments] \
    -m [mod file from phyloFit with input tree and neutral rate matrix] \
    -r gt \
    -l [file with Newick formatted species tree with branch lengths in coalescent units] \
    -o [desired output directory] \
    -t "[semi-colon separated list of target branches in the species tree]" \
    -j [number of jobs/batches to split the input alignments into] \
    -p [processes to use per job/batch of alignments] \
    -batch [number of alignments per job/batch] \
    -part "[comma separated list of SLURM partitions to submit batches to as jobs]"
3. Setting up batches of loci with the model determined by sCF cutoffs, with a directory of alignments, and with a provided tree with branch lengths in coalescent units:
phyloacc.py \
    -d [directory containing multiple FASTA formatted nucleotide alignments] \
    -m [mod file from phyloFit with input tree and neutral rate matrix] \
    -r adaptive \
    -l [file with Newick formatted species tree with branch lengths in coalescent units] \
    -o [desired output directory] \
    -t "[semi-colon separated list of target branches in the species tree]" \
    -j [number of jobs/batches to split the input alignments into] \
    -p [processes to use per job/batch of alignments] \
    -batch [number of alignments per job/batch] \
    -part "[comma separated list of SLURM partitions to submit batches to as jobs]"
4. Specifying options with a config file:

Starting with v2.3.1, options for phyloacc.py can be specified either in the command line, as above, or in a configuration file in YAML format and by using the --config [config file] syntax.

We provide a template config file, which lists all the options for phyloacc.py with no values specified: Config file template.

In this file, options are specified in key: value pairs, with the key being the option name to the left of the colon and the desired value being to the right of the colon.

For example, one could run the program with the following command:

phyloacc.py -a alignment.fa -b loci.bed -m model.mod -t "species1;species2;species3"

OR one could specify the same options in a config file named phyloacc-cfg.yaml like so:

alignment: alignment.fa
bed: loci.bed
mod: model.mod
targets: "species1;species2;species3"

And then run the program with the following command:

phyloacc.py --config phyloacc-cfg.yaml

These are identical ways to run the same inputs.

IMPORTANT: Options given via the command line take precedence over those givin in the config file.

For instance, if I use the same config file above, but run the command:

phyloacc.py --config phyloacc-cfg.yaml -a other_alignment.fa

Then the alignment.fa input in the config file will be ignored and other_alignment.fa will be used instead.

If an option is not specified in either the command line or the config file, a default value will be used. See the Options section for more details.

Boolean options in the config file

For boolean options in the config file (theta_flag, dollo_flag, overwrite_flag, labeltree, summarize_flag, filter_alns, options_flag, append_log_flag, info_flag, version_flag, quiet_flag), you can specify them in the config file as either true or false. In other words:
phyloacc.py --config phyloacc-cfg.yaml --overwrite

is equivalent to setting

overwrite_flag: true

in the config file and running

phyloacc.py --config phyloacc-cfg.yaml

In practice, for many of these boolean options, using the command line option may be easier than changing the config file each time.

5. Executing a generated snakefile to submit jobs to the cluster:
snakemake -p -s \
    [path to snakefile.smk] \
    --configfile [path to config file] \
    --profile [path to cluster profile] \
    --dryrun

Note that all files for the snakemake command are automatically generated when running phyloacc.py and the exact command to run will be printed to the screen and written to the log and summary files.

Tip - Always check snakefiles with --dryrun before executing

Always try to run the snakemake command with the --dryrun option to catch any errors before the jobs are submitted. After the dry run has completed successfully, remove --dryrun from the command to execute the workflow and start job submission to the cluster.


6. Gather outputs after all snakemake jobs are completed:
phyloacc_post.py \
    -i [path output directory specified when running phyloacc.py]
Output

After running phyloacc_post.py, outputs from each batch will be combined and a summary HTML file will be created with some preliminary summaries of results. This file will be found in the main output directory from the phyloacc.py command (specified with the -o option).

Raw output files are also available in the output directory specified with phyloacc_post.py, with the default directory name being results/.

The raw files are tab delimited and described below.

elem_lik.txt

Marginal log-likelihood for all models (integrating out parameters and latent states), Bayes factors, rates, and states for each locus.

Column header Column description
phyloacc.id The number assigned to this locus by PhyloAcc, with the format [batch number]-[locus number]
original.id The ID of the locus provided in the input (bed file or fasta file)
best.fit.model The model (M0, M1, or M2) that best fits the data for this locus given the specified Bayes Factor cutoffs
marginal.likelihood.m0 Marginal log-likelihood under the null model (M0)
marginal.likelihood.m1 Marginal log-likelihood under target model (M1)
marginal.likelihood.m2 Marginal log-likelihood under the unrestricted full model (M2)
logbf1 log Bayes factor between null (M0) and target (M1) models
logbf2 log Bayes factor between target (M1) and full (M2) models
logbf3 log Bayes factor between full (M2) and null (M0) models
conserved.rate.m0 The posterior median of the conserved substitution rate under M0
accel.rate.m0 The posterior median of the accelerated substitution rate under M0
conserved.rate.m1 The posterior median of the conserved substitution rate under M1
accel.rate.m1 The posterior median of the accelerated substitution rate under M1
conserved.rate.m2 The posterior median of the conserved substitution rate under M2
accel.rate.m2 The posterior median of the accelerated substitution rate under M2
num.accel.m1 The number of lineages inferred to be accelerated under M1
num.accel.m2 The number of lineages inferred to be accelerated under M2
conserved.lineages.m1 A comma separated list of the conserved lineages under M1
accel.lineages.m1 A comma separated list of the accelerated lineages under M1
conserved.lineages.m2 A comma separated list of the conserved lineages under M2
accel.lineages.m1 A comma separated list of the accelerated lineages under M1
[prefix]_M*_elem_Z.txt

Maximum log-likelihood configurations of latent state Z under null, accelerated and full model, with Z=-1 (if the element is 'missing' in the branches of outgroup species), 0 (background), 1 (conserved), 2 (accelerated).

Each row corresponds to an input element and each column a branch in the tree. If an element is filtered because of too many alignment gaps all the columns will be zero.

[prefix]_rate_postZ_M*.txt

Posterior median of conserved rate, accelerated rate, probability of gain and loss conservation (and \(\beta = P(Z=1\rightarrow Z=2)\)), and posterior probability of being in each latent state on each branch for each element.

Column header Column description
Locus ID [batch number]-[locus number]
n_rate Posterior median of accelerated substitution rate
c_rate Posterior median of conserved substitution rate
g_rate Posterior median of \( \alpha \)
l_rate Posterior median of \( \beta \)
l2_rate Posterior median of \( \beta_2 = P(Z = 0 \rightarrow Z = 2) \), which is 0 in current implementation

From the 7th column and on, there are four columns for each branch in the tree: *_0 indicates whether it's "missing"; *_1, *_2 and *_3 are the posterior probability in the background, conserved and accelerated state respectively. The algorithm will prune "missing" branches within outgroup and set the latent states of them to -1 so that the three posterior probabilities are all zero. Column names indicate the branch and the order of the branch is the same as that in prefix_elem_Z.txt.

Options

Below are the options available for the phyloacc.py script. The REQUIRED options are:

  • Sequences: one of (-a and -b) or -d
  • Tree and model: -m
  • Target species: -t
  • Cluster partition: -part

All other parameters are optional or have default values. Options can be specified in the command line or in a config file. While they are optional, it is encouraged to optimize the workflow by specifying the options that are relevant to your analysis. Specifically, a named output directory (-o) will help you organize your files. And the batching and cluster options should be set to match the resources available to you.

Also note that while a cluster partition is required to be specified with -part, if you do not wish to run the resulting batches on a cluster you can specify any string since no checks are done to ensure the partition exists. Then you can run the batches via snakemake without specifying the --profile option, or run the batches individually with the config files generated in [your output directory]/phyloacc-job-files/cfgs/. Though due to the run-time of the model, it his highly recommended to run the batches on a cluster with dedicated compute nodes.

Sequence input options
Command line option Config file key Description Default value
-a [FASTA FILE] aln_file: [FASTA FILE] An alignment file with all loci concatenated. -b must also be specified. Expected as FASTA format for now. One of (-a and -b) or -d is REQUIRED.
-b [BED FILE] bed_file: [BED FILE] A bed file with coordinates for the loci in the concatenated alignment file. -a must also be specified. One of (-a and -b) or -d is REQUIRED.
-i [TEXT FILE] id_file: [TEXT FILE] A text file with locus names, one per line, corresponding to regions in the input bed file. If provided, PhyloAcc will only be run on these loci. Optional. -a and -b must also be specified.
-d [DIRECTORY] aln_dir: [DIRECTORY] A directory containing individual alignment files for each locus. Expected as FASTA format for now. One of (-a and -b) or -d is REQUIRED.
Tree input options
Command line option Config file key Description Default value
-m [MOD FILE] mod_file: [MOD FILE] A file with a background transition rate matrix and phylogenetic tree with branch lengths as output from phyloFit. REQUIRED.
-t "[STRING]" targets: [STRING] Tip labels in the input tree to be used as target species. Enter multiple labels separated by semi-colons (;). REQUIRED.
-c "[STRING]" conserved: [STRING] Tip labels in the input tree to be used as conserved species. Enter multiple labels separated by semi-colons (;). Optional. Any species not specified in -t or -g will be set as conserved.
-g "[STRING]" outgroup: [STRING] Tip labels in the input tree to be used as outgroup species. Enter multiple labels separated by semi-colons (;). Optional.
-l [NEWICK FILE] coal_tree: [NEWICK FILE] A file containing a rooted, Newick formatted tree with the same topology as the species tree in the mod file (-m), but with branch lengths in coalescent units. When the gene tree model is used, one of -l or --theta must be set.
--theta theta_flag: [true/false] Set this to add gene tree estimation with IQ-tree and species estimation with ASTRAL for estimation of the theta prior. Note that a species tree with branch lengths in units of substitutions per site is still required with -m. Also note that this may add substantial runtime to the pipeline. When the gene tree model is used, one of -l or --theta must be set.
PhyloAcc method options
Command line option Config file key Description Default value
-r [st/gt/adaptive] run_mode: [st/gt/adaptive] Determines which version of PhyloAcc will be used. st: use the species tree model for all loci, gt: use the gene tree model for all loci, adaptive: use the gene tree model on loci with many branches with low sCF and species tree model on all other loci. st
--dollo dollo_flag: [true/false] Set this to use the Dollo assumption in the original model, which assumes that once a lineage has been inferred to be in an accelerated state, it and its descendants cannot change state. By default, this assumption is no longer used (false). false
Other input options
Command line option Config file key Description Default value
--config [YAML FILE] NA The path to a YAML formatted configuration file that specifies the program options (see above). Note that options specified on the command line take precedence of those specified in the config file. Optional
-n [INT] num_procs: [INT] The number of processes that this script should use. 1
Output options
Command line option Config file key Description Default value
-o [DIRECTORY] out_dir: [DIRECTORY] Desired output directory. This will be created for you if it doesn't exist. phyloacc-[date]-[time]
--overwrite overwrite_flag: [true/false] Set this to overwrite existing files. Optional.
--labeltree labeltree: [true/false] Simply reads the tree from the input mod file (-m), labels the internal nodes, and exits. Optional.
--summarize summarize_flag: [true/false] Only generate the input summary plots and page. Do not write or overwrite batch job files. Optional.
Alignment options
Command line option Config file key Description Default value
--filter filter_alns: [true/false] By default, any locus with at least 1 informative site is reatained for PhyloAcc. Set this to filter out loci that have at least 50% of sites that are 50% or more gap charcaters OR that have 50% of sequences that are made up of 50% or more gap charcaters. Optional.
sCF options
Command line option Config file key Description Default value
-scf [FLOAT] scf_branch_cutoff: [FLOAT] The value of sCF to consider as low for any given branch or locus. Must be between 0 and 1. 0.5
-s [FLOAT] scf_prop: [FLOAT] A value between 0 and 1. If provided, this proportion of branches must have sCF below -scf to be considered for the gene tree model. Otherwise, branch sCF values will be averaged for each locus. Optional.
MCMC options
Command line option Config file key Description Default value
-mcmc [INT] mcmc: [INT] The total number of steps in the Markov chain. 1000
-burnin [INT] burnin: [INT] The number of steps to be discarded in the Markov chain as burnin. 500
-chain [INT] chain: [INT] The number of MCMC chains to run. 1
-thin [INT] thin: [INT] For the gene tree model, the number of MCMC steps between gene tree sampling. The total number of MCMC steps specified with -mcmc will be scaled by this as mcmc*thin. 1
Batching and cluster options
Command line option Config file key Description Default value
-batch [INT] batch_size: [INT] The number of loci to run per batch. 50
-p [INT] procs_per_batch [INT] The number of processes to use for each batch of PhyloAcc. 1
-j [INT] num_jobs: [INT] The number of jobs (batches) to run in parallel. 1
-part "[STRING]" cluster_part: [STRING] The SLURM partition or list of partitions (separated by commas) on which to run PhyloAcc jobs. REQUIRED.
-nodes [INT] cluster_nodes: [INT] The number of nodes on the specified partition to submit jobs to. 1
-mem [INT] cluster_mem: [INT] The max memory for each job in MB. 4
-time [INT] cluster_time [INT] The time in minutes to give each job. 1
--local local_flag [true/false] Set this to generate a local snakemake workflow rather than one that will be submitted through SLURM with a profile. For testing purposes. Optional.
Executable path options
Command line option Config file key Description Default value
-st-path [STRING] phyloacc_st_path [STRING] The path to the PhyloAcc-ST binary. PhyloAcc-ST
-gt-path [STRING] phyloacc_gt_path [STRING] The path to the PhyloAcc-GT binary if -r gt or -r adaptive are set. PhyloAcc-GT
-iqtree-path "[STRING]" iqtree_path: [STRING] When --theta is set, gene trees will be inferred from some loci with IQ-TREE. You can provide the path to your iqtree executable with this option iqtree
-coal-path "[STRING]" coal_cmd: [STRING] When --theta is set, branch lengths on your species tree will be estimated in coalescent units with an external program. Currently ASTRAL With this option you can provide the command to execute your astral.jar file, including any java options. For example, "java -Xmx8g -jar astral.jar" would be a valid command to specify, provided you had a jar file called astral.jar. java -jar astral.jar
Other PhyloAcc options
Command line option Config file key Description Default value
-phyloacc "[STRING]" phyloacc_opts: [STRING] A catch-all option for other PhyloAcc parameters. Enter as a semi-colon delimited list of options: 'OPT1 value;OPT2 value' Optional.
--options options_flag: [true/false] Print the full list of PhyloAcc options that can be specified with -phyloacc and exit. Optional.
Miscellaneous options
Command line option Config file key Description Default value
--testcmd testcmd_flag: [true/false] Also print out a command to run a single batch of PhyloAcc directly (without snakemake). For testing purposes. Optional.
--depcheck depcheck: [true/false] Run this to check that all dependencies are installed at the provided path. No other options necessary. Optional.
--appendlog append_log_flag: [true/false] Set this to keep the old log file even if --overwrite is specified. New log information will instead be appended to the previous log file. Optional.
--info info_flag: [true/false] Print some meta information about the program and exit. No other options required. Optional.
--version version_flag: [true/false] Simply print the version and exit. Can also be called as -version, -v, or --v. Optional.
--quiet quiet_flag: [true/false] Set this flag to prevent PhyloAcc from reporting detailed information about each step. Optional.
-h NA Print a help menu and exit. Can also be called as --help. Optional.