I’ve just finished, what was for me, a difficult compiler/packaging attempt – creating a working bioconda package for seer. You can look at the pull request to see how many times I failed: https://github.com/bioconda/bioconda-recipes/pull/11263

(I would note I have made this package for legacy users. I would direct anyone interested in the software itself to the reimplementation pyseer)

The reason this was difficult was due to my own inclusion of a number of packages, all of which also need compilation, further adding to the complexity. While most of them were already available in some form in anaconda it turned out there were some issues with using the defaults. I want to note some of the things I had to modify here.

gcc and glibc version

seer requires gcc >4.8 to compile, and glibc > 4.9 to run. The default compiler version in conda is 4.8. Simply add a ‘conda_build_config.yaml’ file containing:

  - gcc # [linux]
  - gxx # [linux]


I had dlib and gzstream as submodules. If you use a git_url as the source these clone recursively, but not with a standard url in meta.yaml. I needed to do ‘git clone –recursive’ with repository and tarball it myself to include these directories in the git hub release.


Is not available on the bioconda channels so I had to compile myself. I included this as a submodule, but rather than using the default Makefile I needed to add the conda defined compiler flags to ensure these were consistent with later steps (particularly -fPIC in CPPFLAGS).



I was attempting to link boost_program_options using either the boost or boost-cpp anaconda packages, which unlike most boost libraries requires compiling. This led to undefined symbols at the linking stage, which I think are due to incompatible compiler (options) used to make the dynamic libraries in the versions currently available on anaconda. This turned out to be the most difficult thing to fix, requiring me to compile boost as part of the recipe.

Rather than downloading and compiling everything, I followed the boost github examples and made a shallow clone, with a fully copy of the boost library I’m using:

git clone --depth 1 https://github.com/boostorg/boost.git boost
rmdir libs/program_options
cd boost
git clone --depth 50 https://github.com/boostorg/program_options.git libs/program_options
git submodule update -q --init tools/boostdep
git submodule update -q --init tools/build
git submodule update -q --init tools/inspect

I then included this in the release tarball. A better way may be to use submodules so this is done automatically using –recursive.

This library needed to be built, but I did so in a work directory to avoid installing unexpected packages with the recipe. Following the conda-forge build.sh for boost-cpp:

pushd boost
python2 tools/boostdep/depinst/depinst.py program_options --include example
cat < tools/build/src/site-config.jam
using gcc : custom : ${CXX} ;
./bootstrap.sh --prefix="${BOOST_BUILT}" --with-libraries=program_options --with-toolset=gcc
./b2 -q \
variant=release \
address-model="${ARCH}" \
architecture=x86 \
debug-symbols=off \
threading=multi \
runtime-link=shared \
link=static,shared \
toolset=gcc-custom \
include="${INCLUDE_PATH}" \
cxxflags="${CXXFLAGS}" \
linkflags="${LINKFLAGS}" \
--layout=system \
-j"${CPU_COUNT}" \

The python2 line sorts out the header libraries required to compile, not included in the shallow clone. The rest are standard methods to install boost, ensuring the same compiler flags as the other compiled code and using the conda compiler.

I then needed to link this boost library statically (leaving the rest dynamic), so modified the make line as follows:

  SEER_LDLIBS="-L../gzstream -L${BOOST_BUILT}/lib -L/usr/local/hdf5/lib \
  -lhdf5 -lgzstream -lz -larmadillo -Wl,-Bstatic -lboost_program_options \
  -Wl,-Bdynamic -lopenblas -llapack -lpthread"


The final trick was linking armadillo correctly. Confusingly it built and linked ok, tested ok locally, but on the bioconda CI I got undefined symbols to lapack at runtime:

seer: symbol lookup error: seer: undefined symbol: wrapper_dgbsv_

This was due to armadillo’s wrapper around its include which links in the versions of blas/openblas and lapack defined at the time it was compiled, which I think must be slightly different from what is now included with the armadillo package dependencies on conda. Easy enough to fix, use a compiler flag to turn the wrapper off and link the libraries manually:

  LDFLAGS="${LDFLAGS} -larmadillo -lopenblas -llapack"

After all of that, it finally worked!


I was getting the following error, attempting to run conda-build on a package, using a conda env:

Traceback (most recent call last):
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/bin/conda-build", line 7, in <module>
from conda_build.cli.main_build import main
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 18, in <module>
import conda_build.api as api
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/api.py", line 22, in <module>
from conda_build.config import Config, get_or_merge_config, DEFAULT_PREFIX_LENGTH as _prefix_length
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/config.py", line 17, in <module>
from .variants import get_default_variant
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/variants.py", line 15, in <module>
from conda_build.utils import ensure_list, trim_empty_keys, get_logger
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/utils.py", line 10, in <module>
import libarchive
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/__init__.py", line 1, in <module>
from .entry import ArchiveEntry
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/entry.py", line 6, in <module>
from . import ffi
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/ffi.py", line 27, in <module>
libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
return self._dlltype(name)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive: cannot open shared object file: No such file or directory

The /lib directory does contain libarchive, both as a dynamic (.so) and static (.a) library. There turned out to be two relevant environment variables:


A workaround is then to run

export LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive.so

There is probably a proper reason why this has happened and a permanent solution to the issue, but this works for now.

In our software PopPUNK we draw a plot of a Gaussian mixture model that uses both the implementation and the excellent example in the scikit-learn documentation:

GMM ellipse example

Gaussian mixture model with mixture components plotted as ellipses

My input is 2D distance, which I first use StandardScaler to normalise each axes between 0 and 1, which helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is generally trivial to convert back into the original co-ordinates:

# scale X
scale = np.amax(X, axis = 0)
scaled_X = X / scale
# plot scaled
plt.scatter([(scaled_X)[Y == i, 0]], [(scaled_X)[Y == i, 1]], .4, color=color)
# plot original
plt.scatter([(scaled_X*scale)[Y == i, 0]], [(scaled_X*scale)[Y == i, 1]], .4, color=color)

The only thing that wasn’t obvious was how to scale the covariances, which are used to draw the ellipses. I knew they needed to be multiplied by the scale twice as they are variances (squared), and had a vague memory of something like xAx^T from transforming ellipses/conic sections in one of my first year maths courses. Happily that was enough to do a search, turning up an explanation on stackoverflow which confirmed this vague memory: https://math.stackexchange.com/questions/1184523/scaling-a-covariance-matrix

Here is the code for making the plot and incorporating a linear scaling

color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold','darkorange'])

fig=plt.figure(figsize=(11, 8), dpi= 160, facecolor='w', edgecolor='k')
splot = plt.subplot(1, 1, 1)
for i, (mean, covar, color) in enumerate(zip(means, covariances, color_iter)):
    scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T)
    v, w = np.linalg.eigh(scaled_covar)
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    u = w[0] / np.linalg.norm(w[0])
    # as the DP will not use every component it has access to
    # unless it needs it, we shouldn't plot the redundant
    # components.
    if not np.any(Y == i):
    plt.scatter([(X)[Y == i, 0]], [(X)[Y == i, 1]], .4, color=color)

# Plot an ellipse to show the Gaussian component
    angle = np.arctan(u[1] / u[0])
    angle = 180. * angle / np.pi # convert to degrees
    ell = mpl.patches.Ellipse(mean*scale, v[0], v[1], 180. + angle, color=color)


I recently read a pre-print from the Veening lab where they had reconstructed various (22 total) physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested: https://doi.org/10.1101/283739

They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on zenodo.

I was interested in trying to replicate their expression module analysis, which was performed using WGCNA. I first downloaded the co-expression matrix, which is split across five CSV files in the zenodo archive. I then basically ran WGCNA exactly as suggested in their tutorial. I then used the annotations from the lab’s D39V genome to get final expression modules.

The expression modules look sensible to me, with the comE master regulator splitting the expression of the competence pathway and associated genes as expected. The modules I predicted can be found here.

Thanks to the authors for making their data so accessible! If you find this useful I would also note the authors also analysed their data with WGCNA, probably in a more optimal and error-checked way than I have, so most likely they have better clusters.

Here is a gist of the code I used:

# See:
# https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-man.pdf
SPV_2_name = read.delim("SPV_2_name.txt", header=FALSE)
colnames(SPV_2_name) = c("SPV", "gene")
# co-expression matrix from Veening lab
# https://www.biorxiv.org/content/early/2018/03/22/283739
# data: https://zenodo.org/record/1157923
# once
`18_matrix_long_5` <- read.csv("18_matrix_long_5.csv", stringsAsFactors=FALSE)
`18_matrix_long_4` <- read.csv("18_matrix_long_4.csv", stringsAsFactors=FALSE)
`18_matrix_long_3` <- read.csv("18_matrix_long_3.csv", stringsAsFactors=FALSE)
`18_matrix_long_2` <- read.csv("18_matrix_long_2.csv", stringsAsFactors=FALSE)
`18_matrix_long_1` <- read.csv("18_matrix_long_1.csv", stringsAsFactors=FALSE)
all = rbind(`18_matrix_long_1`, `18_matrix_long_2`, `18_matrix_long_3`, `18_matrix_long_4`, `18_matrix_long_5`)
all_M = matrix(nrow = 2152, ncol = 2152, dimnames=list(origin=unique(all$gene1), dev=unique(all$gene2)))
all_M[cbind(all$gene1, all$gene2)] = all$correlation
saveRDS(all_M, "coexpression_matrix.Rds")
# subsequently
all_M = readRDS("coexpression_matrix.Rds")
# Run WGCNA as in tutorial
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold.fromSimilarity(all_M, powerVector = powers, verbose = 5)
sizeGrWindow(9, 5)
par(mfrow = c(1,2))
cex1 = 0.9
plot(sft$fitIndices[,1], sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(sft$fitIndices[,1], sign(sft$fitIndices[,3])*sft$fitIndices[,2],
# this line corresponds to using an R^2 cut-off of h
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
softPower = 5
adjacency = adjacency.fromSimilarity(all_M, power = softPower)
# Turn adjacency into topological overlap
TOM = TOMsimilarity(adjacency);
dissTOM = 1TOM
# Call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average")
# Plot the resulting clustering tree (dendrogram)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
# We like large modules, so we set the minimum module size relatively high:
minModuleSize = 10;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
# Convert numeric lables into colors
dynamicColors = labels2colors(dynamicMods)
# Plot the dendrogram and colors underneath
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
colorOrder = c("grey", standardColors(50))
moduleLabels = match(dynamicColors, colorOrder)1
modules = data.frame(SPV=rownames(all_M), module=moduleLabels)
labelled_modules = merge(modules, SPV_2_name, by.x = "SPV", by.y = "SPV", all.x = TRUE)
write.table(labelled_modules, "WGCNA_modules.tsv", quote = F, sep = "\t", row.names = F, col.names = F)
#### !
# Can't do this with similarity
#### !
# Not run
# Calculate eigengenes
MEList = moduleEigengenes(all_M, colors = dynamicColors)
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
MEDiss = 1cor(MEs);
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average");
# Plot the result
sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")

view raw


hosted with ❤ by GitHub

I saw this phylogenetics package today, phyx: https://github.com/FePhyFoFum/phyx

To install without admin rights/sudo I needed to do the following (my software is installed in my home ~/software, rather than e.g. /usr, /usr/local):

Compile armadillo as follows

cmake -DINSTALL_PREFIX=$(HOME)/software
make install

Compile nlopt as follows

./configure --with-cxx --without-octave --without-matlab --prefix=$(HOME)/software
make install

Compile phyx as follows (slightly hacky, maybe there’s a ‘proper’ way)

./configure --prefix=$(HOME)/software

change line 11 of the Makefile (CPP_LIBS) to add the library path:

CPP_LIBS = -L$(HOME)/software/lib -llapack -lblas -lpthread -lm

change line 23 of the Makefile (OPT_FLAGS) to add the include path/CPPFLAGS:

OPT_FLAGS := -O3 -std=c++11 -fopenmp -I$(HOME)/software/include

then you can run

make install




Trying to install PyVCF under a python (3) virtual environment gave me the following error:

(venv)johnlees@hpc:~$ pip install pyvcf
 Downloading/unpacking pyvcf
 Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
 Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
 Running setup.py egg_info for package pyvcf
 Traceback (most recent call last):
 File "", line 16, in 
 FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):

File "", line 16, in

FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'

The solution was to upgrade setuptools:

pip install --upgrade setuptools

I went from 0.6.34 to 36.8.0 (apparently!)

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates when data in logistic regression is separable or close to separable (when a chi-squared contingency table has small entries).

I found that although there is an R implementation logistf I couldn’t find an equivalent in another language, or python’s statsmodels. Here is a gist with my python functions and a skeleton of how to use them and calculate p-values, in case anyone would like to use this in future without having to write the optimiser themselves.

#!/usr/bin/env python
'''Python implementation of Firth regression by John Lees
See https://www.ncbi.nlm.nih.gov/pubmed/12758140''&#39;
def firth_likelihood(beta, logit):
return (logit.loglike(beta) + 0.5*np.log(np.linalg.det(logit.hessian(beta))))
# Do firth regression
# Note information = -hessian, for some reason available but not implemented in statsmodels
def fit_firth(y, X, start_vec=None, step_limit=1000, convergence_limit=0.0001):
logit_model = smf.Logit(y, X)
if start_vec is None:
start_vec = np.zeros(X.shape[1])
beta_iterations = []
for i in range(0, step_limit):
pi = logit_model.predict(beta_iterations[i])
W = np.diagflat(np.multiply(pi, 1pi))
var_covar_mat = np.linalg.pinv(logit_model.hessian(beta_iterations[i]))
# build hat matrix
rootW = np.sqrt(W)
H = np.dot(np.transpose(X), np.transpose(rootW))
H = np.matmul(var_covar_mat, H)
H = np.matmul(np.dot(rootW, X), H)
# penalised score
U = np.matmul(np.transpose(X), y pi + np.multiply(np.diagonal(H), 0.5 pi))
new_beta = beta_iterations[i] + np.matmul(var_covar_mat, U)
# step halving
j = 0
while firth_likelihood(new_beta, logit_model) > firth_likelihood(beta_iterations[i], logit_model):
new_beta = beta_iterations[i] + 0.5*(new_beta beta_iterations[i])
j = j + 1
if (j > step_limit):
sys.stderr.write('Firth regression failed\n')
return None
if i > 0 and (np.linalg.norm(beta_iterations[i] beta_iterations[i1]) < convergence_limit):
return_fit = None
if np.linalg.norm(beta_iterations[i] beta_iterations[i1]) >= convergence_limit:
sys.stderr.write('Firth regression failed\n')
# Calculate stats
fitll = firth_likelihood(beta_iterations[1], logit_model)
intercept = beta_iterations[1][0]
beta = beta_iterations[1][1:].tolist()
bse = np.sqrt(np.diagonal(np.linalg.pinv(logit_model.hessian(beta_iterations[1]))))
return_fit = intercept, beta, bse, fitll
return return_fit
if __name__ == "__main__":
import sys
import warnings
import math
import statsmodels
import numpy as np
from scipy import stats
import statsmodels.api as smf
# create X and y here. Make sure X has an intercept term (column of ones)
# …
# How to call and calculate p-values
(intercept, beta, bse, fitll) = fit_firth(y, X)
beta = [intercept] + beta
# Wald test
waldp = []
for beta_val, bse_val in zip(beta, bse):
waldp.append(2 * (1 stats.norm.cdf(abs(beta_val/bse_val))))
lrtp = []
for beta_idx, (beta_val, bse_val) in enumerate(zip(beta, bse)):
null_X = np.delete(X, beta_idx, axis=1)
(null_intercept, null_beta, null_bse, null_fitll) = fit_firth(y, null_X)
lrstat = 2*(null_fitll fitll)
lrt_pvalue = 1
if lrstat > 0: # non-convergence
lrt_pvalue = stats.chi2.sf(lrstat, 1)

In GWAS the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the LMM, which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression which finds a few SNPs with non-zero effect sizes.

In their paper on this model Zhou et al show that this hybrid method can have better prediction accuracy than either individual model on its own (which are special cases in their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.

They implement this in their software gemma, which I (and many others) have used to perform GWAS using a LMM. It’s easy to use once you’ve got the data formatted, though memory usage is quite high (>50Gb for N =~ 5000; M =~ 600 000. Perhaps I should be LD-pruning?)

gemma -bfile data_in -bslmm 1 -o data_out_bslmm_1

However, when running this command I got one of two errors at either the Eigen-decomposition stage, or the UtX stage. ‘Segmentation fault’ or ‘Invalid instruction’. I’ve used the gemma static executable (the default download) before without any issues, so it looked to me like it was probably something with the linear algebra libraries.

Downloading the source, it turns out gemma relies on blas, lapack, gsl and atlas (and optionally arpack). When I tried to compile using the default Makefile I got the following errors at the linking stage:

/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `signbitq'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `determine_en
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1213: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `isnanq'
collect2: error: ld returned 1 exit status
make: *** [bin/gemma] Error 1

Fortunately these scary errors were familiar to me from wrangling with statically compiling seer. To fix this, compile gemma from source and run the BSLMM without errors add -lquadmath to the following line in the Makefile

LIBS_LNX_S_LAPACK = /usr/lib/lapack/liblapack.a -lgfortran -lquadmath /usr/lib/atlas-base/libatlas.a /usr/lib/libblas/libblas.a -Wl,--allow-multiple-definition

Happy modelling

Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the sample set. Tanglegrams work by connecting the matching tips of the trees, then rotating subtrees to minimise the number of crossings. The algorithm was published in 2011, and continues to be used in a range of publications (for example genomic epidemiology).

I recently ran into an issue using tanglegrams which I’d like to document here. I constructed phylogenies from the protein alignment of all the alleles of PspA identified in a carriage collection of Streptococcus pneumoniae using RAxML (a maximum likelihood method) and mrbayes (a Bayesian method which produces a posterior of trees). For the latter, one can use the R package treescape to find the median tree in the posterior distribution. I wanted to compare the shape of these two trees to see if there were any obvious areas of uncertainty in the topology (a perhaps more quantitative way to do this might be by looking at bootstrap values). I made a tanglegram, from which my initial interpretation was trees were very similar:


pspA allele trees. Tanglegram between maximum likelihood tree and median from the Bayesian posterior.

There is only one clade of swaps (near the bottom), so the trees must be similar? Not necessarily. A game I like to play is manually rotating subtrees to make metadata cluster. Though the topology remains identical, visually the clustering of metadata can change drastically. Even though we know vertical distance is meaningless in a phylogeny, the way this is usually visually displayed can lead us to draw erroneous conclusions.


Rotating subtrees. Left: nice well defined clades clustering together. Right: same tree topology and labels, but the clades no longer form single clusters. Clusters 1 and 4 are polyphyletic, and rotating subtrees can break up or combine these vertically.

Let’s look at the tanglegram above again in more detail:


Misleading tanglegram? This emphasises topology changes in more recent parts of the phylogeny (blue) and may be misleading about similarity of ancestral branches (red).

The swap highlighted in blue is the only event where tip lines cross. This correctly leads us to identify that pspA-31 and pspA-37 are in slightly different places in this part of the phylogeny. However look in the red box. Those these strains totally align, the placement of them within the phylogeny is quite different (red arrows)! In this case I don’t think the line crossings by the tanglegram have identified what I wanted to see here, which is that the ancestral topology of these two trees is different.

While I am not claiming tanglegrams are not useful, without careful attention they may be misleading. A good alternative to tanglegrams, in my opinion, is phylo.io. This allows you to dynamically compare two tree topologies, and see how each branch flip affects the arrangement rather than providing just one static comparison. If you want to try this out on the trees above the two newick files are attached at the end of this post to copy in. Finally, I would also note that there are various metrics which can provide a quantitative comparison between (many) trees, my favourite of which is the Kendall-Colijin metric implemented in treescape.

Update: Michelle Kendall (an author of both the KC metric and treescape) pointed out on twitter that they have added the function plotTreeDiff() as another way to look at topology changes. I gave this a go:


plotTreeDiff result. The two trees, with tips coloured based on how many differences in recent common ancestors there are with other tips.

From the manual:
A plot of the two trees side by side. Tips are coloured in the following way:

  • if each ancestor of a tip in tree 1 occurs in tree 2 with the same partition of tip descendants, then the tip is coloured grey (or supplied “baseCol”)
  • if not, the tip gets coloured pale orange to red on a scale according to how many differences there are amongst its most recent common ancestors with other tips. The colour spectrum can be changed according to preference.’

You can use tipDiff() to see the numeric results direcly.