In GWAS the Bayesian Sparse Linear Mixed Model (BSLMM) is a hybrid of the LMM, which assumes all SNPs have an effect size drawn from a normal distribution (closer to ridge regression), and sparse regression which finds a few SNPs with non-zero effect sizes.

In their paper on this model Zhou et al show that this hybrid method can have better prediction accuracy than either individual model on its own (which are special cases in their model), and can also estimate the proportion of variance explained by polygenic and sparse effects.

They implement this in their software gemma, which I (and many others) have used to perform GWAS using a LMM. It’s easy to use once you’ve got the data formatted, though memory usage is quite high (>50Gb for N =~ 5000; M =~ 600 000. Perhaps I should be LD-pruning?)

gemma -bfile data_in -bslmm 1 -o data_out_bslmm_1

However, when running this command I got one of two errors at either the Eigen-decomposition stage, or the UtX stage. ‘Segmentation fault’ or ‘Invalid instruction’. I’ve used the gemma static executable (the default download) before without any issues, so it looked to me like it was probably something with the linear algebra libraries.

Downloading the source, it turns out gemma relies on blas, lapack, gsl and atlas (and optionally arpack). When I tried to compile using the default Makefile I got the following errors at the linking stage:

/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `signbitq'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `determine_en
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1213: undefined
reference to `finiteq'
/software/hgi/pkglocal/gcc-4.9.1/lib/gcc/x86_64-unknown-linux-gnu/4.9.1/../../../../lib64/libgfortran.a(write.o): In function `write_float'
/nfs/humgen01/teams/hgi/software/src/gcc-4.9.1/x86_64-unknown-linux-gnu/libgfortran/../.././libgfortran/io/write_float.def:1300: undefined
reference to `isnanq'
collect2: error: ld returned 1 exit status
make: *** [bin/gemma] Error 1

Fortunately these scary errors were familiar to me from wrangling with statically compiling seer. To fix this, compile gemma from source and run the BSLMM without errors add -lquadmath to the following line in the Makefile

LIBS_LNX_S_LAPACK = /usr/lib/lapack/liblapack.a -lgfortran -lquadmath /usr/lib/atlas-base/libatlas.a /usr/lib/libblas/libblas.a -Wl,--allow-multiple-definition

Happy modelling


Tanglegrams are a visual method to compare two phylogenetic trees with the same set of tip labels. This can be useful for comparing trees produced by different methods on the same alignment, or on different alignments of the sample set. Tanglegrams work by connecting the matching tips of the trees, then rotating subtrees to minimise the number of crossings. The algorithm was published in 2011, and continues to be used in a range of publications (for example genomic epidemiology).

I recently ran into an issue using tanglegrams which I’d like to document here. I constructed phylogenies from the protein alignment of all the alleles of PspA identified in a carriage collection of Streptococcus pneumoniae using RAxML (a maximum likelihood method) and mrbayes (a Bayesian method which produces a posterior of trees). For the latter, one can use the R package treescape to find the median tree in the posterior distribution. I wanted to compare the shape of these two trees to see if there were any obvious areas of uncertainty in the topology (a perhaps more quantitative way to do this might be by looking at bootstrap values). I made a tanglegram, from which my initial interpretation was trees were very similar:


pspA allele trees. Tanglegram between maximum likelihood tree and median from the Bayesian posterior.

There is only one clade of swaps (near the bottom), so the trees must be similar? Not necessarily. A game I like to play is manually rotating subtrees to make metadata cluster. Though the topology remains identical, visually the clustering of metadata can change drastically. Even though we know vertical distance is meaningless in a phylogeny, the way this is usually visually displayed can lead us to draw erroneous conclusions.


Rotating subtrees. Left: nice well defined clades clustering together. Right: same tree topology and labels, but the clades no longer form single clusters. Clusters 1 and 4 are polyphyletic, and rotating subtrees can break up or combine these vertically.

Let’s look at the tanglegram above again in more detail:


Misleading tanglegram? This emphasises topology changes in more recent parts of the phylogeny (blue) and may be misleading about similarity of ancestral branches (red).

The swap highlighted in blue is the only event where tip lines cross. This correctly leads us to identify that pspA-31 and pspA-37 are in slightly different places in this part of the phylogeny. However look in the red box. Those these strains totally align, the placement of them within the phylogeny is quite different (red arrows)! In this case I don’t think the line crossings by the tanglegram have identified what I wanted to see here, which is that the ancestral topology of these two trees is different.

While I am not claiming tanglegrams are not useful, without careful attention they may be misleading. A good alternative to tanglegrams, in my opinion, is This allows you to dynamically compare two tree topologies, and see how each branch flip affects the arrangement rather than providing just one static comparison. If you want to try this out on the trees above the two newick files are attached at the end of this post to copy in. Finally, I would also note that there are various metrics which can provide a quantitative comparison between (many) trees, my favourite of which is the Kendall-Colijin metric implemented in treescape.

Update: Michelle Kendall (an author of both the KC metric and treescape) pointed out on twitter that they have added the function plotTreeDiff() as another way to look at topology changes. I gave this a go:


plotTreeDiff result. The two trees, with tips coloured based on how many differences in recent common ancestors there are with other tips.

From the manual:
A plot of the two trees side by side. Tips are coloured in the following way:

  • if each ancestor of a tip in tree 1 occurs in tree 2 with the same partition of tip descendants, then the tip is coloured grey (or supplied “baseCol”)
  • if not, the tip gets coloured pale orange to red on a scale according to how many differences there are amongst its most recent common ancestors with other tips. The colour spectrum can be changed according to preference.’

You can use tipDiff() to see the numeric results direcly.




I have added the likelihood ratio test (LRT) for logistic regression into seer, in addition to the existing Wald test as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes:

  • Both the p-value from the Wald test, and the p-value from the new LRT are in the output.
  • The LRT is expected to be a more powerful test in some situations. I would recommend its use over the Wald test.
  • Testing has shown some clear cases (e.g. when population structure is not a major effect) where the Wald test performs poorly, and the LRT recovers the power of a chi-squared test.
  • I have also put in a LRT for linear regression, but based on an estimate of the residual errors (which therefore gives slightly different results to R at small sample sizes). I don’t expect it to make much, if any, difference in this case.

There’s a nice article on the Wald, LRT and score tests here.

I’ll package this update in a future release, but if you want it now you can checkout the master branch and compile it yourself.

I want to count the number of unique patterns in a vcf file. First I convert it to text with bcftools query:

bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt

The resulting patterns.txt is about 100Gb. The best way I found to count the unique patterns in this was with the following command:

LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l

This used 1063Mb RAM, took 1521s and used a maximum of around 75Gb tmp space on my home (as the /tmp drive on the cluster ran out of space).

With thanks to

A recent paper by Earle et. al. nicely showed the use of linear mixed models to determine drug resistance related genetic variants. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you.

Here are some notes on how to get it to run, and correctly format the input files

Getting gemma to work

You’ll need to use the author’s modified version of gemma, which can be downloaded here. This may not run on your system, due to the blas and lapack libraries being in unexpected places. You can easily solve this by making some symlinks.

First run ldd on the gemma executable to check which libraries cannot be found

ldd /nfs/users/nfs_j/jl11/installations/gemma.0.93b => (0x00007fff28000000) => /usr/lib/ (0x00007f9dfbcc0000) => /usr/lib/ (0x00007f9dfba78000) => not found => not found => /software/hgi/pkglocal/gcc-4.9.1/lib64/ (0x00007f9dfa8d0000) => /lib/x86_64-linux-gnu/ (0x00007f9dfa5d0000) => /software/hgi/pkglocal/gcc-4.9.1/lib64/ (0x00007f9dfa3b8000) => /lib/x86_64-linux-gnu/ (0x00007f9df9ff8000) => /software/hgi/pkglocal/gcc-4.9.1/lib64/ (0x00007f9df9cd8000)
 /lib64/ (0x00007f9dfc120000) => /software/hgi/pkglocal/gcc-4.9.1/lib/../lib64/ (0x00007f9df9a98000)

In my case this was libblas and liblapack. Find the libraries using ‘locate liblapack’ etc. Above, I have created symlinks to these in a location in my LD_LIBRARY_PATH variable ‘/nfs/users/nfs_j/jl11/software/lib’ (you can make a similar directory and export it using ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<new directory>’).

ln -s /usr/lib/lapack/ /nfs/users/nfs_j/jl11/software/lib/
ln -s /usr/lib/libblas/ /nfs/users/nfs_j/jl11/software/lib/

File formatting

The bugwas gen file format is as follows:
ps sample_1_name sample_2_name
23 A C

If you have a file readable into plink (stored as snps.bed, snps.bim and snps.fam) and bcftools, this is an easy conversion.

plink --bfile snps --recode vcf-iid --geno 0.05 --out bugwas
bcftools plugin missing2ref bugwas.vcf > bugwas2.vcf
bcftools query -f '%ID\t[%TGT\t]\n' bugwas2.vcf > bugwas_in.gen
sed -i -e 's/coor_//' -e 's/\/.//g' -e 's/\t$//' bugwas_in.gen
head -7 bugwas.vcf| tail -1 | cut -f 10- | sed 's/#/_/g'| awk '{print "ps\t" $0}' | cat - bugwas_in.gen > tmp.gen
mv tmp.gene bugwas_in.gen

As bugwas requires all sites to be imputed this code will take the major allele where any site is missing. An alternative would be to use the ancestral state, which the authors suggest using ClonalFrameML to do, though this will take longer to run.

If you’ve got a messy vcf rather than a bed as input a script I wrote may help you with missing/multiallelic sites.

Other pieces

You’ll also need gemma to generate the kinship matrix for the random effects:

gemma -bfile snps -gk 1 -o gemma_snps

You’ll need a phylogeny, which you can draw by standard methods (I’d recommend fasttree on your alignment if you’ve got > 1000 samples). If you’ve already got a tree make sure the tip labels match the samples in the gen file.

Run bugwas

The command you’ll need to run in R is of the form:

lin_loc(gen=”bugwas_in.gen”,pheno=”bugwas.pheno”,phylo=””,prefix=”gwas”,gem.path=”/nfs/users/nfs_j/jl11/installations/gemma.0.93b”, relmatrix=”output/gemma_snps.cXX.txt”, output.dir=”./”)

I’ve found for around 2 000 samples and 110 000 snps I needed around 15Gb of RAM and 4-5 hours to run.

Calculates the mean depth of mapped coverage over the whole genome from a mapped bam, using bedtools

bedtools genomecov -ibam  -g bedtools_genome.txt \
 | awk '$1=="genome" {tot += $2*$3; reads += $3} END {print tot/reads}'

To generate bedtools_genome.txt use

samtools faidx reference.fa
cut -f 1-2 reference.fa.fai > bedtools_genome.txt

PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.

The method is first discussed in a 2010 PLOS Comp Bio paper:
and a guide to its applications and use in a 2012 Nature Protocols paper:

To install a command line version of the tool, you can clone it from the github page

git clone

When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of documentation). To install these, use the following commands:

cd peer && mkdir build && cd build
make install

Which will install peertool to ~/software/bin, which you can put on your path.