Archive

Uncategorized

A recent paper by Earle et. al. nicely showed the use of linear mixed models to determine drug resistance related genetic variants. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you.

Here are some notes on how to get it to run, and correctly format the input files

Getting gemma to work

You’ll need to use the author’s modified version of gemma, which can be downloaded here. This may not run on your system, due to the blas and lapack libraries being in unexpected places. You can easily solve this by making some symlinks.

First run ldd on the gemma executable to check which libraries cannot be found

ldd /nfs/users/nfs_j/jl11/installations/gemma.0.93b
 linux-vdso.so.1 => (0x00007fff28000000)
 libgsl.so.0 => /usr/lib/libgsl.so.0 (0x00007f9dfbcc0000)
 libgslcblas.so.0 => /usr/lib/libgslcblas.so.0 (0x00007f9dfba78000)
 libblas.so.3 => not found
 liblapack.so.3 => not found
 libstdc++.so.6 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libstdc++.so.6 (0x00007f9dfa8d0000)
 libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9dfa5d0000)
 libgcc_s.so.1 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgcc_s.so.1 (0x00007f9dfa3b8000)
 libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9df9ff8000)
 libgfortran.so.3 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgfortran.so.3 (0x00007f9df9cd8000)
 /lib64/ld-linux-x86-64.so.2 (0x00007f9dfc120000)
 libquadmath.so.0 => /software/hgi/pkglocal/gcc-4.9.1/lib/../lib64/libquadmath.so.0 (0x00007f9df9a98000)

In my case this was libblas and liblapack. Find the libraries using ‘locate liblapack’ etc. Above, I have created symlinks to these in a location in my LD_LIBRARY_PATH variable ‘/nfs/users/nfs_j/jl11/software/lib’ (you can make a similar directory and export it using ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<new directory>’).

ln -s /usr/lib/lapack/liblapack.so /nfs/users/nfs_j/jl11/software/lib/liblapack.so.3
ln -s /usr/lib/libblas/libblas.so /nfs/users/nfs_j/jl11/software/lib/libblas.so.3

File formatting

The bugwas gen file format is as follows:
ps sample_1_name sample_2_name
23 A C
etc.

If you have a file readable into plink (stored as snps.bed, snps.bim and snps.fam) and bcftools, this is an easy conversion.

plink --bfile snps --recode vcf-iid --geno 0.05 --out bugwas
bcftools plugin missing2ref bugwas.vcf > bugwas2.vcf
bcftools query -f '%ID\t[%TGT\t]\n' bugwas2.vcf > bugwas_in.gen
sed -i -e 's/coor_//' -e 's/\/.//g' -e 's/\t$//' bugwas_in.gen
head -7 bugwas.vcf| tail -1 | cut -f 10- | sed 's/#/_/g'| awk '{print "ps\t" $0}' | cat - bugwas_in.gen > tmp.gen
mv tmp.gene bugwas_in.gen

As bugwas requires all sites to be imputed this code will take the major allele where any site is missing. An alternative would be to use the ancestral state, which the authors suggest using ClonalFrameML to do, though this will take longer to run.

If you’ve got a messy vcf rather than a bed as input a script I wrote may help you with missing/multiallelic sites.

Other pieces

You’ll also need gemma to generate the kinship matrix for the random effects:

gemma -bfile snps -gk 1 -o gemma_snps

You’ll need a phylogeny, which you can draw by standard methods (I’d recommend fasttree on your alignment if you’ve got > 1000 samples). If you’ve already got a tree make sure the tip labels match the samples in the gen file.

Run bugwas

The command you’ll need to run in R is of the form:

lin_loc(gen=”bugwas_in.gen”,pheno=”bugwas.pheno”,phylo=”fasttree.tr”,prefix=”gwas”,gem.path=”/nfs/users/nfs_j/jl11/installations/gemma.0.93b”, relmatrix=”output/gemma_snps.cXX.txt”, output.dir=”./”)

I’ve found for around 2 000 samples and 110 000 snps I needed around 15Gb of RAM and 4-5 hours to run.

Since October 2013 I have stopped using Fedora, and instead use machines running Ubuntu 12.04/13.10, Windows 8 and OS X 10.8.5. As these OSs have a larger user base than Fedora, many of the issues I encounter are well documented and easy to fix (i.e. there is a stackexchange post as one of the top three google results), hence there haven’t been many things for me to post under the original remit of this blog.

Of course, when I do encounter an undocumented OS based issue as I go about my business I’ll still try and post it on leesjohn. However I expect this to be much less common than previously, and the new computing based issues I find myself having to deal with are:
Interactions and differences between OS X and Ubuntu when working with them simultaneously
Working with Ubuntu without a sudo account (e.g. installing software, using custom libraries)
Use of radio software (e.g. Rivendell, Cuedex, Jack)

I have now changed area from physics to bioinformatics, and think there is scope to share many of the scripts and programs I write for this, as well as solutions to issues I encounter in the area. So I have finally gotten round to setting up a github account (https://github.com/johnlees) to share as much of the code I write as possible.

From now on leesjohn will primarily be to document the scripts in these repositories, and to share some original tools. I’ve already committed some things, which you can see at:
https://github.com/johnlees/bioinformatics (for bioinformatics tools)
https://github.com/johnlees/config (for software related configuration)
Hopefully this will make it very easy if someone ever does want to use some of the stuff I’ve written.

In the next few weeks I am hoping to write some posts about some of the more useful/general things in these repositories. I am also planning on making a wrapper script to allow you to use impute2 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html) to infer your whole genome from the ‘raw data’ you get if you have had a 23andme done (https://www.23andme.com/) – which as far as I can tell is not something yet available in the public sphere, but something I think many of 23andme’s clients could be interested in.