I want to count the number of unique patterns in a vcf file. First I convert it to text with bcftools query:

bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt

The resulting patterns.txt is about 100Gb. The best way I found to count the unique patterns in this was with the following command:

LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l

This used 1063Mb RAM, took 1521s and used a maximum of around 75Gb tmp space on my home (as the /tmp drive on the cluster ran out of space).

With thanks to http://unix.stackexchange.com/questions/120096/how-to-sort-big-files


I was working on an OS X system which kept getting annoying pop-ups about the system needing clean up, anti-virus software etc. I was able to see that the window was titled ‘helperamc’.

It turns out this was a remnant from Advanced Mac Cleaner, the use of which I won’t comment on here. The user of the system had tried to remove it when upgrading OS X version, but the annoying advertising component remained.

Killing the process and deleting the application doesn’t work, as it has a daemon to relaunch itself. After some investigation I found the following commands (issued in terminal.app) will sort out this issue and remove helperamc for good:

launchctl unload ~/Library/LaunchAgents/com.pcv.hlpramcn.plist
rm ~/Library/Application\ Support/amc/helperamc.app

As they are user files admin access is not needed. You may need to kill the helperamc process between these commands.

A recent paper by Earle et. al. nicely showed the use of linear mixed models to determine drug resistance related genetic variants. Part of the software provided is an R package called bugwas, which will make the nice plots in figure 1 for you.

Here are some notes on how to get it to run, and correctly format the input files

Getting gemma to work

You’ll need to use the author’s modified version of gemma, which can be downloaded here. This may not run on your system, due to the blas and lapack libraries being in unexpected places. You can easily solve this by making some symlinks.

First run ldd on the gemma executable to check which libraries cannot be found

ldd /nfs/users/nfs_j/jl11/installations/gemma.0.93b
 linux-vdso.so.1 => (0x00007fff28000000)
 libgsl.so.0 => /usr/lib/libgsl.so.0 (0x00007f9dfbcc0000)
 libgslcblas.so.0 => /usr/lib/libgslcblas.so.0 (0x00007f9dfba78000)
 libblas.so.3 => not found
 liblapack.so.3 => not found
 libstdc++.so.6 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libstdc++.so.6 (0x00007f9dfa8d0000)
 libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9dfa5d0000)
 libgcc_s.so.1 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgcc_s.so.1 (0x00007f9dfa3b8000)
 libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9df9ff8000)
 libgfortran.so.3 => /software/hgi/pkglocal/gcc-4.9.1/lib64/libgfortran.so.3 (0x00007f9df9cd8000)
 /lib64/ld-linux-x86-64.so.2 (0x00007f9dfc120000)
 libquadmath.so.0 => /software/hgi/pkglocal/gcc-4.9.1/lib/../lib64/libquadmath.so.0 (0x00007f9df9a98000)

In my case this was libblas and liblapack. Find the libraries using ‘locate liblapack’ etc. Above, I have created symlinks to these in a location in my LD_LIBRARY_PATH variable ‘/nfs/users/nfs_j/jl11/software/lib’ (you can make a similar directory and export it using ‘export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<new directory>’).

ln -s /usr/lib/lapack/liblapack.so /nfs/users/nfs_j/jl11/software/lib/liblapack.so.3
ln -s /usr/lib/libblas/libblas.so /nfs/users/nfs_j/jl11/software/lib/libblas.so.3

File formatting

The bugwas gen file format is as follows:
ps sample_1_name sample_2_name
23 A C

If you have a file readable into plink (stored as snps.bed, snps.bim and snps.fam) and bcftools, this is an easy conversion.

plink --bfile snps --recode vcf-iid --geno 0.05 --out bugwas
bcftools plugin missing2ref bugwas.vcf > bugwas2.vcf
bcftools query -f '%ID\t[%TGT\t]\n' bugwas2.vcf > bugwas_in.gen
sed -i -e 's/coor_//' -e 's/\/.//g' -e 's/\t$//' bugwas_in.gen
head -7 bugwas.vcf| tail -1 | cut -f 10- | sed 's/#/_/g'| awk '{print "ps\t" $0}' | cat - bugwas_in.gen > tmp.gen
mv tmp.gene bugwas_in.gen

As bugwas requires all sites to be imputed this code will take the major allele where any site is missing. An alternative would be to use the ancestral state, which the authors suggest using ClonalFrameML to do, though this will take longer to run.

If you’ve got a messy vcf rather than a bed as input a script I wrote may help you with missing/multiallelic sites.

Other pieces

You’ll also need gemma to generate the kinship matrix for the random effects:

gemma -bfile snps -gk 1 -o gemma_snps

You’ll need a phylogeny, which you can draw by standard methods (I’d recommend fasttree on your alignment if you’ve got > 1000 samples). If you’ve already got a tree make sure the tip labels match the samples in the gen file.

Run bugwas

The command you’ll need to run in R is of the form:

lin_loc(gen=”bugwas_in.gen”,pheno=”bugwas.pheno”,phylo=”fasttree.tr”,prefix=”gwas”,gem.path=”/nfs/users/nfs_j/jl11/installations/gemma.0.93b”, relmatrix=”output/gemma_snps.cXX.txt”, output.dir=”./”)

I’ve found for around 2 000 samples and 110 000 snps I needed around 15Gb of RAM and 4-5 hours to run.

Calculates the mean depth of mapped coverage over the whole genome from a mapped bam, using bedtools

bedtools genomecov -ibam  -g bedtools_genome.txt \
 | awk '$1=="genome" {tot += $2*$3; reads += $3} END {print tot/reads}'

To generate bedtools_genome.txt use

samtools faidx reference.fa
cut -f 1-2 reference.fa.fai > bedtools_genome.txt

I recently upgraded from OS X 10.10 to 10.11. This has upgraded the version of the gfortran dynamic library from 2 to 3 (in /Library/Frameworks/R.framework/Resources/lib), which in turn causes problems in various R packages (msm, ape).

For those which give an error along the lines of

unable to load shared object

the solution seems to be to use install.packages recursively. Use it on the package that failed. If a dependency fails, use it on that too. Then restart R.

Some packages requiring compilation which link libgfortran (-lgfortran) fail, as the linker line does not give the correct directory through -L. I also have gfortran installed as part of gcc through homebrew, at /Users/john/homebrew/lib/gcc/4.9 (to do this, use ‘brew install gcc’).

Using this, add the line


to the file ~/.R/Makevars. This should work, as long as when you load the library you have this directory either indexed through OS X’s equivalent of ldconfig (if there is one?), or it is in LD_LIBRARY_PATH.


PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.

The method is first discussed in a 2010 PLOS Comp Bio paper:
and a guide to its applications and use in a 2012 Nature Protocols paper:

To install a command line version of the tool, you can clone it from the github page

git clone https://github.com/PMBio/peer

When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of documentation). To install these, use the following commands:

cd peer && mkdir build && cd build
make install

Which will install peertool to ~/software/bin, which you can put on your path.


Problem: I have genetic data at a single variant site, where the minor allele frequency (MAF) is set, and the prevalence of disease is known (Sr). The variant is truly associated with the phenotype, at an odds ratio (OR) I want to set. How do I simulate the phenotypes given these three parameters, and whether each sample has the variant (exposed) or not?

This is analogous to simulating data in a 2×2 contigency table, as discussed in stack exchange here: http://stats.stackexchange.com/questions/13193/generating-data-with-a-pre-specified-odds-ratio

Solution: As alluded to in one of the comments of the stack exchange post, but unfortunately not derived or written out in a formatted way, the number of disease cases is a quadratic equation which can be rewritten in terms of the above parameters. I found the derivation from the table of p11 (or De, number of disease cases that have the variant) in a book appendix.

This requires a little rewriting to get in terms of these parameters, but can be written down as:

Screen Shot 2015-10-09 at 17.18.13

I have implemented this in c++ as a function returning a tuple

std::tuple<double, double> p_case(const double OR, const double MAF, const double Sr)
   // Convenient defines
   const double m1 = Sr/(Sr + 1);

   // Quadratic equation
   const double a = OR - 1;
   const double b = -((m1 + MAF)*OR - m1 + 1 - MAF);
   const double c = OR*m1*MAF;
   double a1 = (-b - pow(pow(b, 2) - 4*a*c, 0.5)) / (2*a);

   // Probabilities to return
   double p_e = a1 / MAF;
   double p_ne = (m1 - a1)/(1 - MAF);

   return std::make_tuple(p_e, p_ne);

The whole code for the simulation can be found in subsample_seer.cpp here: