I recently upgraded from OS X 10.10 to 10.11. This has upgraded the version of the gfortran dynamic library from 2 to 3 (in /Library/Frameworks/R.framework/Resources/lib), which in turn causes problems in various R packages (msm, ape).

For those which give an error along the lines of

unable to load shared object

the solution seems to be to use install.packages recursively. Use it on the package that failed. If a dependency fails, use it on that too. Then restart R.

Some packages requiring compilation which link libgfortran (-lgfortran) fail, as the linker line does not give the correct directory through -L. I also have gfortran installed as part of gcc through homebrew, at /Users/john/homebrew/lib/gcc/4.9 (to do this, use ‘brew install gcc’).

Using this, add the line

LDFLAGS=-L/Users/jl11/homebrew/lib/gcc/4.9

to the file ~/.R/Makevars. This should work, as long as when you load the library you have this directory either indexed through OS X’s equivalent of ldconfig (if there is one?), or it is in LD_LIBRARY_PATH.

 

Advertisements

PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.

The method is first discussed in a 2010 PLOS Comp Bio paper:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000770
and a guide to its applications and use in a 2012 Nature Protocols paper:
http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2011.457.pdf

To install a command line version of the tool, you can clone it from the github page

git clone https://github.com/PMBio/peer

When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of documentation). To install these, use the following commands:

cd peer && mkdir build && cd build
cmake -D CMAKE_INSTALL_PREFIX=~/software -D BUILD_PEERTOOL=1 ./..
make
make install

Which will install peertool to ~/software/bin, which you can put on your path.

 

Problem: I have genetic data at a single variant site, where the minor allele frequency (MAF) is set, and the prevalence of disease is known (Sr). The variant is truly associated with the phenotype, at an odds ratio (OR) I want to set. How do I simulate the phenotypes given these three parameters, and whether each sample has the variant (exposed) or not?

This is analogous to simulating data in a 2×2 contigency table, as discussed in stack exchange here: http://stats.stackexchange.com/questions/13193/generating-data-with-a-pre-specified-odds-ratio

Solution: As alluded to in one of the comments of the stack exchange post, but unfortunately not derived or written out in a formatted way, the number of disease cases is a quadratic equation which can be rewritten in terms of the above parameters. I found the derivation from the table of p11 (or De, number of disease cases that have the variant) in a book appendix.

This requires a little rewriting to get in terms of these parameters, but can be written down as:

Screen Shot 2015-10-09 at 17.18.13

I have implemented this in c++ as a function returning a tuple

std::tuple<double, double> p_case(const double OR, const double MAF, const double Sr)
{
   // Convenient defines
   const double m1 = Sr/(Sr + 1);

   // Quadratic equation
   const double a = OR - 1;
   const double b = -((m1 + MAF)*OR - m1 + 1 - MAF);
   const double c = OR*m1*MAF;
   double a1 = (-b - pow(pow(b, 2) - 4*a*c, 0.5)) / (2*a);

   // Probabilities to return
   double p_e = a1 / MAF;
   double p_ne = (m1 - a1)/(1 - MAF);

   return std::make_tuple(p_e, p_ne);
}

The whole code for the simulation can be found in subsample_seer.cpp here:
https://github.com/johnlees/bioinformatics/tree/master/seer

I am currently trying to use ALF (the stand-alone version) to simulate data from a custom tree, and include realistic parameters for SNP rate, INDEL rate, gene loss and recombination rates. This is a little different to what I think the program was originally designed for – small numbers of divergent organisms – but is probably an easier problem.

ALF is good because it includes a lot of features of evolution more naive models don’t encompass, and gives good output useful for further simulation and testing work.

I’ve made the following notes and tweaks to fix issues as I’ve been going along, which I hope may be of use to anyone trying to use the software for this purpose

  • For custom INDEL distributions, they must be specified in the parameters file as (note the double bracket):
    IndelModel(0.02,'CUSTOM', [[0.5,0.25,0.2,0.05]], 20)

    (thanks to the author Daniel Dalquen for helping me with this)

  • Custom trees must have no labels on the internal nodes. To ignore these you can remove the InternalLabels argument on line 820 of lib/simulator/SE_DataOutput.drw
  • Make sure ‘unitIsPam’ is set to false for trees with substitutions per site, which is the default unit for e.g. Raxml trees
  • If you’re simulating a lot of lateral gene transfer events with multiple genes, you’ll run into a transLoc out of range error due to a bug in the code. This can be fixed by changing line 604 in lib/simulator/SE_Evolutionary_Events.drw to
    place := Rand(0..length(geneR[org]) - lgtSize);

I have also written some helper scripts, which can be found in https://github.com/johnlees/bioinformatics/tree/master/sequence_evolution/ALF

  • gff2darwin.pl: Helps convert gff annotation files to custom input starting sequences
  • alf_db_to_fasta.pl: Converts the DB output formatting into a single fasta contig for an organism -> observed organism genome
  • alf_msa_concat.pl: Converts MSA output (which is by gene) into true alignments by organism -> true alignment
  • genes_to_contig.pl: Concatenates all contigs to create a whole genome alignment file (output from alf_msa_concat,pl) -> true alignment for population

I was trying to compile some C++ of the form

std::vector<std::thread> threads;
for (int i = 0; i<num_threads; ++i)
{
   threads.push_back(std::thread(logisticTest, kmer_lines[i], samples);
}

with function prototype

void logisticTest(Kmer& k, const std::vector<Sample>& samples);

on OS X 10.10 with clang++ – Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)

First error message: no matching function for call to ‘__invoke’
Solution: add ‘-std=c++11 -stdlib=libc++’ to CXXFLAGS in Makefile
(thanks to http://stackoverflow.com/questions/14115314/how-to-compile-c11-with-clang-3-2-on-osx-lion)

Second error message: attempt to use a deleted function __invoke(_VSTD::move(_VSTD::get<0>(__t)), _VSTD::mov…
Solution: pass to thread by value rather than reference. e.g.

std::reference_wrapper: threads.push_back(std::thread(logisticTest, std::ref(kmer_lines[i]), std::cref(samples)));

(thanks to http://stackoverflow.com/questions/8299545/passing-arguments-to-thread-function)

After upgrading from OS X 10.8 (Mountain Lion) to 10.10 (Yosemite), I found that gvim no longer worked and exited with a cryptic dyld message similar to ‘dyld: Symbol not found:’.

The first thing I tried was uninstalling it with homebrew, then reinstalling:

brew uninstall macvim
brew install macvim

But I got a ‘Trace/BPT trap: 5’ during make. Trying to fix this by doing the things suggested by ‘brew doctor’ and installing openssl gave me essentially the same errors.

brew install openssl

/bin/sh: line 1: 71319 Trace/BPT trap: 5 /usr/bin/perl tools/c_rehash certs/demo
make[2]: *** [rehash.time] Error 133
make[1]: *** [openssl] Error 2
make: *** [build_apps] Error 1

However here I could see perl was used in the ./configure step. Trying to run my perl scripts, I found they no longer worked:

dyld: lazy symbol binding failed: Symbol not found: _Perl_Istack_sp_ptr
Referenced from: /Users/jl11/perl/lib/perl5/darwin-thread-multi-2level/auto/Cwd/Cwd.bundle
Expected in: flat namespace

dyld: Symbol not found: _Perl_Istack_sp_ptr
Referenced from: /Users/jl11/perl/lib/perl5/darwin-thread-multi-2level/auto/Cwd/Cwd.bundle
Expected in: flat namespace

My perl install appears to be broken, and that was what was causing all the problems. Something similar was encountered before: http://probably.co.uk/rvm-install-and-homebrew-openssl-error-tracebpt-trap-5.html

I used perlbrew to fix the problem, and now it all works again! Run the following commands

\curl -L http://install.perlbrew.pl | bash
perlbrew install stable ­-Dusethreads
perlbrew switch stable

I had a PERL5LIB environment variable set which caused problems, check your @INC:

perl -e 'print join "\n", @INC'
/Users/jl11/perl/lib/perl5/darwin-thread-multi-2level
/Users/jl11/perl/lib/perl5
/Users/jl11/perl5/perlbrew/perls/perl-5.20.1_WITH_THREADS/lib/site_perl/5.20.1/darwin-thread-multi-2level
/Users/jl11/perl5/perlbrew/perls/perl-5.20.1_WITH_THREADS/lib/site_perl/5.20.1
/Users/jl11/perl5/perlbrew/perls/perl-5.20.1_WITH_THREADS/lib/5.20.1/darwin-thread-multi-2level
/Users/jl11/perl5/perlbrew/perls/perl-5.20.1_WITH_THREADS/lib/5.20.1

If it’s set put ‘PERL5LIB=’ in your ~/.bashrc or equivalent

Finally, you may need to install your perl modules into the new perl version again, documented here. Also, be wary of using scripts which have #!/usr/bin/perl as their first line, as they will run using the wrong perl version. perl <script_name> is safer

I am currently trying to model the state of a genetic locus in bacteria (which may be one of six values) using a hierarchical Bayesian model. This allows me to account for the fact that within a sample there is heterogeneity, as well as there being heterogeneity within a tissue type.

This is also good because:

  • I am able to incorporate results from existing papers as priors.
  • From the output I can quantify all the uncertainties within samples and tissues, check for significantly different distributions between condition types.

I think what I have ended up working with is probably something that could also be called a mixture model.

At any rate, I wrote some R code (available on my github) that uses rjags to interface with JAGS. I started off with code based on exercise 9.2 of ‘Doing Bayesian Data Analysis’ (Kruschke 2011, 1st ed.), which models the bias of multiple coins from multiple mints using data from the number of heads observed from each coin.

I then extended this to more than two outcomes (success, fail), which use a binomial distribution for the data and a beta distribution for the priors, to six outcomes. I now use a multinomial distribution for the data and Dirichlet distributions for the priors.

However I ran into an array of error messages when implementing this:

  RUNTIME ERROR:
Compilation error on line 10.
Dimension mismatch taking subset of pi  

 RUNTIME ERROR:
Compilation error on line 16.
Dimension mismatch taking subset of alpha

 RUNTIME ERROR:
Dimension mismatch when setting bounds in distribution ddirch

The first two of these was solved by getting all the brackets correct and consistent when tranisitoning from scalars -> vectors and vectors -> matrices. However the last message, ‘Dimension mismatch when setting bounds in distribution ddirch’, proved more insidious.

As JAGS is pretty closely based on BUGS, I searched for the problem in that domain (which seems to be a big advantage of JAGS compared to e.g. STAN – BUGS has been around for so long most problems have already arisen and been solved) and came across the following stackexchange/overflow issues that allowed me to solve the problem:
http://stackoverflow.com/questions/24349692/dirichlet-multinomial-winbugs-code
http://stats.stackexchange.com/questions/76644/define-priors-for-dirichlet-distribution-parameters-in-jags

The problem is that JAGS doesn’t allow the parameters of ddirch to be inferred. However you can use the observation that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements delta[k] / sum(delta[1:K]), k = 1, …, K, is Dirichlet with parameters alpha[k], k = 1, …, K.’
Then infer in the gamma distribution instead!

While you can find the full code here, the final working JAGS model code is pasted below. Note the use of the above trick, while using nested indices to allow for the hierarchical structure.

# JAGS model specification
model {
  # For each sample, likelihood and prior
  for (sample_index in 1:num_samples)
  {
    # likelihood
    # Number of reads that map to each allele is multinomial.
    # y and pi are matrices with num_samples rows and num_alleles columns
    y[sample_index,1:num_alleles] ~ dmulti(pi[sample_index,1:num_alleles], N[sample_index])

    # Dirichlet prior for proportion of population with allele in each sample
    #
    # This would be written like this:
    # pi[sample_index,1:num_alleles] ~ ddirch(alpha[tissue[sample_index],1:num_alleles])T(0.0001,0.9999)
    # Except JAGS doesn't allow the parameters of ddirch to be inferred
    #
    # Instead use the observation in the BUGS manual, and infer from a dgamma instead:
    # 'The trick is to note that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements 
    # delta[k] / sum(delta[1:K]), k = 1, ...,   K, is Dirichlet with parameters alpha[k], k = 1, ..., K.'
    for (allele_index in 1:num_alleles)
    {
      pi[sample_index, allele_index] 
      delta[sample_index, allele_index] ~ dgamma(alpha[tissue[allele_index],allele_index], 1)
    }
    
  }

  # For each tissue (blood or csf) hyperpriors
  for (tissue_index in 1:num_tissues)
  {
    # Convert a and b in beta prior, to mu and kappa
    # mu = mean allele for tissue,
    # kappa = how closely does sequence represent tissue - constant across all tissue types
    for (allele_index in 1:num_alleles)
    {
      alpha[tissue_index,allele_index] 
    }
 # hyperpriors for mu (beta dist)
 mu[tissue_index,1:num_alleles] ~ ddirch(AlphaMu[])
 }

 # Kappa hyperprior (gamma dist - shape and rate)
 kappa ~ dgamma(Skappa, Rkappa)

 # Top level constants for hyperpriors
 # This is a vector of length num_alleles
 AlphaMu <- alpha_priors

 # Gamma dist for kappa. First convert mean and sd to shape and rate
 Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
 Rkappa <- meanGamma/pow(sdGamma,2)

 meanGamma <- 10
 sdGamma <- 10
}