Archive

Statistics

I have added the likelihood ratio test (LRT) for logistic regression into seer, in addition to the existing Wald test as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes:

  • Both the p-value from the Wald test and the p-value from the new LRT are included in the output.
  • The LRT is expected to be a more powerful test in some situations, so I would recommend its use over the Wald test (a minimal R sketch of the two tests follows this list).
  • Testing has shown some clear cases (e.g. when population structure is not a major effect) where the Wald test performs poorly, and the LRT recovers the power of a chi-squared test.
  • I have also put in an LRT for linear regression, but based on an estimate of the residual errors (which therefore gives slightly different results to R at small sample sizes). I don’t expect it to make much, if any, difference in this case.
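For anyone who wants to see what the two tests are doing, here is a minimal sketch in R (this is not seer’s own code; the data are simulated and the names pheno and kmer are made up purely for illustration):

# Minimal R sketch of the Wald test and the LRT for one predictor in
# logistic regression (simulated data, illustrative names only)
set.seed(1)
n     <- 200
kmer  <- rbinom(n, 1, 0.5)                        # presence/absence of a k-mer
pheno <- rbinom(n, 1, plogis(-0.5 + 1.0 * kmer))  # binary phenotype

fit0 <- glm(pheno ~ 1,    family = binomial)      # null model
fit1 <- glm(pheno ~ kmer, family = binomial)      # model with the predictor

# Wald test: z-statistic on the fitted coefficient
summary(fit1)$coefficients["kmer", ]

# LRT: difference in deviance between the nested models, against a chi-squared
anova(fit0, fit1, test = "Chisq")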

There’s a nice article on the Wald, LRT and score tests here.

I’ll package this update in a future release, but if you want it now you can check out the master branch and compile it yourself.

I am currently trying to model the state of a genetic locus in bacteria (which may be one of six values) using a hierarchical Bayesian model. This lets me account for heterogeneity both within a sample and within a tissue type.

This is also good because:

  • I am able to incorporate results from existing papers as priors.
  • From the output I can quantify all the uncertainties within samples and tissues, and check for significantly different distributions between condition types.

I think what I have ended up working with is probably something that could also be called a mixture model.

At any rate, I wrote some R code (available on my github) that uses rjags to interface with JAGS. I started off with code based on exercise 9.2 of ‘Doing Bayesian Data Analysis’ (Kruschke 2011, 1st ed.), which models the bias of multiple coins from multiple mints using data from the number of heads observed from each coin.

I then extended this from two outcomes (success, fail), which use a binomial distribution for the data and a beta distribution for the priors, to six outcomes, for which I now use a multinomial distribution for the data and Dirichlet distributions for the priors.
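As a rough sketch of that change (simulated numbers only, nothing to do with my real data; gtools::rdirichlet is just one convenient way to draw from a Dirichlet in R):

# Two outcomes: beta prior on the success probability, binomial count data
set.seed(1)
theta   <- rbeta(1, shape1 = 2, shape2 = 2)
y_coin  <- rbinom(1, size = 100, prob = theta)

# Six outcomes: Dirichlet prior on the allele proportions, multinomial count data
alpha   <- rep(2, 6)                                 # illustrative prior values only
pi6     <- as.vector(gtools::rdirichlet(1, alpha))
y_locus <- rmultinom(1, size = 100, prob = pi6)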

However, I ran into an array of error messages when implementing this:

  RUNTIME ERROR:
  Compilation error on line 10.
  Dimension mismatch taking subset of pi

  RUNTIME ERROR:
  Compilation error on line 16.
  Dimension mismatch taking subset of alpha

  RUNTIME ERROR:
  Dimension mismatch when setting bounds in distribution ddirch
The first two of these were solved by getting all the brackets correct and consistent when transitioning from scalars -> vectors and vectors -> matrices. However, the last message, ‘Dimension mismatch when setting bounds in distribution ddirch’, proved more insidious.

As JAGS is pretty closely based on BUGS, I searched for the problem in that domain (which seems to be a big advantage of JAGS compared to e.g. STAN – BUGS has been around for so long that most problems have already arisen and been solved) and came across the following Stack Exchange/Stack Overflow questions that allowed me to solve it:
http://stackoverflow.com/questions/24349692/dirichlet-multinomial-winbugs-code
http://stats.stackexchange.com/questions/76644/define-priors-for-dirichlet-distribution-parameters-in-jags

The problem is that JAGS doesn’t allow the parameters of ddirch to be inferred. However, you can use the observation from the BUGS manual that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements delta[k] / sum(delta[1:K]), k = 1, …, K, is Dirichlet with parameters alpha[k], k = 1, …, K.
So do the inference on the gamma distribution instead!
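A quick numerical sanity check of that identity in R (arbitrary alpha values; the empirical means of the normalised gamma draws should match the Dirichlet means alpha / sum(alpha)):

# Normalised independent gamma(alpha[k], 1) draws behave as Dirichlet(alpha)
set.seed(1)
alpha <- c(1, 2, 3, 4, 5, 6)
draws <- t(replicate(10000, {
  delta <- rgamma(length(alpha), shape = alpha, rate = 1)
  delta / sum(delta)
}))
round(colMeans(draws), 3)     # empirical means of the normalised draws
round(alpha / sum(alpha), 3)  # theoretical Dirichlet means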

While you can find the full code here, the final working JAGS model code is pasted below. Note the use of the above trick, along with nested indices to allow for the hierarchical structure.

# JAGS model specification
model {
  # For each sample, likelihood and prior
  for (sample_index in 1:num_samples)
  {
    # likelihood
    # Number of reads that map to each allele is multinomial.
    # y and pi are matrices with num_samples rows and num_alleles columns
    y[sample_index,1:num_alleles] ~ dmulti(pi[sample_index,1:num_alleles], N[sample_index])

    # Dirichlet prior for proportion of population with allele in each sample
    #
    # This would be written like this:
    # pi[sample_index,1:num_alleles] ~ ddirch(alpha[tissue[sample_index],1:num_alleles])T(0.0001,0.9999)
    # Except JAGS doesn't allow the parameters of ddirch to be inferred
    #
    # Instead use the observation in the BUGS manual, and infer from a dgamma instead:
    # 'The trick is to note that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements 
    # delta[k] / sum(delta[1:K]), k = 1, ...,   K, is Dirichlet with parameters alpha[k], k = 1, ..., K.'
    for (allele_index in 1:num_alleles)
    {
      delta[sample_index, allele_index] ~ dgamma(alpha[tissue[sample_index],allele_index], 1)
      pi[sample_index, allele_index] <- delta[sample_index, allele_index] / sum(delta[sample_index,1:num_alleles])
    }
    
  }

  # For each tissue (blood or csf) hyperpriors
  for (tissue_index in 1:num_tissues)
  {
    # Convert mu and kappa to the Dirichlet parameters alpha
    # (as a and b were built from mu and kappa in the beta prior version)
    # mu = mean allele proportions for the tissue,
    # kappa = how closely the sequence represents the tissue - constant across all tissue types
    for (allele_index in 1:num_alleles)
    {
      alpha[tissue_index,allele_index] <- mu[tissue_index,allele_index] * kappa
    }
    # hyperprior for mu (Dirichlet dist)
    mu[tissue_index,1:num_alleles] ~ ddirch(AlphaMu[])
  }

  # Kappa hyperprior (gamma dist - shape and rate)
  kappa ~ dgamma(Skappa, Rkappa)

  # Top level constants for hyperpriors
  # This is a vector of length num_alleles
  AlphaMu <- alpha_priors

  # Gamma dist for kappa. First convert mean and sd to shape and rate
  Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
  Rkappa <- meanGamma/pow(sdGamma,2)

  meanGamma <- 10
  sdGamma <- 10
}
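For completeness, this is roughly how a model like this gets driven from R with rjags. This is only a sketch, not my actual script from github: the model file name, the placeholder data and the MCMC settings are all made up for illustration.

# Sketch of driving the model from R with rjags (placeholder data and settings)
library(rjags)

# Fake data: 8 samples x 6 alleles, 4 samples from each of 2 tissues
set.seed(1)
tissue <- rep(1:2, each = 4)
y <- t(rmultinom(8, size = 100, prob = rep(1/6, 6)))

jags_data <- list(y = y, N = rowSums(y), tissue = tissue,
                  num_samples = nrow(y), num_alleles = ncol(y),
                  num_tissues = 2, alpha_priors = rep(1, ncol(y)))

# The model above, saved as e.g. 'allele_model.jags' (name is a placeholder)
model <- jags.model("allele_model.jags", data = jags_data,
                    n.chains = 4, n.adapt = 1000)
update(model, 5000)                                   # burn-in
samples <- coda.samples(model, variable.names = c("pi", "mu", "kappa"),
                        n.iter = 20000, thin = 10)
summary(samples)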

On github: https://github.com/johnlees/pMCMC

Parallel implementation of MCMC using MPI – coded by Hákon Jónsson, John Lees and Tobias Madsen

Code is available in C++. Implementations in R and Perl do not provide speedups under testing, due to execution overheads, but are included as easier-to-read ‘pseudocode’ if required.

Details can be found in this draft paper: pMCMC

Acknowledgements

This work was completed for the Oxford Summer School in Computational Biology 2012 (http://www.stats.ox.ac.uk/research/genome/summer_school_2013/osscb12), and was supervised by Geoff Nicholls, Joe Herman and Jotun Hein.

Funding provided by a Wellcome Trust Vacation grant (JL) and COGANGS – EU grant (HJ, TM).