
Statistics

In our software PopPUNK we draw a plot of a Gaussian mixture model, using both the implementation and the excellent plotting example from the scikit-learn documentation:

[Figure: GMM ellipse example. Gaussian mixture model with mixture components plotted as ellipses.]

My input is 2D distances, which I first scale so that each axis is normalised between 0 and 1 (dividing by the maximum in each dimension); this helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is trivial to convert back into the original co-ordinates:

# scale X so each axis lies between 0 and 1
scale = np.amax(X, axis=0)
scaled_X = X / scale
# plot the points assigned to component i in the scaled space
plt.scatter(scaled_X[Y == i, 0], scaled_X[Y == i, 1], 0.4, color=color)
# plot the same points in the original co-ordinates (undo the scaling)
plt.scatter((scaled_X * scale)[Y == i, 0], (scaled_X * scale)[Y == i, 1], 0.4, color=color)

The only thing that wasn’t obvious was how to scale the covariances, which are used to draw the ellipses. I knew they needed to be multiplied by the scale twice as they are variances (squared), and had a vague memory of something like xAx^T from transforming ellipses/conic sections in one of my first year maths courses. Happily that was enough to do a search, turning up an explanation on Stack Exchange which confirmed this vague memory: https://math.stackexchange.com/questions/1184523/scaling-a-covariance-matrix
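
In other words, if the scaled points are x' = Sx with S = diag(scale), then cov(x') = S cov(x) S^T. Here is a quick numpy check of this rule (just an illustration, not part of the plotting code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=100000)
scale = np.array([10.0, 3.0])

# scaling each axis is x' = Sx with S = diag(scale),
# so the covariance transforms as S cov(x) S^T
S = np.diag(scale)
print(S @ np.cov(X.T) @ S.T)
# should (approximately) match the covariance computed directly in the scaled space
print(np.cov((X * scale).T))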

Here is the code for making the plot, incorporating the linear scaling:

import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold', 'darkorange'])

fig = plt.figure(figsize=(11, 8), dpi=160, facecolor='w', edgecolor='k')
splot = plt.subplot(1, 1, 1)
# means and covariances come from the fitted mixture (in the scaled space);
# X and Y are the original points and their component assignments; scale as above
for i, (mean, covar, color) in enumerate(zip(means, covariances, color_iter)):
    # convert the covariance back into the original co-ordinates
    scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T)
    v, w = np.linalg.eigh(scaled_covar)
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    u = w[0] / np.linalg.norm(w[0])
    # as the DP will not use every component it has access to
    # unless it needs it, we shouldn't plot the redundant
    # components.
    if not np.any(Y == i):
        continue
    plt.scatter(X[Y == i, 0], X[Y == i, 1], 0.4, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan(u[1] / u[0])
    angle = 180. * angle / np.pi  # convert to degrees
    ell = mpl.patches.Ellipse(mean * scale, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(0.5)
    splot.add_artist(ell)

plt.title(title)
plt.savefig("ellipses.png")
plt.close()
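
For context, the means, covariances and assignments Y used above come from the fitted mixture in the scaled space. A minimal sketch of how they might be obtained with scikit-learn (the exact model setup in PopPUNK may differ):

from sklearn.mixture import BayesianGaussianMixture

# fit a Dirichlet process mixture in the scaled space (illustrative settings)
dpgmm = BayesianGaussianMixture(n_components=5, covariance_type='full',
                                weight_concentration_prior_type='dirichlet_process')
dpgmm.fit(scaled_X)
means = dpgmm.means_              # component means, in the scaled space
covariances = dpgmm.covariances_  # one covariance matrix per component
Y = dpgmm.predict(scaled_X)       # component assignment for each point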

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates in logistic regression when the data are separable or close to separable (when a chi-squared contingency table has small entries).

I found that although there is an R implementation (logistf), I couldn’t find an equivalent in another language, including python’s statsmodels. Here is a gist with my python functions and a skeleton of how to use them and calculate p-values, in case anyone would like to use this in future without having to write the optimiser themselves.
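
For anyone curious what the optimiser involves, here is a minimal sketch of Firth-penalised logistic regression, written from the standard description of the method (illustrative only, not the code in the gist or in the SEER reimplementation):

import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-6):
    """Firth-penalised logistic regression (a sketch).

    Maximises l(beta) + 0.5 * log|X'WX| using the modified score
    U*(beta) = X'(y - p + h(0.5 - p)), where h are the hat matrix diagonals.
    """
    beta = np.zeros(X.shape[1])
    ll_old = -np.inf
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        XtWX = X.T @ (X * W[:, None])
        XtWX_inv = np.linalg.inv(XtWX)
        # penalised log-likelihood, used to check convergence
        ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) \
            + 0.5 * np.linalg.slogdet(XtWX)[1]
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
        # diagonal of the hat matrix H = W^0.5 X (X'WX)^-1 X' W^0.5
        Xw = X * np.sqrt(W)[:, None]
        h = np.sum((Xw @ XtWX_inv) * Xw, axis=1)
        # Firth-modified score and Newton-Raphson step
        U = X.T @ (y - p + h * (0.5 - p))
        beta = beta + XtWX_inv @ U
    return beta, ll

p-values can then be obtained with a likelihood ratio test on the penalised likelihoods (as logistf does): refit with the coefficient of interest fixed at zero and compare twice the difference in penalised log-likelihood to a chi-squared distribution with one degree of freedom.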

I have added the likelihood ratio test (LRT) for logistic regression into seer, in addition to the existing Wald test, as noted in issue 42. As this is likely to remain undocumented elsewhere, here are some brief notes:

  • Both the p-value from the Wald test and the p-value from the new LRT are in the output.
  • The LRT is expected to be a more powerful test in some situations. I would recommend its use over the Wald test.
  • Testing has shown some clear cases (e.g. when population structure is not a major effect) where the Wald test performs poorly, and the LRT recovers the power of a chi-squared test.
  • I have also put in a LRT for linear regression, but based on an estimate of the residual errors (which therefore gives slightly different results to R at small sample sizes). I don’t expect it to make much, if any, difference in this case.

There’s a nice article on the Wald, LRT and score tests here.
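
To make the relationship between the two tests concrete, here is a minimal sketch (using statsmodels on simulated data, not the seer code itself) of how Wald and LRT p-values for a single variant could be computed:

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500
variant = rng.binomial(1, 0.3, n)                  # hypothetical k-mer presence/absence
covars = sm.add_constant(rng.normal(size=(n, 2)))  # intercept plus two covariates
X = np.column_stack([covars, variant])
logit_p = X @ np.array([-0.5, 0.2, 0.1, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

alt = sm.Logit(y, X).fit(disp=0)        # full model
null = sm.Logit(y, covars).fit(disp=0)  # model without the variant

wald_p = alt.pvalues[-1]                # Wald p-value for the variant coefficient
lrt_stat = 2 * (alt.llf - null.llf)     # likelihood ratio statistic
lrt_p = stats.chi2.sf(lrt_stat, df=1)   # compared with a chi-squared, 1 d.f.
print(wald_p, lrt_p)

The Wald p-value comes from the estimated coefficient and its standard error in the full model, while the LRT compares the full and null model likelihoods directly, which behaves better when the coefficient estimate is unstable.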

I’ll package this update in a future release, but if you want it now you can check out the master branch and compile it yourself.

I am currently trying to model the state of a genetic locus in bacteria (which may take one of six values) using a hierarchical Bayesian model. This allows me to account for the heterogeneity within each sample, as well as the heterogeneity within each tissue type.

This is also good because:

  • I am able to incorporate results from existing papers as priors.
  • From the output I can quantify all the uncertainties within samples and tissues, and check for significantly different distributions between condition types.

I think what I have ended up working with is probably something that could also be called a mixture model.

At any rate, I wrote some R code (available on my github) that uses rjags to interface with JAGS. I started off with code based on exercise 9.2 of ‘Doing Bayesian Data Analysis’ (Kruschke 2011, 1st ed.), which models the bias of multiple coins from multiple mints using data from the number of heads observed from each coin.

I then extended this from two outcomes (success, fail), which use a binomial distribution for the data and a beta distribution for the priors, to six outcomes, using a multinomial distribution for the data and Dirichlet distributions for the priors.

However I ran into an array of error messages when implementing this:

RUNTIME ERROR:
Compilation error on line 10.
Dimension mismatch taking subset of pi

RUNTIME ERROR:
Compilation error on line 16.
Dimension mismatch taking subset of alpha

RUNTIME ERROR:
Dimension mismatch when setting bounds in distribution ddirch

The first two of these were solved by getting all the brackets correct and consistent when transitioning from scalars -> vectors and vectors -> matrices. However the last message, ‘Dimension mismatch when setting bounds in distribution ddirch’, proved more insidious.

As JAGS is pretty closely based on BUGS, I searched for the problem in that domain (which seems to be a big advantage of JAGS compared to e.g. Stan: BUGS has been around for so long that most problems have already arisen and been solved) and came across the following Stack Overflow/Cross Validated questions that allowed me to solve the problem:
http://stackoverflow.com/questions/24349692/dirichlet-multinomial-winbugs-code
http://stats.stackexchange.com/questions/76644/define-priors-for-dirichlet-distribution-parameters-in-jags

The problem is that JAGS doesn’t allow the parameters of ddirch to be inferred. However, you can use the observation (from the BUGS manual) that ‘if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements delta[k] / sum(delta[1:K]), k = 1, …, K, is Dirichlet with parameters alpha[k], k = 1, …, K.’
Then infer the parameters through the gamma distribution instead!
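
As a quick numerical check of this trick (a numpy sketch, separate from the JAGS model below), normalised independent gammas behave like direct Dirichlet draws:

import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 5.0, 1.0, 0.5, 3.0, 1.5])

# delta[k] ~ Gamma(alpha[k], 1), then pi = delta / sum(delta)
delta = rng.gamma(shape=alpha, scale=1.0, size=(100000, len(alpha)))
pi = delta / delta.sum(axis=1, keepdims=True)

# compare with direct Dirichlet draws: both means should be alpha / sum(alpha)
direct = rng.dirichlet(alpha, size=100000)
print(pi.mean(axis=0))
print(direct.mean(axis=0))
print(alpha / alpha.sum())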

While you can find the full code here, the final working JAGS model code is pasted below. Note the use of the above trick, together with nested indices to allow for the hierarchical structure.

# JAGS model specification
model {
  # For each sample, likelihood and prior
  for (sample_index in 1:num_samples)
  {
    # likelihood
    # Number of reads that map to each allele is multinomial.
    # y and pi are matrices with num_samples rows and num_alleles columns
    y[sample_index,1:num_alleles] ~ dmulti(pi[sample_index,1:num_alleles], N[sample_index])

    # Dirichlet prior for proportion of population with allele in each sample
    #
    # This would be written like this:
    # pi[sample_index,1:num_alleles] ~ ddirch(alpha[tissue[sample_index],1:num_alleles])T(0.0001,0.9999)
    # Except JAGS doesn't allow the parameters of ddirch to be inferred
    #
    # Instead use the observation in the BUGS manual, and infer from a dgamma instead:
    # 'The trick is to note that if delta[k] ~ dgamma(alpha[k], 1), then the vector with elements 
    # delta[k] / sum(delta[1:K]), k = 1, ...,   K, is Dirichlet with parameters alpha[k], k = 1, ..., K.'
    for (allele_index in 1:num_alleles)
    {
      # normalised gammas are Dirichlet distributed
      pi[sample_index, allele_index] <- delta[sample_index, allele_index] / sum(delta[sample_index, 1:num_alleles])
      delta[sample_index, allele_index] ~ dgamma(alpha[tissue[sample_index], allele_index], 1)
    }
    
  }

  # For each tissue (blood or csf) hyperpriors
  for (tissue_index in 1:num_tissues)
  {
    # Reparameterise the Dirichlet concentrations alpha in terms of mu and kappa
    # mu = mean allele proportions for the tissue,
    # kappa = how closely the samples represent the tissue - constant across all tissue types
    for (allele_index in 1:num_alleles)
    {
      alpha[tissue_index,allele_index] <- mu[tissue_index,allele_index] * kappa
    }

    # hyperprior for mu (Dirichlet dist)
    mu[tissue_index,1:num_alleles] ~ ddirch(AlphaMu[])
  }

  # Kappa hyperprior (gamma dist - shape and rate)
  kappa ~ dgamma(Skappa, Rkappa)

  # Top level constants for hyperpriors
  # This is a vector of length num_alleles
  AlphaMu <- alpha_priors

  # Gamma dist for kappa. First convert mean and sd to shape and rate
  Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
  Rkappa <- meanGamma/pow(sdGamma,2)

  meanGamma <- 10
  sdGamma <- 10
}

On github: https://github.com/johnlees/pMCMC

Parallel implementation of MCMC using MPI – coded by Hákon Jónsson, John Lees and Tobias Madsen

Code is available in C++. Implementations in R and Perl are also included as easier-to-read ‘pseudocode’ if required, but in testing they did not provide speedups due to execution overheads.

Details can be found in this draft paper: pMCMC

Acknowledgements

This work was completed for the Oxford Summer School in Computational Biology 2012 (http://www.stats.ox.ac.uk/research/genome/summer_school_2013/osscb12), and was supervised by Geoff Nicholls, Joe Herman and Jotun Hein.

Funding provided by a Wellcome Trust Vacation grant (JL) and COGANGS, an EU grant (HJ, TM).