I’ve just finished one of the most difficult compiling/packaging jobs I’ve attempted, and created a working bioconda package for seer. You can look at the pull request to see how many times I failed: https://github.com/bioconda/bioconda-recipes/pull/11263

(I would note I have made this package for legacy users. I would direct anyone interested in the software itself to the reimplementation pyseer)

The reason this was difficult is that I had included a number of other packages, all of which also need compiling, which added to the complexity. While most of them were already available in some form on anaconda, it turned out there were some issues with using the defaults. I want to note some of the things I had to modify here.

gcc and glibc version

seer requires gcc >4.8 to compile, and glibc >4.9 to run. The default compiler version in conda is only 4.8, so to get a newer one simply add a ‘conda_build_config.yaml’ file containing:

c_compiler:
  - gcc # [linux]
cxx_compiler:
  - gxx # [linux]

Submodules

I had dlib and gzstream as submodules. If you use a git_url as the source in meta.yaml these are cloned recursively, but not if you use a standard url. I needed to run ‘git clone --recursive’ on the repository and tarball it myself, so that these directories were included in the GitHub release.

gzstream

gzstream is not available on the bioconda channels, so I had to compile it myself. I included it as a submodule, but rather than using the default Makefile I needed to add the conda-defined compiler flags to ensure they were consistent with later steps (particularly -fPIC in CPPFLAGS):

make CXX="$CXX $CXXFLAGS" CPPFLAGS="$CPPFLAGS -I. -I$PREFIX/include" AR="$AR cr"

boost

I was attempting to link boost_program_options using either the boost or boost-cpp anaconda packages; unlike most boost libraries, program_options requires compiling. This led to undefined symbols at the linking stage, which I think were due to an incompatible compiler (or compiler options) being used to build the dynamic libraries in the versions currently available on anaconda. This turned out to be the most difficult thing to fix, requiring me to compile boost as part of the recipe.

Rather than downloading and compiling everything, I followed the boost github examples and made a shallow clone, containing a full copy of just the boost library I’m using:

git clone --depth 1 https://github.com/boostorg/boost.git boost
cd boost
rmdir libs/program_options
git clone --depth 50 https://github.com/boostorg/program_options.git libs/program_options
git submodule update -q --init tools/boostdep
git submodule update -q --init tools/build
git submodule update -q --init tools/inspect

I then included this in the release tarball. A better way may be to use submodules so this is done automatically using --recursive.

This library needed to be built, but I did so in a work directory to avoid installing unexpected packages with the recipe. Following the conda-forge build.sh for boost-cpp:


pushd boost
python2 tools/boostdep/depinst/depinst.py program_options --include example
CXXFLAGS="${CXXFLAGS} -fPIC"
INCLUDE_PATH="${PREFIX}/include"
LIBRARY_PATH="${PREFIX}/lib"
LINKFLAGS="${LINKFLAGS} -L${LIBRARY_PATH}"
cat << EOF > tools/build/src/site-config.jam
using gcc : custom : ${CXX} ;
EOF
BOOST_BUILT="${SRC_DIR}/boost_built"
mkdir $BOOST_BUILT
./bootstrap.sh --prefix="${BOOST_BUILT}" --with-libraries=program_options --with-toolset=gcc
./b2 -q \
variant=release \
address-model="${ARCH}" \
architecture=x86 \
debug-symbols=off \
threading=multi \
runtime-link=shared \
link=static,shared \
toolset=gcc-custom \
include="${INCLUDE_PATH}" \
cxxflags="${CXXFLAGS}" \
linkflags="${LINKFLAGS}" \
--layout=system \
-j"${CPU_COUNT}" \
install
popd

The python2 line pulls in the header libraries required to compile program_options, which are not included in the shallow clone. The rest is the standard way to build and install boost, ensuring the same compiler flags as the other compiled code and using the conda compiler.

I then needed to link this boost library statically (leaving the rest dynamic), so I modified the make line as follows:

make CXXFLAGS="$CXXFLAGS" \
  SEER_LDLIBS="-L../gzstream -L${BOOST_BUILT}/lib -L/usr/local/hdf5/lib \
  -lhdf5 -lgzstream -lz -larmadillo -Wl,-Bstatic -lboost_program_options \
  -Wl,-Bdynamic -lopenblas -llapack -lpthread"

armadillo

The final trick was linking armadillo correctly. Confusingly, it built and linked ok, and tested ok locally, but on the bioconda CI I got undefined lapack symbols at runtime:


seer: symbol lookup error: seer: undefined symbol: wrapper_dgbsv_

This was due to armadillo’s wrapper around its includes, which links in the versions of blas/openblas and lapack defined at the time armadillo itself was compiled; I think these must be slightly different from what is now included with the armadillo package dependencies on conda. It was easy enough to fix: use a compiler flag to turn the wrapper off and link the libraries manually:

make CPPFLAGS="${CPPFLAGS} -DARMA_DONT_USE_WRAPPER" \
  LDFLAGS="${LDFLAGS} -larmadillo -lopenblas -llapack"

After all of that, it finally worked!


I was getting the following error when attempting to run conda-build on a package, from within a conda env:

Traceback (most recent call last):
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/bin/conda-build", line 7, in <module>
from conda_build.cli.main_build import main
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 18, in <module>
import conda_build.api as api
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/api.py", line 22, in <module>
from conda_build.config import Config, get_or_merge_config, DEFAULT_PREFIX_LENGTH as _prefix_length
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/config.py", line 17, in <module>
from .variants import get_default_variant
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/variants.py", line 15, in <module>
from conda_build.utils import ensure_list, trim_empty_keys, get_logger
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/conda_build/utils.py", line 10, in <module>
import libarchive
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/__init__.py", line 1, in <module>
from .entry import ArchiveEntry
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/entry.py", line 6, in <module>
from . import ffi
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/site-packages/libarchive/ffi.py", line 27, in <module>
libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
return self._dlltype(name)
File "/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive: cannot open shared object file: No such file or directory

The /lib directory does contain libarchive, both as a dynamic (.so) and static (.a) library. There turned out to be two relevant environment variables:

LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive
_CONDA_SET_LIBARCHIVE=LIBARCHIVE

A workaround is then to run

export LIBARCHIVE=/nfs/users/nfs_j/jl11/pathogen_nfs/large_installations/miniconda3/envs/conda_py36/lib/libarchive.so

There is probably a proper reason why this has happened and a permanent solution to the issue, but this works for now.

In our software PopPUNK we draw a plot of a Gaussian mixture model, using both the implementation and the excellent plotting example in the scikit-learn documentation:

[Figure: Gaussian mixture model with mixture components plotted as ellipses]

My input is 2D distances, which I first normalise so that each axis lies between 0 and 1 (in the snippet below this is simply a division by the per-axis maximum), which helps standardise methods across various parts of the code. This is fine if you then create these plots in the scaled space, and as it is a simple linear scaling it is generally trivial to convert back into the original co-ordinates:

import numpy as np
import matplotlib.pyplot as plt

# scale X so each axis lies between 0 and 1
scale = np.amax(X, axis=0)
scaled_X = X / scale
# plot points of component i in the scaled space
plt.scatter(scaled_X[Y == i, 0], scaled_X[Y == i, 1], .4, color=color)
# plot the same points back in the original co-ordinates
plt.scatter((scaled_X * scale)[Y == i, 0], (scaled_X * scale)[Y == i, 1], .4, color=color)

The only thing that wasn’t obvious was how to scale the covariances, which are used to draw the ellipses. I knew they needed to be multiplied by the scale twice as they are variances (squared), and had a vague memory of something like xAx^T from transforming ellipses/conic sections in one of my first year maths courses. Happily that was enough to do a search, turning up an explanation on stackoverflow which confirmed this vague memory: https://math.stackexchange.com/questions/1184523/scaling-a-covariance-matrix
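
Written out, the rule from that link, in the notation of the code below (covar is the covariance fitted in the scaled space, and scale holds the per-axis scale factors), is:

\Sigma_{\mathrm{original}} = S \, \Sigma_{\mathrm{fitted}} \, S^{\top}, \qquad S = \operatorname{diag}(\mathrm{scale})

so scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T) is the covariance transformed back into the original co-ordinates.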

Here is the code for making the plot, incorporating the linear scaling:

import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold', 'darkorange'])

fig = plt.figure(figsize=(11, 8), dpi=160, facecolor='w', edgecolor='k')
splot = plt.subplot(1, 1, 1)
for i, (mean, covar, color) in enumerate(zip(means, covariances, color_iter)):
    # transform the covariance back into the original co-ordinates
    scaled_covar = np.matmul(np.matmul(np.diag(scale), covar), np.diag(scale).T)
    v, w = np.linalg.eigh(scaled_covar)
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    u = w[0] / np.linalg.norm(w[0])
    # as the DP will not use every component it has access to
    # unless it needs it, we shouldn't plot the redundant
    # components.
    if not np.any(Y == i):
        continue
    plt.scatter(X[Y == i, 0], X[Y == i, 1], .4, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan(u[1] / u[0])
    angle = 180. * angle / np.pi  # convert to degrees
    ell = mpl.patches.Ellipse(mean * scale, v[0], v[1], 180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(0.5)
    splot.add_artist(ell)

plt.title(title)
plt.savefig("ellipses.png")
plt.close()
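
For completeness, the means, covariances, Y, scale and title variables used above come from the earlier scaling step and a mixture model fitted in the scaled space. Purely as an illustrative sketch (this is not the PopPUNK code; it assumes scikit-learn’s BayesianGaussianMixture as the Dirichlet process implementation and an arbitrary number of components), they could be set up like this:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# X is an (n_samples, 2) array of distances
scale = np.amax(X, axis=0)
scaled_X = X / scale

# fit a Dirichlet process mixture in the scaled space
dpgmm = BayesianGaussianMixture(n_components=5, covariance_type='full').fit(scaled_X)
means = dpgmm.means_              # component means, in scaled co-ordinates
covariances = dpgmm.covariances_  # full covariance matrices, in scaled co-ordinates
Y = dpgmm.predict(scaled_X)       # component assignment for each point
title = 'Gaussian mixture model'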

I recently read a pre-print from the Veening lab where they had reconstructed various (22 total) physiological conditions in vitro and then measured expression levels with RNA-seq. I thought it was a great bit of research, and would encourage you to read it here if you’re interested: https://doi.org/10.1101/283739

They’ve also done a really good job with data availability, having released a browser for their data (PneumoExpress), and they have put their raw data on zenodo.

I was interested in trying to replicate their expression module analysis, which was performed using WGCNA. I first downloaded the co-expression matrix, which is split across five CSV files in the zenodo archive, then ran WGCNA essentially as suggested in its tutorial. Finally, I used the annotations from the lab’s D39V genome to label the final expression modules.

The expression modules look sensible to me, with the comE master regulator splitting the expression of the competence pathway and associated genes as expected. The modules I predicted can be found here.

Thanks to the authors for making their data so accessible! If you find this useful, note that the authors themselves also analysed their data with WGCNA, probably in a more optimised and better error-checked way than I have, so they most likely have better clusters.

Here is a gist of the code I used:

I saw this phylogenetics package today, phyx: https://github.com/FePhyFoFum/phyx

To install without admin rights/sudo I needed to do the following (my software is installed in my home ~/software, rather than e.g. /usr, /usr/local):

Compile armadillo as follows

cmake . -DCMAKE_INSTALL_PREFIX=$HOME/software
make
make install

Compile nlopt as follows

./configure --with-cxx --without-octave --without-matlab --prefix=$HOME/software
make
make install

Compile phyx as follows (slightly hacky, maybe there’s a ‘proper’ way)

./configure --prefix=$HOME/software

change line 11 of the Makefile (CPP_LIBS) to add the library path:

CPP_LIBS = -L$(HOME)/software/lib -llapack -lblas -lpthread -lm

change line 23 of the Makefile (OPT_FLAGS) to add the include path/CPPFLAGS:

OPT_FLAGS := -O3 -std=c++11 -fopenmp -I$(HOME)/software/include

then you can run

make 
make install

Trying to install PyVCF under a python (3) virtual environment gave me the following error:

(venv)johnlees@hpc:~$ pip install pyvcf
 Downloading/unpacking pyvcf
 Downloading PyVCF-0.6.8.linux-x86_64.tar.gz (1.1MB): 1.1MB downloaded
 Saved /tmp/downloadcache/PyVCF-0.6.8.linux-x86_64.tar.gz
 Running setup.py egg_info for package pyvcf
 Traceback (most recent call last):
 File "<string>", line 16, in <module>
 FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):

File "<string>", line 16, in <module>

FileNotFoundError: [Errno 2] No such file or directory: '~/venv/build/pyvcf/setup.py'

The solution was to upgrade setuptools:

pip install --upgrade setuptools

I went from 0.6.34 to 36.8.0 (apparently!)

Marco Galardini and I have recently reimplemented the bacterial GWAS software SEER in python. As part of this I rewrote my C++ code for Firth regression in python. Firth regression gives better estimates when the data in a logistic regression are separable or close to separable (i.e. when a chi-squared contingency table has small entries).

I found that although there is an R implementation (logistf), I couldn’t find an equivalent in another language, or in python’s statsmodels. Here is a gist with my python functions and a skeleton of how to use them and calculate p-values, in case anyone would like to use this in future without having to write the optimiser themselves.
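
For readers who just want the idea rather than the gist, here is a minimal sketch of the Firth-penalised fit. This is not the pyseer or gist code: firth_logistic and the toy data are my own illustrative names, X is assumed to already include an intercept column, and the p-value calculation (via a penalised likelihood-ratio test) is not shown.

import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-6):
    """Logistic regression with Firth's penalty: maximise l(beta) + 0.5*log|X'WX|."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = pi * (1.0 - pi)                    # diagonal of the weight matrix
        XWX = X.T @ (X * W[:, None])           # Fisher information X' W X
        XWX_inv = np.linalg.inv(XWX)
        # diagonal of the hat matrix H = W^1/2 X (X' W X)^-1 X' W^1/2
        h = np.einsum('ij,jk,ik->i', X * W[:, None], XWX_inv, X)
        # Firth-modified score: X' (y - pi + h*(1/2 - pi))
        U = X.T @ (y - pi + h * (0.5 - pi))
        step = XWX_inv @ U                     # Fisher scoring step
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# perfectly separated toy data: the ordinary MLE diverges, the Firth estimate stays finite
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0., 0., 0., 1., 1., 1.])
print(firth_logistic(X, y))

A production implementation would add step-halving for stability and a penalised likelihood-ratio test for the p-values, as logistf does.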