I want to count the number of unique patterns in a vcf file. First I convert it to text with bcftools query:
bcftools query -f '[%GT]\n' vcf_in.vcf.gz > patterns.txt
The resulting patterns.txt is about 100Gb. The best way I found to count the unique patterns in this was with the following command:
LC_ALL=C sort -u --parallel=4 -S 990M -T ~/tmp_sort_files patterns.txt | wc -l
This used 1063Mb RAM, took 1521s and used a maximum of around 75Gb tmp space on my home (as the /tmp drive on the cluster ran out of space).
With thanks to http://unix.stackexchange.com/questions/120096/how-to-sort-big-files
I recently upgraded from OS X 10.10 to 10.11. This has upgraded the version of the gfortran dynamic library from 2 to 3 (in /Library/Frameworks/R.framework/Resources/lib), which in turn causes problems in various R packages (msm, ape).
For those which give an error along the lines of
unable to load shared object
the solution seems to be to use install.packages recursively. Use it on the package that failed. If a dependency fails, use it on that too. Then restart R.
Some packages requiring compilation which link libgfortran (-lgfortran) fail, as the linker line does not give the correct directory through -L. I also have gfortran installed as part of gcc through homebrew, at /Users/john/homebrew/lib/gcc/4.9 (to do this, use ‘brew install gcc’).
Using this, add the line
to the file ~/.R/Makevars. This should work, as long as when you load the library you have this directory either indexed through OS X’s equivalent of ldconfig (if there is one?), or it is in LD_LIBRARY_PATH.
PEER (probabilistic estimation of expression residuals) is a tool to determine hidden factors from expression data, for use in genetic association studies such as eQTL mapping.
The method is first discussed in a 2010 PLOS Comp Bio paper:
and a guide to its applications and use in a 2012 Nature Protocols paper:
To install a command line version of the tool, you can clone it from the github page
git clone https://github.com/PMBio/peer
When installing, it won’t install the executable binary peertool by default, nor will it use a user directory as the install location (though the latter is addressed at the end of documentation). To install these, use the following commands:
cd peer && mkdir build && cd build
cmake -D CMAKE_INSTALL_PREFIX=~/software -D BUILD_PEERTOOL=1 ./..
Which will install peertool to ~/software/bin, which you can put on your path.
I have been using zsh within tmux, and found upon reattaching tmux X forwarding wasn’t working. For example when trying to launch gvim I’d get the error:
E233: cannot open display
The problem, a quick google determined, is that each time I ssh into my sever a new $DISPLAY environment variable is set. When I run ‘tmux attach’ the new $DISPLAY variable is passed through (see http://stackoverflow.com/questions/8645053/how-do-i-start-tmux-with-my-current-environment) so any new windows within tmux will have the correct environment. However the environment of any existing windows can’t be changed, causing the problem.
The best solution I found was proposed by Alex Teichman here: http://alexteichman.com/octo/blog/2014/01/01/x11-forwarding-and-terminal-multiplexers/
However I had two problems:
- It doesn’t seem to work with zsh rather than bash. I guess this is due to the behaviour of preexec() being different, but I couldn’t quickly work this out from the zsh manual.
- It maybe felt slightly inelegant to update $DISPLAY every single time a command is run
My solution is pretty similar. I add the following to ~/.zshrc:
echo $DISPLAY > ~/.display.txt
alias up_disp='export DISPLAY=`cat ~/.display.txt`'
This writes the correct $DISPLAY variable to a hidden file when a session is started (i.e. when I connect to the server). When I find forwarding isn’t working, I just run up_disp in that window.
Not the perfect solution, but it works ok for me
To assemble illumina sequence data I am currently trialling assembly with cortex. To be able to use their Perl script to automate the pipeline between reads in and variant calls requires vcftools and stampy to be installed, and you provide the installation paths as input to the script.
However when running make using the default downloaded stampy makefile I got the following error from g++ (v4.8.1):
g++ `python2.7-config --ldflags` -pthread -shared -Wl build/linux-x86_64-2.7-ucs4/pyx/maptools.o build/linux-x86_64-2.7-ucs4/c/map
utils.o build/linux-x86_64-2.7-ucs4/c/alignutils.o build/linux-x86_64-2.7-ucs4/readalign.o build/linux-x86_64-2.7-ucs4/algebras.o build/linux-x86_64-2.7-ucs4/frontend.o -o maptools.so
g++: error: unrecognized command line option ‘-Wl’
The solution was straightforward to find, as ever thanks to stackoverflow: http://stackoverflow.com/questions/21305309/g-doesnt-recognize-the-option-wl
All you need to do is edit lines 44 and 46 in the makefile, replacing the space after -Wl with a comma:
43 ifeq ($(platform),linux-x86_64)
44 g++ `$(python)-config --ldflags` -pthread -shared -Wl,$(objs) -o maptools.so
46 g++ `$(python)-config --ldflags` -pthread -dynamiclib -Wl,$(objs) -o maptools.so
As you can see from the surrounding if statement, this is only an issue on 64-bit linux platforms
I also tried compiling cortex with icc, but the compilation failed after a lot of errors. Rather than pursuing this further, I used gcc and only got warnings of unused variables in compilation