Impute your whole genome from 23andme data

23andme is a service which types 602352 sites on your chromosomal DNA and your mtDNA. It is possible, by comparing to a reference panel in which all sites have been typed, to impute (fill in statistically) the missing sites and thus get an ‘estimation’ of your whole genome.

The piece of software impute2 written by B. N. Howie, P. Donnelly, and J. Marchini gives good accuracy when using the 1000 Genomes Project as a reference. However, there is some difficulty in providing the data in the right input format, using all the correct options and interpreting the output from this piece of software.

EDIT: As pointed out by lassefolkersen in the comments, this has now been nicely implemented at impute.me

I have written a tool to allow people with a small amount of computational experience (but not necessarily any biological/bioinformatics knowledge) to run impute2 on their 23andme data to get their whole genome output. It can be found at my github: https://github.com/johnlees/23andme-impute

To use this tool, you will need to do the following steps:

  1. Download your ‘raw data’ from the 23andme site. This is a file named something like genome_name_full
  2. Download the impute2 software from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download and follow their instructions to install it
  3. Put impute2 on the path, i.e. run (with the correct path for where you extracted impute2):
    echo 'export PATH=$PATH:/path/to/impute2' >> ~/.bashrc
  4. Download the 1000 Genomes reference data, which can be found on the impute2 website here:
    https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
  5. Extract this data by running:
    gunzip ALL_1000G_phase1integrated_v3_impute.tgz
    tar xf ALL_1000G_phase1integrated_v3_impute.tar
    (you will then probably want to delete the original, unextracted archive file as it is quite large)
  6. Download my code by running:
    git clone https://github.com/johnlees/23andme-impute
  7. Run ./impute_genome.pl to impute your whole genome!

The options required as input for impute_genome.pl should be reasonably straightforward: run with -h to see them, or look at the README.md on github.

As the analysis will take a lot of resources, I recommend against using the run command. I think --print or --write will be best for most people, and you can then run each job one at a time or in parallel if you have access to a cluster.
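
Running the jobs one at a time can be scripted. The following is a minimal Python sketch, assuming the job scripts have been written into a single directory and end in .sh; the function name, directory name and file pattern are placeholders for illustration, not something the tool guarantees.

```python
import subprocess
from pathlib import Path


def run_jobs_sequentially(job_dir, pattern="*.sh", runner=("sh",)):
    """Run each job script in sorted order and return the names that failed.

    job_dir, pattern and runner are illustrative assumptions: adjust them to
    wherever your job scripts actually live and how they should be launched.
    """
    failed = []
    for script in sorted(Path(job_dir).glob(pattern)):
        # Each job runs to completion before the next one starts.
        result = subprocess.run([*runner, str(script)])
        if result.returncode != 0:
            failed.append(script.name)
    return failed
```

For example, `failed = run_jobs_sequentially("imputed_jobs")` (with a hypothetical directory name) would leave you with a list of scripts to inspect and re-run.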

If you have any problems with this, please leave a message in the comments and I’ll try my best to get back to you.

19 comments
  1. J1 said:

    I really want my 23andme results imputed. Yet, there appears to be no commercial service that will do this.

    I have followed your recipe and I was able to work through most of it, though I have still not been able to do the imputation.
    Might you help?

    Step 1,2,3 check.

    Step 4. The Impute2 website lists this url http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_9-12-13.html
    for the latest 1,000 genomes release.

    Step 5 I unzipped the files from Impute2 with 7Zip. The resulting haplotype files were massive (in the gigabytes).
    I am not sure if my PC could handle processing such a large file. Can this imputation be done on a PC? Can it be done
    in manageable chunks?

    This is very frustrating because the Impute2 example worked right away. Is there a way to split the haplotype files to make them more manageable?

    I have not used the command line before. Which command prompt are you using? I have been variously trying the DOS command prompt, the Powershell command prompt, GitHub, Cygwin, and Activeperl.

    I have also downloaded Plink. Plink has a function that converts 23andme data to the Impute2 format.
    The only problem is with homozygous calls (e.g. CC): the program does not know what the other allele is, so it reports a 0. Otherwise the program gives the correct 3 bit code (e.g. 001).

    I would like to use your 23andme.pm program to also make a g file. What command and what command prompt should I use to do this(Activeperl?) ?

    Using the Impute2 program directly might work better for me. There are several options suggested (phasing etc.). What Impute2 commands would you suggest? Also showing which files from the download correspond to the Impute2 file types would be helpful. I have not been able to obtain the G Strand file. Do you have a url for it?

    It might just be easier to send the data off to a cluster somewhere. Are there commercial services ( on the cloud etc.) that would do the imputation? Perhaps a genomics centre (maybe oxford?) ?

    I am very excited about doing the imputation. Any help would be appreciated.

    • jofunu6 said:

      Hi,

      Trying to take your points in order:
      Step 4: That could well be right. With your version it looks like they’ve phased the reference panel with SHAPEIT2, which they show increases imputation performance. Hopefully the same files are present (including X chr etc), though you’d need to change the reference prefix variable in 23andme.pm

      Step 5: Yes! It should be able to, as long as there’s enough disk space for the files. You’ll also need a reasonable amount of RAM, I guess a few Gb but I haven’t checked yet.
      The imputation is run in overlapping blocks of 5Mbases so you don’t have to split the haplotype files up. This works out to around 1000 jobs for the whole genome (which is around 3Gbases). I haven’t tried to run all of these jobs on a desktop/laptop myself, but I ran one as a test and it took about 30 seconds on an i7 CPU; if you extrapolate to all jobs that works out to a total run time of around 8 to 9 hours.
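
The block and run-time arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, the intervals are plain non-overlapping 5 Mbase cores (how the script handles the overlap is not shown), and the job count and per-job time are the figures quoted above.

```python
def chunk_intervals(chrom_length, chunk_size=5_000_000):
    """Split a chromosome into 1-based inclusive (start, end) core intervals.

    Overlap between neighbouring blocks is deliberately left out of this
    sketch; only the core intervals are produced.
    """
    intervals = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        intervals.append((start, end))
        start = end + 1
    return intervals


def estimated_hours(n_jobs, seconds_per_job):
    """Total wall-clock time in hours if the jobs run one after another."""
    return n_jobs * seconds_per_job / 3600
```

With the figures quoted above, `estimated_hours(1000, 30)` comes to a little over 8 hours, which is where the 8 to 9 hour estimate comes from.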

      Command line: If you’re on Windows I’d recommend Cygwin. If you also install activeperl you should be able to run my perl scripts

      Didn’t know Plink had 23andme input now, that’s cool! The first thing my script does is essentially the same, and a similar problem is faced with homozygous calls. What I had been doing was reporting 100 and arbitrarily setting the non-reference allele – this actually causes problems with the imputation later on (though I guess no worse than having a null call) and I need to make a change to make sure if the site is in the reference panel that the same legend is used i.e. get the major and minor alleles from there and write them into the input .gen files.
      Another thing I am slightly concerned with is performance with just one input sample – this is something I hope to look into in the future.

      Impute2 is pretty complicated to use. My recommended commands for 23andme data are output if you run my script with the --print option. You don’t need a -strand_g with 23andme data as all calls are on the positive strand (see https://customercare.23andme.com/entries/21272593-Which-DNA-strand-does-23andMe-report-for-SNP-genotypes-) which is the default assumed by impute2 without the file

      Computing the data: I imagine commercial services are available, but I would get single jobs working on your own machine first, as if you send them off they may just come back with errors or terrible accuracy. I also think that you’ll probably find it ok to leave all the jobs running on your computer overnight once you get it working.

      Best of luck! Hopefully I’ll have a chance to look into this again and fix bugs in the near future

      • J1 said:

        Thank you very much for your response. I have not found very much helpful information on the internet on how to do this imputation

        Plink 1.9 can be downloaded from https://www.cog-genomics.org/plink2/

        If you follow the 23andme text link, they show how a 23andme file can be read into Plink. They also mention how you can use --list-23-indels to minimize the impact of missing calls. (I was not able to figure out how to make the --list call.) Above this on the webpage they talk about putting the file into Oxford format. I cannot remember exactly how I did the conversion, though I think I used a recode command.

        On https://www.cog-genomics.org/plink2/formats they refer to a .frq format. This would be ideal to call in the program to find out what the missing alleles are in the homozygous calls. There must be a file somewhere that has the allele calls for the 23andme data.

        I am making progress toward doing the imputation. One problem that I was able to figure out was with the haplotype files. These files are huge (3 GB). When I opened them in Wordpad they seemed to crash my computer. It seemed that my computer had been only able to open 3 pages of the file. It was only when I dragged down the screen that I realized that the file had been opened. (It was just that the computer displayed the file gradually as I moved the screen down.) This is great. I did not know whether my computer could handle such large files. I think now the imputation is doable.

        However, when I tried prephasing in Impute2 with the 1000 Genomes M file and the 23andme file converted to a Gfile in Impute2 format using Plink, my computer crashed. Impute2 made it to the first of the 30 iterations of the MCMC run and then crashed. I think the problem is with the 23andme file converted to a Gfile in Impute2 format.

        This is what my Gfile looks like:
        22 rs123 20000 0 G 0 0 1
        22 rs124 20100 0 A 0 0 1
        22 rs130 20500 T C 0 1 0
        22 rs135 20700 0 A 0 0 1

        The Impute2 example Gfile looks like:
        SNP_A-218 rs48211 20303 A G 0 0 1 0 1 0 0 0 1
        SNP_A-197 rs85184 20333 G T 0 1 0 0 0 1 1 0 0

        It is possible that because my Gfile does not have SNP identifiers as the Impute2 Gfiles do and that my Gfile is lacking named alleles for the homozygous calls while Impute2 Gfiles include both alleles that Impute2 is unable to read the file properly.

        I mentioned that I have downloaded Activeperl. If you could tell me what exact command I should use to get output for a Gfile formatted file from your 23andme.pm program then the imputation might run smoothly. If I have to do 1,000 runs of Impute2 for an imputation, it is worth it. As soon as I have properly formatted data, I should have no further problems with the imputation. I have been able to run almost all the example runs with Impute2 without complication.

        Alternatively, if you are able to use the Plink 1.9 program to convert the 23andme file into the Impute2 format (Oxford style) and this runs imputations properly on your computer, then I would be very interested to know what exact commands you used in Plink.

        I think I am very close to having a successful run using my data with Impute2. Your comments and suggestions would be greatly appreciated.

      • jofunu6 said:

        Regarding Plink:
        I imagine something like
        plink --23file --list-23-indels --recode oxford --out
        should do it. If that doesn’t work do the 23 read in and output to a plink .bed, then recode that (i.e. do it in two steps). Just use the SNP file, not the indels

        To use my script, in the folder you’ve downloaded it run:
        perl impute_genome.pl -i 23andme_rawdata.txt -o imputed --write
        This will make .gen files, and write impute2 commands to run to .sh files.

        I imagine that impute2 uses only the position and allele values rather than id to check whether it thinks a site is the same, but I might be wrong. In any case the solution should be to take these values from the .legend file in the 1000GP data, and then map the 23andme data onto it. This is something I will implement in my code at some point (soon), but for now I can’t think of a way to do it with existing software. If it works with Plink you shouldn’t see any (or only a small number of) type 2 SNPs when you run impute2
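
The mapping described above (take the two alleles for a site from the reference legend, then write the 23andme call against them) could look roughly like the Python sketch below. It is not the fix in impute_genome.pl. The function names are made up for illustration, and it assumes the legend has already been read into a dictionary keyed by position with an (a0, a1) allele pair as the value.

```python
def genotype_to_gen_probs(genotype, a0, a1):
    """Turn a two-letter call such as 'GG' into hard-call .gen probabilities.

    Returns (P(a0a0), P(a0a1), P(a1a1)) as 0/1 values, or None when the call
    cannot be expressed with the two legend alleles (no-calls, other bases).
    """
    if len(genotype) != 2 or any(base not in (a0, a1) for base in genotype):
        return None
    n_a1 = sum(base == a1 for base in genotype)
    return tuple(1 if n_a1 == k else 0 for k in range(3))


def gen_line(chrom, rsid, position, genotype, legend):
    """Build one single-sample .gen line, or None if the site cannot be mapped.

    legend is an assumed in-memory structure: {position: (a0, a1)}.
    """
    alleles = legend.get(position)
    if alleles is None:
        return None
    probs = genotype_to_gen_probs(genotype, *alleles)
    if probs is None:
        return None
    a0, a1 = alleles
    return f"{chrom} {rsid} {position} {a0} {a1} {probs[0]} {probs[1]} {probs[2]}"
```

With a made-up legend entry `{20000: ("A", "G")}`, the homozygous call GG at position 20000 is written with both alleles named, as `22 rs123 20000 A G 0 0 1`, rather than with a 0 in place of the unseen allele.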

        If you know some perl you could always fork my code on github and try writing the fix!

        I do remember when I tried it myself the imputation crashing after one iteration, possibly because of the above issue. If that isn’t the problem, then it’s a question best asked to the impute2 authors here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OXSTATGEN

  2. J1 said:

    I am so happy that I have finally been able to run Impute2 on a 23andme file. I have already generated almost half a million imputed SNPs!

    It turns out that Impute2 requires at least 2 people in the g file to run without crashing.
    Plink 1.9 allows you to read in 2 23andme files, merge them with --merge, and then output in Oxford format.

    When I inputted 2 different person’s 23andme files into Plink, merged and recoded, Impute2 ran successfully.

    However, I am not sure whether the results are accurate, because I also ran Impute2 using 1 person with the genotype recorded twice. When I did this, the imputed genotype probabilities changed.

    For example, this was a result from the Impute2 software when I inputted a file with a genotype reported twice for one individual. In the output, whenever the SNP was on the 23andme gene chip, Impute2 reported the call with a 22 in the first column. It then made an imputation for the call in the next line. 96.8% seems reasonable.

    22 rs624100 21001422 0 T 0 0 1 0 0 1
    --- rs624100 21001422 G T 0 0.032 0.968 0 0.032 0.968

    However, when I merged the 23andme files from 2 different people in Plink, recoded to Oxford format and ran with Impute2, the result was quite different. This time the first genotype is for the individual with the genotype imputed above (twice). For some strange reason, Impute2 is now assigning a 38.5% probability for the TT genotype. However, this SNP is on the 23andme gene chip and it was called TT. (The chip has an accuracy of 99.99%).

    --- rs624100 21001422 G T 0.089 0.526 0.385 0.091 0.533 0.376

    I do not understand why changing the g file would change the resulting imputation. Impute2 should be imputing using the 1000 Genomes reference panel, not the g file.

  3. Nice work!

    Yes that problem is a little strange. Of course there’s a random element with the MCMC process, so if you’re comparing runs you might want to specify the same seed manually, but I doubt that accounts for that much difference.
    Also even with 99.99% accuracy, an average of 60 sites will be incorrectly called so it might be worth checking a range of sites (but again I doubt this is the problem).

    I’m not sure imputation is independent of the -g file, I think the other records are also used in the integration (which I guess is particularly useful for sites called as missing) as well as the reference panel — see the image at the top of https://mathgen.stats.ox.ac.uk/impute/impute_v2.html or you might also like to read the papers DOI: 10.1371/journal.pgen.1000529 and doi:10.1038/ng2088
    You might also want to play around with some more options, for example -pgs and -fill-holes, or look at the discordance you report across the whole imputation.

    If the T has a low MAF in the population then the imputation will drop in quality. Impute2 automatically masks some of the typed SNPs, imputes them, and compares them to the known calls to estimate accuracy. There are some plots of the genotype concordance you can expect across a range of MAF on the impute2 website, and you’ll see for rare variants the imputation accuracy is quite low (as one would expect).

  4. J1 said:

    I am so happy that this is working out. I have wanted to do this for years, though I was very afraid of command line programming. It turned out to be much easier than I had anticipated.

    I think the best way to resolve the imputation problem that I had is to merge 23andme files to my file and then run it with Impute2. I have already downloaded some of these files from OpenSNP. { Impute2 is probably quite accurate when used as it was intended (i.e. with a sizable number in the g file). }

    It would be interesting if the Impute2 results could be used to find errors in the 23andme file. I am not sure whether or not 23andme does this check. Namely, if Impute2 determined that a certain haplotype was present and that this haplotype could not contain the genotyped SNP from the 23andme file, then the 23andme genotype could be corrected.

    It will be amazing when I can run the Minor Allele Program on my full genome! [The 23andme file includes approximately 25,000 rare SNPs ( this is because the gene chip needs to include a certain number of SNPs per million base pairs). The rare SNPs (especially the missense ones) can be very informative.] When I ran the Minor Allele Program on my 23andme file, I obtained some very interesting results. If the Minor Allele Program could be reprogrammed to include the Impute2 output, then there would likely be some extremely interesting results.

    { From the last post: The MAF for rs624100 is 47%. The figure from http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_9-12-13.html shows the aggregate r2 for this frequency in the high .90s.

    There were quite a few other strange results from my initial imputation runs.

    For example (results from the 2 person file),

    rs111622021 21000850 C T 0.998 0.002 0 0.997 0.003 0

    dbSNP lists the MAF as 0.0028. However, Impute2 is calling the first genotype CC with 99.8% certainty.
    (Is that what the 0.998 means?) The Impute2 site only claims a .5 aggregate r2 for SNPs with .2% allele frequency.

    Also, the first set of SNPs are from the “double” genotype file. These results look too good to be true (they are almost all 1 0 0).

    22 rs178040 21141734 0 C 0 0 1 0 0 1
    --- rs178040 21141734 C A 1 0 0 1 0 0
    --- rs145392423 21141843 G A 0.966 0.034 0 0.966 0.034 0
    --- rs192316527 21141926 C A 1 0 0 1 0 0
    --- rs184379047 21141963 A G 1 0 0 1 0 0
    --- rs79903906 21142064 T C 1 0 0 1 0 0
    --- rs190160446 21142206 G A 1 0 0 1 0 0
    --- rs181865789 21142227 C G 1 0 0 1 0 0
    --- rs186479088 21142278 C T 1 0 0 1 0 0
    --- rs74869454 21142287 G T 1 0 0 1 0 0
    --- rs149199470 21142301 T C 1 0 0 1 0 0

    This is the second file, using 2 people. The first genotypes correspond to the genotypes above. These results seem more reasonable as they are not all 1 0 0. However, all of the highest probability genotypes reported here are the same as the corresponding genotypes above.

    --- rs178040 21141734 C A 0.971 0.029 0 0.973 0.027 0
    --- rs145392423 21141843 G A 0.993 0.007 0 0.992 0.008 0
    --- rs192316527 21141926 C A 1 0 0 1 0 0
    --- rs184379047 21141963 A G 0.998 0.002 0 0.998 0.002 0
    --- rs79903906 21142064 T C 0.985 0.015 0 0.985 0.015 0
    --- rs190160446 21142206 G A 1 0 0 1 0 0
    --- rs181865789 21142227 C G 0.999 0.001 0 0.998 0.002 0
    --- rs186479088 21142278 C T 1 0 0 1 0 0
    --- rs74869454 21142287 G T 0.955 0.044 0 0.957 0.043 0
    --- rs149199470 21142301 T C 0.995 0.005 0 0.994 0.006 0 }

    Thank you very much for helping me with the imputation. Imputing with Impute2 will give me 40 million SNPs!

    • I’m glad that you’re getting this to work, hopefully at some point soon I’ll write another post summing all this up. And it’s great that you seem totally at ease using the command line!

      Some comments on your last post:
      You can ‘correct’ the 23andme genotyping using the -pgs option on impute2:
      ‘”Predict Genotyped SNPs”: Tells the program to replace the input genotypes from the -g file with imputed genotypes in the -o file (applies to Type 2 SNPs only).’

      The Minor Allele Program I think would be able to use the whole range of SNPs, it would just require a larger input reference MAF/dbSNP file than the one he provides.

      Where the imputed site is 0.998 CC at a very low MAF: the definition of r^2 is in the paper http://www.g3journal.org/content/1/6/457.full.pdf

      ‘we assessed accuracy at each SNP as the squared Pearson correlation (R^2) between the masked genotypes, which take values in {0,1,2}, and the imputed allele dosages (also known as posterior mean genotypes), which take values in [0,2]. The allele dosage is defined for each genotype G as $\sum_{x=0}^{2} x \cdot \Pr(G = x)$, where Pr(G = x) is a marginal posterior probability generated by an imputation method. Once the correlation R^2 had been measured for every masked SNP, we calculated the mean R^2 across SNPs and reported this as a scalar summary of imputation accuracy in that cross-validation experiment. In rare situations, the correlation at a SNP was undefined because the imputation produced identical allele dosages for all individuals. In these cases, we set R^2 = 0 to capture the intuition that there would be no power to detect an effect at such SNPs.’

      So we expect it to call the most common genotype, as it has done. It is just that if you do actually carry the rare variant, it is unlikely to be imputed correctly. In other words, r^2 is a measure of the power to predict variants, because the reference allele is the default call.
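
The definition quoted above can be written out directly. The Python sketch below is only an illustration of the quoted formula, with function names chosen here; it returns 0 whenever either set of values is constant, which covers the quoted convention for identical dosages and, as an extra assumption, the case where every masked genotype is the same.

```python
def allele_dosage(p0, p1, p2):
    """Posterior mean genotype: sum over x in {0, 1, 2} of x * Pr(G = x)."""
    return 0 * p0 + 1 * p1 + 2 * p2


def squared_pearson(genotypes, dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    # Constant input makes the correlation undefined; report 0 in that case.
    if len(set(genotypes)) <= 1 or len(set(dosages)) <= 1:
        return 0.0
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_d = sum(dosages) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_dd = sum((d - mean_d) ** 2 for d in dosages)
    s_gd = sum((g - mean_g) * (d - mean_d) for g, d in zip(genotypes, dosages))
    return (s_gd * s_gd) / (s_gg * s_dd)
```

For the triplet 0 0.032 0.968 reported earlier in this thread, `allele_dosage(0, 0.032, 0.968)` is 1.968. If every individual is given the same dosage at a rare site, `squared_pearson` returns 0 even though most of the individual calls are right, which is the sense in which r^2 measures power to predict the variant rather than the fraction of correct calls.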

      Finally, the results from replicating the same individual in the -g file. Yes, this looks wrong to me. I expect that having two people with exactly the same calls in the input reinforces the haplotype they are estimated to have, when in fact this data isn’t real. A cool thing would be if 23andme ran imputation as part of their standard data analysis, and then we could see what it looks like running many samples at once through the input.

  5. J1 said:

    My concordance rates for the imputation using my 23andme file along with 10 23andme files from OpenSNP were between 20 and 70%. These concordances are terrible.

    Did you run impute2 and obtain concordances in the 90+%?

    If so, what setup did you use?
    Whom did you include in the g file?
    What impute2 commands did you use?

    • jofunu6 said:

      That’s not so dissimilar from what I got when I tried on 23andme data I’m afraid. Problems could be due to the number and choice of SNPs typed on 23andme, and also the small number of samples being imputed.
      Also if the data being imputed isn’t from one of the reference populations it will do worse.

      I’m hoping to go back to imputation in the next month, and I will post here if I manage to get better concordance rates.
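
For anyone wanting to check concordance by hand, one simple definition is the fraction of typed sites where the most probable imputed genotype equals the typed call. The Python sketch below uses that definition; it is an assumption for illustration and may not match how the figures quoted in this thread were produced.

```python
def best_guess(probs):
    """Index (0, 1 or 2 copies of the second allele) with the highest probability."""
    return max(range(3), key=lambda k: probs[k])


def concordance(typed, imputed_probs):
    """Fraction of sites where the best-guess imputed genotype equals the typed one.

    typed: list of genotypes coded 0/1/2 (copies of the second allele).
    imputed_probs: list of (p0, p1, p2) triplets in the same site order.
    """
    if not typed or len(typed) != len(imputed_probs):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(best_guess(p) == g for g, p in zip(typed, imputed_probs))
    return matches / len(typed)
```

As a small example, take rs624100 from earlier in the thread (typed TT, so coded 2, imputed as 0.089 0.526 0.385) together with two hypothetical sites typed as 0 and imputed as 0.998 0.002 0 and 0.971 0.029 0. The first site is discordant and the other two agree, so the concordance is two thirds.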

  6. J1 said:

    This is just awesome! Thank you very much for picking up on this. I do not know why it has taken so long to move this one forward. Impute2 was there right off the shelf!

    It was just a question of having a bigger computer and tweaking the concordances and quality control.

    Something that I am interested in now is phasing an exome file.
    A family member has an exome VCF file and we would love to separate out the variants into phased chromosomes. This is sort of what imputing does, though at the end of imputation this phasing information is lost.

    It would be so great to have this information as it would help quite a bit in genealogy and in searching for the origin of genetic traits.

    Keep up the good work!

  7. Dave Clancy said:

    Hello

    do you know if impute2 can impute mtDNA?

    If so, are there examples in the literature? I’m struggling to find any.

    with thanks
    Dave Clancy
