github.com-nathan-nakatsuka-ContamLD_-_2021-03-05_22-11-18
Item Preview
Share or Embed This Item
Flag this item for
- Publication date
- 2021-03-05
None
ContamLD
ContamLD is a software designed to estimate autosomal contamination in ancient DNA samples. The software requires Python 3 and R (any version should suffice).
Citation: Nakatsuka, N.*; Harney, E.*; Mallick, S..; Mah, M.; Patterson, N.; Reich, D. "Estimation of Ancient Nuclear DNA Contamination Using Breakdown of Linkage Disequilibrium." BioRxiv.
Contact: Nathan Nakatsuka: nathan_nakatsuka@hms.harvard.edu
Steps for use:
Section 1) Pre-processing steps
Part 1) Download panels or prepare your own panels.
Step 1) Download panels from https://reichdata.hms.harvard.edu/pub/datasets/release/contamLD
Note: In most cases you should download the 1240K panels. If you have low coverage (<0.5X) whole-genome shotgun sequences, then you can try the SG_panels for improved power at the expense of significantly increased running time and memory requirements.
Step 2) Put the panels in the "panels" folder where the "helperdir" folder also is located (referred to as "directory_orig" below).
Note: If you have a SNP set that is very different from the 1240K SNP set or whole-genome shotgun set then follow steps in "HowtoMakeYourOwnPanel.txt" in the "PreProcessing" folder to make your own panel.
Part 2) Pull down reads onto SNP set.
ContamLD requires 2 sets of files, one with all reads and one with only damaged reads, both in the following format (tab delimited):
SNPID Chrom* Position* REF ALT :: IndName* REFallelecount* ALTallelecount*
Chrom is chromosome number. Position is the position of the SNP on the chromosome in Hg19 coordinates. REFallelecount and ALTallelecount refer to the number of reads mapping to the reference or alternative allele, respectively (based on Hg19). * indicates the necessary columns (the other columns can be filled with place holders, but the Chrom, Position, IndName, REFallelecount, and ALTallelecount must be in the 2nd, 3rd, 7th, 8th, and 9th columns, respectively). The files should have reads corresponding to the 1240K.snp or SG.snp files in the "PreProcessing" folder, or the snp files corresponding to the panels prepared by the user.
Step 1) To obtain damaged reads, use PMDtools (https://github.com/pontussk/PMDtools) with a PMDscore threshold of 3.
Step 2) The read count information can be obtained in any of the following ways:
Note: Each readdepth file must contain only a single individual to be compatible for Section 2.
*Option 1)* Use samtools mpileup on BAM files with the appropriate base and mapping quality cutoffs (e.g. 30).
*Option 2)* If you have eigenstrat files and are unable to pull down read information from BAM files, use the "eig2readdepth.py" script in the "PreProcessing" folder to transform eigenstrat files to readdepth files.
(Note: this has less power than the read based method because it ignores reads that map to the same site)
-Put files in the format: IndNameAll.snp, IndNameAll.ind, IndNameAll.geno and damaged reads: IndNamedam.snp, IndNamedam.ind, IndNamedam.geno
Use this file for eigenstrat format files in pseudo-haploid format (one read chosen to represent the genotype, either 0=ALT or 2=REF; no heterozygotes). If you have diploid data with heterozygotes, use the -d flag. In the script below, "Prefix" stands for the prefix before .snp, .ind, .geno, so IndNameAll or IndName_dam.pythonpython eig2readdepth.py [-d] Prefix
Future Option 3) Before the end of 2020 we plan to release a pulldown program that will allow users to automatically generate the readdepth files, including damage restricted versions, required for ContamLD (as well as genotype files for other purposes).
Step 3) Name the files "IndNameAll.readdepth" and "IndNamedam.readdepth" (IndName is the name of that particular individual. IndNameAll.readdepth is the file corresponding to all reads, and IndNamedam.readdepth is the file corresponding to only damaged reads for that individual.).
Part 3) Determine what panel the target individual is genetically closest to.
Follow "HowtoDetermineClosestPanel_Outgroupf3.txt" instructions in the "PreProcessing" folder to run outgroup-f3 statistics to determine which panel is genetically closest to the target individual.
Note: Guessing on this step is okay as long as the sample is within continental ancestry variation of the 1000 Genomes population.
Part 4) Create a file with the names of all individuals and their corresponding panels.
Create file called "GroupNameinds.txt" (where GroupName is the name of your collection of individuals) in the following format, where the 1000Genomes population is determined from Section 1 Part 3, and put it in the same directory as the readdepth files:
IndName1 1000GenomesPopclosesttoIndName1
IndName2 1000GenomesPopclosesttoIndName2
IndName3 1000GenomesPopclosesttoIndName_3
Section 2) Run Contamination Estimate
After you have readdepth files, the panels (placed in the panels folder), and the GroupNameinds.txt file, run ContamLD.
Note: Run this with 3 cores if possible.
-In the following notation: "directoryorig" is the directory where the helperdir and panels folders are; "directoryfiles" is the directory where your .readdepth and GroupNameinds.txt files are; "Panel_Type" is the type of panel: 1240K, SG, or your own.
Run the following:
```cd directoryfilesmkdir -p directoriescd directoryorig
!/bin/bash
while read IndName panel; dobash directoryorig/helperdir/ContamLDRunningScript.txt directoryorig directoryfiles ${IndName} ${panel} PanelTypedone < directoryfiles/GroupName_inds.txt```
Section 3) Post-processing
After ContamLD is run, the final values and standard errors are obtained with this script.Note: The script will automatically do both the damage correction and the external correction version. Set "ExternalCorrectionValue" to be the external correction score of an uncontaminated individual of the same group as your target individual. If you do not have this, set the score to 0. Panel_Type is the type of panel: 1240K, SG, or your own.
Note: The first time this script is run, sometimes it has an error because the files are not yet finished before they are needed for another script. If this happens, re-run the script. If it does not work the second time, then something went wrong upstream of this.cd directory_origbash directory_orig/helperdir/Post_Processing_New.txt directory_orig directory_files GroupName_inds.txt External_Correction_Value Panel_Type
Examining the output of ContamLD:
The files of interest will be in the directory "directoryfiles". The damage restriction corrected estimates will be named "FinalContamScoresKdamageratioPanelTypeGroupNameinds.txt". The external correction estimates will be named "FinalContamScoresExtCorrPanelTypeGroupNameinds.txt"
If the warning "ModelMisspecified" shows up, this usually means the coverage is very low and the estimate might not be reliable. The warning "VeryHighContamination" sometimes comes up even if there is low contamination if there is contamination from another ancient source or the panel is very diverged from the ancestry of the target individual (e.g. this occurs with most Native American individuals and some African groups).
Example to test:
There is an example to test in the folder "exampledir". These are 2 samples (I7210 and I7278) with 0.04 contamination from an ancient West Eurasian (I10895).
If you run these samples with the Example_inds.txt file, you should get approximately the following estimates:
I7210: 0.039
I7278: 0.040
The files that should be generated are in the "directories" folder in the "exampledir" folder. To test if ContamLD is running properly for you, move these files to another location and see if the files you generate are approximately the same as these files.
To restore the repository download the bundle
wget https://archive.org/download/github.com-nathan-nakatsuka-ContamLD_-_2021-03-05_22-11-18/nathan-nakatsuka-ContamLD_-_2021-03-05_22-11-18.bundle
and run: git clone nathan-nakatsuka-ContamLD_-_2021-03-05_22-11-18.bundle
Source: https://github.com/nathan-nakatsuka/ContamLD
Uploader: nathan-nakatsuka
Upload date: 2021-03-05
- Addeddate
- 2021-07-06 17:13:12
- Identifier
- github.com-nathan-nakatsuka-ContamLD_-_2021-03-05_22-11-18
- Originalurl
-
https://github.com/nathan-nakatsuka/ContamLD
- Pushed_date
- 2021-03-05 22:11:18
- Scanner
- Internet Archive Python library 1.9.9
- Uploaded_with
- iagitup - v1.6.2
- Year
- 2021