The ANFO ancient DNA mapper
Where to start?
Get the prerequisites, compile, create documentation, test. Also see the GitHub wiki.
Not too ancient G++; 4.0 and later work fine, 3.x might cause some difficulties. Also make, a shell, and so on.
protobuf. Find it at http://code.google.com/p/protobuf
popt. You might already have it, else ask your admin or grab it at http://rpm5.org/files/popt
Doxygen, if you want the documentation. Can be found at http://doxygen.org
zlib from zlib.net
libbz2 from bzip2.org
Elk from http://sam.zoy.org/elk/
If you have those, call './configure', 'make', then 'make doc' to create the documentation, then 'make install'.
How to use it?
To test, you need a genome, an index, and some input in FastA or FastQ format:
Locate a genome in FASTA format (or any other format that has a converter; .2bit is fine, my own .dna isn't). Run fa2dna on it (--help tells you how); you can pipe the input if you want.
On your new .dna file, run file-info. It should spit out the metadata for that genome. If not, you're hosed...
Build an index for the .dna file using dnaindex (--help tells you how, the defaults are probably fine). Use only the basename of the genome; it will figure out the extension by itself.
You can run file-info on that, too. It will tell you the word size you just set. If not, you're hosed again...
Write a sensible anfo.cfg using the examples in example/ and config.proto as a guideline.
You can run index-test or anfo now. index-test just does lookups in the index, and optionally (on simulated data) checks if a seed in the correct region was found.
anfo can be run on any FASTA/FASTQ file now.
anfo-tool can operate on the output files (but it's cumbersome to use)
the Elk binding can do what anfo and anfo-tool do, is much more flexible, but undocumented.
What's stable and what isn't?
Pretty much stabilized are...
Genome handling and creation (files sequence.h, sequence.cc, index.h, index.cc, fa2dna.cc)
I store genomes with four bits per nucleotide, which sounded like a fscking brilliant idea, since it's only half as wasteful as the text form and still allows ambiguity codes. The headaches came with the implementation of the auxiliary DnaP class... Anyway, the nucleotides A,C,T,G map to bits 0,1,2,3. That order (you did notice T coming before G, didn't you?) also sounded like a fscking brilliant idea, but in reality it doesn't matter and I keep mixing it up. Chromosomes are split into contigs at long stretches of Ns; contigs are separated by single gaps, and the first and last ones terminate in a gap at either side. This means you can start anywhere and safely run forward or backward until you hit a gap.
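To make the four-bit layout concrete, here is a minimal sketch. It is not the code from sequence.h. It assumes that "bits 0,1,2,3" means one bit per base inside a nibble (so A=1, C=2, T=4, G=8, and ambiguity codes are bitwise ORs), that a gap is the all-zero nibble, and that even positions sit in the low nibble of each byte. The names encode_base, pack_contig, base_at and run_forward are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed nibble codes: one bit per base, in the order A, C, T, G.
constexpr std::uint8_t kGap = 0, kA = 1, kC = 2, kT = 4, kG = 8, kN = 15;

// Map a FASTA character to its 4-bit code (only a few IUPAC codes shown).
inline std::uint8_t encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return kA;
        case 'C': case 'c': return kC;
        case 'T': case 't': return kT;
        case 'G': case 'g': return kG;
        case 'R': case 'r': return kA | kG;  // purine: A or G
        case 'Y': case 'y': return kC | kT;  // pyrimidine: C or T
        case '-':           return kGap;
        default:            return kN;       // anything else counts as N
    }
}

// Pack one contig, two bases per byte, with a gap before and after it.
inline std::vector<std::uint8_t> pack_contig(const std::string& seq) {
    const std::size_t n = seq.size() + 2;            // leading + trailing gap
    std::vector<std::uint8_t> out((n + 1) / 2, 0);   // zero-filled: gaps are already there
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const std::size_t pos = i + 1;               // position 0 is the leading gap
        const unsigned shift = 4 * (pos % 2);        // even -> low nibble, odd -> high nibble
        out[pos / 2] = static_cast<std::uint8_t>(out[pos / 2] | (encode_base(seq[i]) << shift));
    }
    return out;
}

// Fetch the 4-bit code at a nucleotide position.
inline std::uint8_t base_at(const std::vector<std::uint8_t>& packed, std::size_t pos) {
    assert(pos / 2 < packed.size());
    return static_cast<std::uint8_t>((packed[pos / 2] >> (4 * (pos % 2))) & 0xF);
}

// Count bases from pos up to (not including) the next gap.
// No bounds bookkeeping needed: the terminating gap stops the loop.
inline std::size_t run_forward(const std::vector<std::uint8_t>& packed, std::size_t pos) {
    std::size_t len = 0;
    while (base_at(packed, pos + len) != kGap) ++len;
    return len;
}
```

With these assumptions, pack_contig("ACGT") needs three bytes for six positions (gap, A, C, G, T, gap), and run_forward from position 1 walks four bases before the trailing gap stops it.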
Index handling and creation (files index.h, index.cc, dnaindex.cc)
The index is quite simple: oligos are mapped to integer offsets, a first level array contains a pointer to a second array for each of the possible oligos, and the second level array contains one long list of positions where these oligos were found. Only the forward strand is indexed, but lookup is of course done for both strands. (This primitive thing seems such a waste when the much better FM-index family is just out of reach... too bad, for now.)
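The two-level structure can be sketched roughly as follows. This is an in-memory illustration only, not the on-disk format written by dnaindex. It assumes a plain ACGT string as input, a 2-bit code per base in the order A, C, T, G, and 32-bit positions; OligoIndex, build_index and lookup are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// 2-bit code in the order A, C, T, G; -1 for anything else.
inline int base_code(char c) {
    switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'T': return 2; case 'G': return 3;
        default:  return -1;
    }
}

// Map the k-mer starting at s[pos] to an integer; false if it contains a non-ACGT base.
inline bool oligo_to_int(const std::string& s, std::size_t pos, unsigned k, std::uint32_t& out) {
    out = 0;
    for (unsigned i = 0; i < k; ++i) {
        const int c = base_code(s[pos + i]);
        if (c < 0) return false;
        out = (out << 2) | static_cast<std::uint32_t>(c);
    }
    return true;
}

struct OligoIndex {
    unsigned k = 0;
    std::vector<std::uint32_t> first;   // 4^k + 1 entries: first[w]..first[w+1] delimit oligo w's slice
    std::vector<std::uint32_t> second;  // one long list of positions, grouped by oligo
};

// Build the index over the forward strand only (two passes: count, then fill).
inline OligoIndex build_index(const std::string& genome, unsigned k) {
    assert(k >= 1 && k <= 12);
    OligoIndex idx;
    idx.k = k;
    idx.first.assign((std::size_t(1) << (2 * k)) + 1, 0);
    if (genome.size() < k) return idx;
    const std::size_t n = genome.size() - k + 1;
    std::uint32_t w = 0;
    for (std::size_t p = 0; p < n; ++p)
        if (oligo_to_int(genome, p, k, w)) ++idx.first[w + 1];
    for (std::size_t i = 1; i < idx.first.size(); ++i) idx.first[i] += idx.first[i - 1];
    idx.second.resize(idx.first.back());
    std::vector<std::uint32_t> cursor(idx.first.begin(), idx.first.end() - 1);
    for (std::size_t p = 0; p < n; ++p)
        if (oligo_to_int(genome, p, k, w)) idx.second[cursor[w]++] = static_cast<std::uint32_t>(p);
    return idx;
}

inline std::string reverse_complement(const std::string& s) {
    std::string r(s.rbegin(), s.rend());
    for (char& c : r) {
        switch (c) {
            case 'A': c = 'T'; break; case 'T': c = 'A'; break;
            case 'C': c = 'G'; break; case 'G': c = 'C'; break;
            default:  c = 'N';
        }
    }
    return r;
}

// Look up an oligo on both strands; each hit is (position, found via reverse complement).
inline std::vector<std::pair<std::uint32_t, bool>> lookup(const OligoIndex& idx, const std::string& oligo) {
    std::vector<std::pair<std::uint32_t, bool>> hits;
    if (oligo.size() != idx.k) return hits;
    const std::string rc = reverse_complement(oligo);
    const std::string* queries[2] = {&oligo, &rc};
    for (int strand = 0; strand < 2; ++strand) {
        std::uint32_t w = 0;
        if (!oligo_to_int(*queries[strand], 0, idx.k, w)) continue;
        for (std::uint32_t i = idx.first[w]; i < idx.first[w + 1]; ++i)
            hits.emplace_back(idx.second[i], strand == 1);
    }
    return hits;
}
```

The first level here stores offsets rather than raw pointers, which keeps the sketch short; both serve the same purpose of locating an oligo's slice of the position list. Note that a palindromic oligo is reported once per strand in this version.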
Handling of reads (files sequence.h and sequence.cc)
Just a simple structure for sequences with quality scores and a FastA/FastQ/FourQ reader. There's no support for mate pairs, I haven't even decided what to do about them.
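As an illustration of what such a structure might look like, here is a minimal sketch of a read with quality scores and a reader for plain four-line FastQ records. It is not the class from sequence.h. It assumes Phred+33 quality encoding and one sequence line per record, and it does not handle FastA or FourQ; Read and read_fastq are hypothetical names.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Read {
    std::string name;                // header line without the leading '@'
    std::string seq;                 // nucleotides as text
    std::vector<std::uint8_t> qual;  // one Phred score per base
};

// Read one four-line FastQ record. Returns false at end of input
// or if the record is malformed.
inline bool read_fastq(std::istream& in, Read& r) {
    std::string header, seq, plus, qual;
    if (!std::getline(in, header) || !std::getline(in, seq) ||
        !std::getline(in, plus) || !std::getline(in, qual))
        return false;
    if (header.empty() || header[0] != '@') return false;
    if (plus.empty() || plus[0] != '+') return false;
    if (seq.size() != qual.size()) return false;

    r.name = header.substr(1);
    r.seq = seq;
    r.qual.clear();
    for (char c : qual)
        r.qual.push_back(static_cast<std::uint8_t>(c - 33));  // assumed Phred+33 offset
    return true;
}
```

Calling read_fastq in a loop on a std::ifstream or std::istringstream yields one Read per record until the input runs out.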
File formats (genome, index, config file, output)
Genome and index files are documented somewhere in the code. Metadata, config, complex output, etc. are encapsulated in protobuf messages. That way I don't need to mess with parsers and pretty printers and the messages are extensible. The two .proto files are more or less finalized, but it's easy to change them without breaking stuff, which also means you should think before changing them.
The aligner (align.h, align.cc)
Completely rewritten now, it is much faster than the previous version and practically finished. The general structure and the specific alignment mechanisms are somewhat separated from each other, making extensions like a special 454 aligner possible at least in principle.
Anfo is (C) 2009 by Udo Stenzel email@example.com
ANFO is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
To restore the repository download the bundle
git clone mpieva-anfo_-_2014-07-07_11-54-54.bundle
Upload date: 2014-07-07