index_fasta.py - Index fasta formatted files

Tags

Genomics Sequences FASTA Manipulation

Purpose

This script indexes one or more fasta-formatted files into a database that other scripts in the cgat code collection, and IndexedFasta, can use for quick access to a particular part of a sequence. This is very useful for large genomic sequences.

By default, the database is itself a fasta formatted file in which all line breaks and other white space characters have been removed. Compression methods are available to conserve disk space, though they do come at a performance penalty.

The script implements several indexing and compression methods. The default method uses no compression and builds a simple random access index based on a table of absolute file positions. The sequence is stored in a plain fasta file with one line per sequence, so that a sequence segment can be extracted by direct file positioning.
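The default method can be illustrated with a minimal sketch. It is not the script's actual code or on-disk format: the function names `build_index` and `fetch_segment`, the in-memory index dictionary, and the zero-based half-open coordinates are assumptions made for this example only.

```python
import io


def build_index(fasta_lines, out):
    """Write each record to `out` as a header plus ONE sequence line.

    Returns {name: (offset, length)}, where offset is the absolute
    position of the first base in `out`. Hypothetical helper.
    """
    index = {}
    name, chunks = None, []
    # A trailing ">" acts as a sentinel that flushes the last record.
    for line in list(fasta_lines) + [">"]:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seq = "".join(chunks).encode("ascii")
                out.write(b">" + name.encode("ascii") + b"\n")
                index[name] = (out.tell(), len(seq))
                out.write(seq + b"\n")
            # Use the first word of the header as the identifier.
            name = line[1:].split()[0] if len(line) > 1 else None
            chunks = []
        elif line:
            # Remove all white space inside the sequence.
            chunks.append("".join(line.split()))
    return index


def fetch_segment(db, index, name, start, end):
    """Return bases [start, end) of `name` using a single seek."""
    offset, length = index[name]
    start, end = max(0, start), min(end, length)
    if start >= end:
        return b""
    db.seek(offset + start)
    return db.read(end - start)


db = io.BytesIO()
index = build_index([">chr1 test", "ACGT", "ACGT", ">chr2", "TTTT GGGG"], db)
print(fetch_segment(db, index, "chr1", 2, 6))  # b'GTAC'
```

Because every sequence occupies exactly one line, the position of any base is the stored offset plus the base's coordinate, so no line-length arithmetic is needed.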

Alternatively, the sequence can be block-compressed using different compression methods (gzip, lzo, bzip). These are mostly for research purposes.
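The idea behind block compression with random access can be sketched as follows. This is an illustration only, assuming zlib and a fixed number of bases per block; the helper names `compress_blocks` and `fetch_compressed` are hypothetical and the layout is not the format the script writes.

```python
import zlib


def compress_blocks(sequence, block_size):
    """Compress `sequence` (bytes) in independent blocks of `block_size` bases.

    Returns the concatenated compressed data and a list of byte offsets:
    offsets[i] is where block i starts, and the last entry marks the end.
    """
    data = bytearray()
    offsets = [0]
    for pos in range(0, len(sequence), block_size):
        data += zlib.compress(sequence[pos:pos + block_size])
        offsets.append(len(data))
    return bytes(data), offsets


def fetch_compressed(data, offsets, block_size, start, end):
    """Return bases [start, end), decompressing only the blocks that overlap."""
    start = max(0, start)
    first = start // block_size
    last = min((end - 1) // block_size, len(offsets) - 2)
    chunks = [
        zlib.decompress(data[offsets[i]:offsets[i + 1]])
        for i in range(first, last + 1)
    ]
    joined = b"".join(chunks)
    shift = first * block_size  # coordinate of the first decompressed base
    return joined[start - shift:end - shift]


sequence = b"ACGT" * 250
data, offsets = compress_blocks(sequence, 100)
print(fetch_compressed(data, offsets, 100, 95, 105) == sequence[95:105])  # True
```

The block size is the trade-off: larger blocks tend to compress better, but every lookup then has to decompress more bases that are not needed, which is one source of the performance penalty.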

See also http://pypi.python.org/pypi/pyfasta for another implementation. Samtools provides similar functionality with the samtools faidx command, and block compression has been implemented in the `bgzip <http://samtools.sourceforge.net/tabix.shtml>`_ tool.

The script permits supplying synonyms to the database index. For example, setting --synonyms=chrM=chrMT will ensure that the mitochondrial genome sequence is returned both for the keys chrM and chrMT.
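One way to picture synonyms is as extra keys in the index that point at the same entry. The sketch below assumes that the name before the equals sign is the existing key and the name after it is the alias, and that several pairs would be separated by commas; neither detail is stated here, and `parse_synonyms` and `add_synonyms` are hypothetical names.

```python
def parse_synonyms(spec):
    """Turn 'chrM=chrMT' into {'chrMT': 'chrM'} (alias -> existing key)."""
    synonyms = {}
    for pair in spec.split(","):  # comma separator is an assumption
        if not pair.strip():
            continue
        key, sep, alias = pair.partition("=")
        key, alias = key.strip(), alias.strip()
        if not sep or not key or not alias:
            raise ValueError("expected KEY=ALIAS, got %r" % pair)
        synonyms[alias] = key
    return synonyms


def add_synonyms(index, synonyms):
    """Return a copy of `index` in which each alias shares its key's entry."""
    merged = dict(index)
    for alias, key in synonyms.items():
        if key not in index:
            raise KeyError("cannot alias unknown key %r" % key)
        merged[alias] = index[key]
    return merged


index = {"chrM": (6, 16571)}  # made-up (offset, length) entry
merged = add_synonyms(index, parse_synonyms("chrM=chrMT"))
print(merged["chrMT"] == merged["chrM"])  # True
```

Since the alias only duplicates an index entry, the sequence itself is stored once.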

Examples

Index a collection of fasta files in a compressed archive:

python index_fasta.py oa_ornAna1_softmasked ornAna1.fa.gz > oa_ornAna1_softmasked.log

To retrieve a segment:

python index_fasta.py --extract=chr5:1000:2000 oa_ornAna1_softmasked

Indexing from a tar file is possible:

python index_fasta.py oa_ornAna1_softmasked ornAna1.tar.gz > oa_ornAna1_softmasked.log

Indexing from stdin requires the - placeholder:

zcat ornAna1.fa.gz | python index_fasta.py oa_ornAna1_softmasked - > oa_ornAna1_softmasked.log

Usage

Type:

cgat index_genome DATABASE [SOURCE...|-] [OPTIONS]
cgat index_genome DATABASE [SOURCE...|-] --compression=COMPRESSION --random-access-points=100000

To create an indexed DATABASE from SOURCE. Supply - as SOURCE to read from stdin. If the output is to be compressed, a spacing for the random access points must be supplied.

Type:

cgat index_genome DATABASE --extract=CONTIG:[STRAND]:START:END

To extract the bases on strand STRAND between START and END of entry CONTIG in DATABASE.
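The CONTIG:[STRAND]:START:END form can be parsed along these lines. This is a sketch rather than the script's own parser: `parse_region` is a hypothetical name, the strand symbols + and - and the forward-strand default are assumptions, contig names containing colons are not handled, and the coordinate convention is left to the caller.

```python
def parse_region(region):
    """Split 'CONTIG:[STRAND]:START:END' into (contig, strand, start, end)."""
    fields = region.split(":")
    if len(fields) == 3:
        # Strand omitted, as in chr5:1000:2000.
        contig, start, end = fields
        strand = "+"
    elif len(fields) == 4:
        contig, strand, start, end = fields
        strand = strand or "+"  # an empty strand field also means forward
    else:
        raise ValueError("cannot parse region %r" % region)
    if strand not in ("+", "-"):
        raise ValueError("unknown strand %r" % strand)
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start %d is larger than end %d" % (start, end))
    return contig, strand, start, end


print(parse_region("chr5:1000:2000"))    # ('chr5', '+', 1000, 2000)
print(parse_region("chr5:-:1000:2000"))  # ('chr5', '-', 1000, 2000)
```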

Command line options

usage: index-fasta [-h] [--version] [-e EXTRACT]
                   [-i {one-forward-open,zero-both-open}] [-s SYNONYMS] [-b]
                   [--benchmark-num-iterations BENCHMARK_NUM_ITERATIONS]
                   [--benchmark-fragment-size BENCHMARK_FRAGMENT_SIZE]
                   [--verify VERIFY]
                   [--verify-iterations VERIFY_NUM_ITERATIONS]
                   [--file-format {fasta,auto,fasta.gz,tar,tar.gz}] [-a]
                   [--allow-duplicates] [--regex-identifier REGEX_IDENTIFIER]
                   [--force-output] [-t {solexa,phred,bytes,range200}]
                   [-c {lzo,zlib,gzip,dictzip,bzip2,debug}]
                   [--random-access-points RANDOM_ACCESS_POINTS]
                   [--compress-index] [--timeit TIMEIT_FILE]
                   [--timeit-name TIMEIT_NAME] [--timeit-header]
                   [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                   [--log-config-filename LOG_CONFIG_FILENAME]
                   [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                   [-E STDERR] [-S STDOUT]