index_fasta.py - Index fasta formatted files
Tags: Genomics Sequences FASTA Manipulation
Purpose
This script indexes one or more fasta formatted files into a database that can be used by other scripts in the cgat code collection and by IndexedFasta for quick access to a particular part of a sequence. This is very useful for large genomic sequences.
By default, the database is itself a fasta formatted file from which all line breaks and other whitespace characters within a sequence have been removed. Compression methods are available to conserve disk space, though they come at a performance penalty.
The script implements several indexing and compression methods. The default method uses no compression and builds a simple random access index based on a table of absolute file positions. The sequence is stored in a plain fasta file with one line per sequence, so that a sequence segment can be extracted by direct file positioning.
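The default layout described above can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the function names `build_index` and `fetch`, the in-memory index dictionary, and the zero-based half-open coordinates are assumptions made for illustration only.

```python
import io


def build_index(fasta_text):
    """Flatten FASTA records to one line per sequence and index them.

    Returns (flat, index). `flat` is the flattened file content as bytes;
    `index` maps each sequence name to (absolute offset, sequence length).
    Hypothetical helper, not part of the cgat code.
    """
    records = []  # list of (name, [sequence chunks])
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first whitespace-delimited token is the identifier.
            records.append((line[1:].split()[0], []))
        elif records:
            # Drop any whitespace inside the sequence line.
            records[-1][1].append("".join(line.split()))

    out = io.BytesIO()
    index = {}
    for name, chunks in records:
        seq = "".join(chunks).encode("ascii")
        out.write(b">" + name.encode("ascii") + b"\n")
        # The sequence starts right after the header line.
        index[name] = (out.tell(), len(seq))
        out.write(seq + b"\n")
    return out.getvalue(), index


def fetch(flat, index, name, start, end):
    """Return bases [start, end) of `name` by seeking to an absolute position."""
    offset, length = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("coordinates out of range")
    handle = io.BytesIO(flat)  # stands in for an open file on disk
    handle.seek(offset + start)
    return handle.read(end - start).decode("ascii")


flat, index = build_index(">chr1 demo\nACGT\nACGT\n>chr2\nTTTTGG\n")
print(fetch(flat, index, "chr1", 2, 6))
```

Because every sequence occupies a single line, the position of any base is the stored offset plus its coordinate, so a lookup costs one seek and one read regardless of sequence length.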
Alternatively, the sequence can be block-compressed using one of several compression methods (for example gzip, lzo or bzip2). These are mostly for research purposes.
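The idea behind block compression with random access can be sketched as follows. This is a hedged illustration, not the script's code: it assumes that each block is compressed independently with zlib and that the block size plays the role of the spacing between random access points. The names `compress_blocks` and `fetch_compressed` are hypothetical.

```python
import zlib


def compress_blocks(seq, block_size):
    """Compress a sequence in independent fixed-size blocks."""
    data = seq.encode("ascii")
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def fetch_compressed(blocks, block_size, start, end):
    """Return bases [start, end), decompressing only the blocks that overlap."""
    if start >= end:
        return ""
    first = start // block_size
    last = (end - 1) // block_size
    raw = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    skip = start - first * block_size  # offset of `start` inside the first block
    return raw[skip:skip + (end - start)].decode("ascii")


seq = "ACGT" * 25
blocks = compress_blocks(seq, 16)
assert fetch_compressed(blocks, 16, 10, 40) == seq[10:40]
```

Each lookup has to decompress at least one whole block, which is one way to see where the performance penalty of the compressed formats comes from. Smaller blocks reduce that cost but usually compress less well.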
See also http://pypi.python.org/pypi/pyfasta for another implementation. Samtools provides similar functionality with the samtools faidx command, and block compression has been implemented in the bgzip tool (http://samtools.sourceforge.net/tabix.shtml).
The script permits supplying synonyms to the database index. For example, setting --synonyms=chrM=chrMT will ensure that the mitochondrial genome sequence is returned for both the keys chrM and chrMT.
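Conceptually, a synonym is a second key that points at the same index entry. The sketch below illustrates this with a plain dictionary. The helper `add_synonym` is hypothetical, and reading the left-hand name as the existing key and the right-hand name as the alias is an assumption, not something the text above specifies.

```python
def add_synonym(index, spec):
    """Register an alias from a KEY=ALIAS string (assumed direction)."""
    key, alias = spec.split("=", 1)
    if key not in index:
        raise KeyError(f"unknown sequence: {key}")
    # Both names now resolve to the same (offset, length) entry.
    index[alias] = index[key]
    return index


# Toy index entry; the offset and length values are made up.
index = {"chrM": (6, 100)}
add_synonym(index, "chrM=chrMT")
assert index["chrMT"] == index["chrM"]
```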
Examples
Index a collection of fasta files in a compressed archive:
python index_fasta.py oa_ornAna1_softmasked ornAna1.fa.gz > oa_ornAna1_softmasked.log
To retrieve a segment:
python index_fasta.py --extract=chr5:1000:2000 oa_ornAna1_softmasked
Indexing from a tar file is possible:
python index_fasta.py oa_ornAna1_softmasked ornAna1.tar.gz > oa_ornAna1_softmasked.log
Indexing from stdin requires the - placeholder as SOURCE:
zcat ornAna1.fa.gz | python index_fasta.py oa_ornAna1_softmasked - > oa_ornAna1_softmasked.log
Usage
Type:
cgat index_genome DATABASE [SOURCE...|-] [OPTIONS]
cgat index_genome DATABASE [SOURCE...|-] --compression=COMPRESSION --random-access-points=100000
to create an indexed DATABASE from SOURCE. Supply - as SOURCE to read from stdin. If the output is to be compressed, a spacing for the random access points must be supplied.
Type:
cgat index_genome DATABASE --extract=CONTIG:[STRAND]:START:END
to extract the bases between START and END on strand STRAND of entry CONTIG from DATABASE.
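The region string can be read as colon-separated fields with an optional strand. The following sketch shows one way to parse it. It is not the script's own parser: the function name `parse_region`, the strand symbols + and -, and the default of + when no strand is given are assumptions.

```python
def parse_region(spec):
    """Split CONTIG:[STRAND]:START:END into (contig, strand, start, end)."""
    parts = spec.split(":")
    if len(parts) == 3:
        # Strand field omitted entirely, as in chr5:1000:2000.
        contig, start, end = parts
        strand = "+"
    elif len(parts) == 4:
        contig, strand, start, end = parts
        strand = strand or "+"  # empty strand field falls back to the default
    else:
        raise ValueError(f"cannot parse region: {spec!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    return contig, strand, int(start), int(end)


assert parse_region("chr5:1000:2000") == ("chr5", "+", 1000, 2000)
assert parse_region("chr5:-:1000:2000") == ("chr5", "-", 1000, 2000)
```

A parser of this shape cannot handle contig names that themselves contain a colon.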
Command line options
usage: index-fasta [-h] [--version] [-e EXTRACT]
                   [-i {one-forward-open,zero-both-open}] [-s SYNONYMS] [-b]
                   [--benchmark-num-iterations BENCHMARK_NUM_ITERATIONS]
                   [--benchmark-fragment-size BENCHMARK_FRAGMENT_SIZE]
                   [--verify VERIFY]
                   [--verify-iterations VERIFY_NUM_ITERATIONS]
                   [--file-format {fasta,auto,fasta.gz,tar,tar.gz}] [-a]
                   [--allow-duplicates] [--regex-identifier REGEX_IDENTIFIER]
                   [--force-output] [-t {solexa,phred,bytes,range200}]
                   [-c {lzo,zlib,gzip,dictzip,bzip2,debug}]
                   [--random-access-points RANDOM_ACCESS_POINTS]
                   [--compress-index] [--timeit TIMEIT_FILE]
                   [--timeit-name TIMEIT_NAME] [--timeit-header]
                   [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                   [--log-config-filename LOG_CONFIG_FILENAME]
                   [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                   [-E STDERR] [-S STDOUT]