IndexedFasta.py - fast random access in fasta files

This module provides fast random access to fasta formatted files that have been previously indexed. The indexing can be done either through the samtools faidx tool (accessible through pysam) or using the in-house methods implemented in this module.

The main class is IndexedFasta. This is a factory function that provides transparent access to both samtools or cgat indexed fasta files. The basic usage to retrieve the sequence spanning the region chr12:10,000-10,100 is:

from IndexedFasta import IndexedFasta
fasta = IndexedFasta("hg19")
fasta.getSequence("chr12", "+", 10000, 10100)

To index a file, use the scripts/index_fasta command line utility or the createDatabase() function:

> python index_fasta.py hg19 chr*.fa

This module has some useful utility functions:

splitFasta()

split a fasta formatted file into smaller pieces.

parseCoordinates()

parse a coordinate string in various formats

but otherwise the module contains a multitude of additional functions that are only of internal use.

Reference

IndexedFasta.writeFragments(outfile_fasta, outfile_index, fragments, mangler, size, write_all=False)

write mangled fragments to outfile_fasta in chunks of size updating outfile_index.

returns part of last fragment that has not been written and is less than size and the number of fragments output.

If write_all is True, all of the fragments are written to the file and the last file position is added to outfile_index as well.

class IndexedFasta.Translator

Bases: object

translate a sequence.

class IndexedFasta.TranslatorPhred(*args, **kwargs)

Bases: IndexedFasta.Translator

translate phred quality scores.

class IndexedFasta.TranslatorSolexa(*args, **kwargs)

Bases: IndexedFasta.Translator

translate solexa quality scores.

class IndexedFasta.TranslatorRange200(*args, **kwargs)

Bases: IndexedFasta.Translator

translate pcap quality scores.

For example for PCAP scores.

These scores range from 0 to 100 and are the “a weighted sum of input base quality values (Huang and Madan 1999)

The numerical values from 0 to 200 are stored as values form 33 to 233 “

class IndexedFasta.TranslatorBytes(*args, **kwargs)

Bases: IndexedFasta.Translator

output binary values as bytes permitting values from 0 to 255

Note the resulting file will not be iterable as newline is not a record-separator any more.

IndexedFasta.createDatabase(db, iterator, force=False, synonyms=None, compression=None, random_access_points=None, regex_identifier=None, clean_sequence=False, ignore_duplicates=False, allow_duplicates=False, translator=None)

index files in filenames to create database.

Two new files are created - db.fasta and db_name.idx

If compression is enabled, provide random access points every # bytes.

Dictzip is treated as an uncompressed file.

regex_identifier: pattern to extract identifier from description line. If None, the part until the first white-space character is used.

translator: specify a translator

class IndexedFasta.cgatIndexedFasta(dbname)

Bases: object

an indexed fasta file.

setTranslator(translator=None)

set the Translator to use.

getDatabaseName()

returns the name of the database.

getToken(contig)

check if token is in index.

getLength(contig)

return sequence length for sbjct_token.

getLengths()

return all sequence lengths.

compressIndex()

compress index. Creates a database interface to an index.

getContigs()

return a list of contigs (no synonyms).

getContigSizes(with_synonyms=True)

return hash with contig sizes including synonyms.

setConverter(converter)

set converter from coordinate system to 0-based, both strand, open/closed coordinate system.

getSequence(contig, strand='+', start=0, end=0, converter=None, as_array=False)

get a genomic fragment.

A genomic fragment is identified by the coordinates contig, strand, start, end.

The converter function supplied translated these coordinates into 0-based coordinates. By default, start and end are assumed to be pythonic coordinates and are forward/reverse coordinates.

If as_array is set to true, return the AString object. This might be beneficial for large sequence chunks. If as_array is set to False, return a python string.

getRandomCoordinates(size)

returns coordinates for a random fragment of size #.

The coordinates are forward/reverse.

Default sampling mode:

Each residue has the same probability of being in a fragment. Thus, the fragment can be smaller than size due to contig boundaries.

class IndexedFasta.PysamIndexedFasta(dbname)

Bases: IndexedFasta.cgatIndexedFasta

interface a pysam/samtools indexed fasta file with the cgatIndexedFasta API.

getSequence(contig, strand='+', start=0, end=0, converter=None, as_array=False)

get a genomic fragment.

A genomic fragment is identified by the coordinates contig, strand, start, end.

The converter function supplied translated these coordinates into 0-based coordinates. By default, start and end are assumed to be pythonic coordinates and are forward/reverse coordinates.

If as_array is set to true, return the AString object. This might be beneficial for large sequence chunks. If as_array is set to False, return a python string.

IndexedFasta.IndexedFasta(dbname, *args, **kwargs)

factory function for IndexedFasta objects.

IndexedFasta.getConverter(format)

return a converter function for converting various coordinate schemes into 0-based, both strand, closed-open ranges.

converter functions have the parameters x, y, s, l: with x and y the coordinates of a sequence fragment, s the strand (True is positive) and l being the length of the contig.

Format is a “-” separated combination of the keywords “one”, “zero”, “forward”, “both”, “open”, “closed”:

zero/one: zero or one-based coordinates forward/both: forward coordinates or forward/reverse coordinates open/closed: half-open intervals (pythonic) or closed intervals

IndexedFasta.benchmarkRandomFragment(fasta, size)

returns a random fragment of size.

IndexedFasta.verify(fasta1, fasta2, num_iterations, fragment_size, stdout=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, quiet=False)

verify two databases.

Get segment from fasta1 and check for presence in fasta2.

IndexedFasta.splitFasta(infile, chunk_size, dir='/tmp', pattern=None)

split a fasta file into a subset of files.

If pattern is not given, random file names are chosen.

IndexedFasta.parseCoordinates(s)

parse a coordinate string.

The coordinate string can be various formats, such as chr1:+:10:1000, chr1:10..1000.

Returns

  • contig (string) – The chromosome/contig.

  • strand (char) – Strand. If not present, set to “+”.

  • start (int) – Start of interval

  • end (int) – End of interval. If not present, set to start + 1.