IndexedFasta.py - fast random access in fasta files¶
This module provides fast random access to fasta formatted files that have been previously indexed. The indexing can be done either through the samtools faidx tool (accessible through pysam) or using the in-house methods implemented in this module.
The main class is IndexedFasta
. This is a factory function
that provides transparent access to both samtools or cgat indexed
fasta files. The basic usage to retrieve the sequence spanning the
region chr12:10,000-10,100 is:
from IndexedFasta import IndexedFasta
fasta = IndexedFasta("hg19")
fasta.getSequence("chr12", "+", 10000, 10100)
To index a file, use the scripts/index_fasta
command line utility or the
createDatabase()
function:
> python index_fasta.py hg19 chr*.fa
This module has some useful utility functions:
splitFasta()
split a fasta formatted file into smaller pieces.
parseCoordinates()
parse a coordinate string in various formats
but otherwise the module contains a multitude of additional functions that are only of internal use.
Reference¶
-
IndexedFasta.
writeFragments
(outfile_fasta, outfile_index, fragments, mangler, size, write_all=False)¶ write mangled fragments to outfile_fasta in chunks of size updating outfile_index.
returns part of last fragment that has not been written and is less than size and the number of fragments output.
If write_all is True, all of the fragments are written to the file and the last file position is added to outfile_index as well.
-
class
IndexedFasta.
TranslatorPhred
(*args, **kwargs)¶ Bases:
IndexedFasta.Translator
translate phred quality scores.
-
class
IndexedFasta.
TranslatorSolexa
(*args, **kwargs)¶ Bases:
IndexedFasta.Translator
translate solexa quality scores.
-
class
IndexedFasta.
TranslatorRange200
(*args, **kwargs)¶ Bases:
IndexedFasta.Translator
translate pcap quality scores.
For example for PCAP scores.
These scores range from 0 to 100 and are the “a weighted sum of input base quality values (Huang and Madan 1999)
The numerical values from 0 to 200 are stored as values form 33 to 233 “
-
class
IndexedFasta.
TranslatorBytes
(*args, **kwargs)¶ Bases:
IndexedFasta.Translator
output binary values as bytes permitting values from 0 to 255
Note the resulting file will not be iterable as newline is not a record-separator any more.
-
IndexedFasta.
createDatabase
(db, iterator, force=False, synonyms=None, compression=None, random_access_points=None, regex_identifier=None, clean_sequence=False, ignore_duplicates=False, allow_duplicates=False, translator=None)¶ index files in filenames to create database.
Two new files are created - db.fasta and db_name.idx
If compression is enabled, provide random access points every # bytes.
Dictzip is treated as an uncompressed file.
regex_identifier: pattern to extract identifier from description line. If None, the part until the first white-space character is used.
translator: specify a translator
-
class
IndexedFasta.
cgatIndexedFasta
(dbname)¶ Bases:
object
an indexed fasta file.
-
setTranslator
(translator=None)¶ set the
Translator
to use.
-
getDatabaseName
()¶ returns the name of the database.
-
getToken
(contig)¶ check if token is in index.
-
getLength
(contig)¶ return sequence length for sbjct_token.
-
getLengths
()¶ return all sequence lengths.
-
compressIndex
()¶ compress index. Creates a database interface to an index.
-
getContigs
()¶ return a list of contigs (no synonyms).
-
getContigSizes
(with_synonyms=True)¶ return hash with contig sizes including synonyms.
-
setConverter
(converter)¶ set converter from coordinate system to 0-based, both strand, open/closed coordinate system.
-
getSequence
(contig, strand='+', start=0, end=0, converter=None, as_array=False)¶ get a genomic fragment.
A genomic fragment is identified by the coordinates contig, strand, start, end.
The converter function supplied translated these coordinates into 0-based coordinates. By default, start and end are assumed to be pythonic coordinates and are forward/reverse coordinates.
If as_array is set to true, return the AString object. This might be beneficial for large sequence chunks. If as_array is set to False, return a python string.
-
getRandomCoordinates
(size)¶ returns coordinates for a random fragment of size #.
The coordinates are forward/reverse.
Default sampling mode:
Each residue has the same probability of being in a fragment. Thus, the fragment can be smaller than size due to contig boundaries.
-
-
class
IndexedFasta.
PysamIndexedFasta
(dbname)¶ Bases:
IndexedFasta.cgatIndexedFasta
interface a pysam/samtools indexed fasta file with the cgatIndexedFasta API.
-
getSequence
(contig, strand='+', start=0, end=0, converter=None, as_array=False)¶ get a genomic fragment.
A genomic fragment is identified by the coordinates contig, strand, start, end.
The converter function supplied translated these coordinates into 0-based coordinates. By default, start and end are assumed to be pythonic coordinates and are forward/reverse coordinates.
If as_array is set to true, return the AString object. This might be beneficial for large sequence chunks. If as_array is set to False, return a python string.
-
-
IndexedFasta.
IndexedFasta
(dbname, *args, **kwargs)¶ factory function for IndexedFasta objects.
-
IndexedFasta.
getConverter
(format)¶ return a converter function for converting various coordinate schemes into 0-based, both strand, closed-open ranges.
converter functions have the parameters x, y, s, l: with x and y the coordinates of a sequence fragment, s the strand (True is positive) and l being the length of the contig.
Format is a “-” separated combination of the keywords “one”, “zero”, “forward”, “both”, “open”, “closed”:
zero/one: zero or one-based coordinates forward/both: forward coordinates or forward/reverse coordinates open/closed: half-open intervals (pythonic) or closed intervals
-
IndexedFasta.
benchmarkRandomFragment
(fasta, size)¶ returns a random fragment of size.
-
IndexedFasta.
verify
(fasta1, fasta2, num_iterations, fragment_size, stdout=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, quiet=False)¶ verify two databases.
Get segment from fasta1 and check for presence in fasta2.
-
IndexedFasta.
splitFasta
(infile, chunk_size, dir='/tmp', pattern=None)¶ split a fasta file into a subset of files.
If pattern is not given, random file names are chosen.
-
IndexedFasta.
parseCoordinates
(s)¶ parse a coordinate string.
The coordinate string can be various formats, such as
chr1:+:10:1000
,chr1:10..1000
.- Returns
contig (string) – The chromosome/contig.
strand (char) – Strand. If not present, set to “+”.
start (int) – Start of interval
end (int) – End of interval. If not present, set to start + 1.