GTF.py - Classes and methods for dealing with GTF/GFF formatted files
Coordinates are kept internally as Python coordinates (0-based, half-open: the start is inclusive and the end is exclusive), but are output as inclusive 1-based coordinates according to http://www.sanger.ac.uk/Software/formats/GFF/.
The default GTF version is 2.2.
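As a rough illustration of the coordinate convention described above, the following sketch converts between a 0-based, half-open interval and the 1-based inclusive form written to GTF/GFF files. The helper names `to_gtf_coords` and `from_gtf_coords` are hypothetical and are not part of this module.

```python
def to_gtf_coords(start, end):
    """Convert a 0-based, half-open interval to 1-based inclusive."""
    # Only the start shifts: the exclusive 0-based end equals
    # the inclusive 1-based end.
    return start + 1, end


def from_gtf_coords(start, end):
    """Convert a 1-based inclusive interval to 0-based, half-open."""
    return start - 1, end


# The first ten bases of a contig are [0, 10) internally and 1..10 in a file.
print(to_gtf_coords(0, 10))
print(from_gtf_coords(1, 10))
```

In both representations the interval length is the same; internally it is simply `end - start`.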
This module uses pysam to provide the principal engine for iterating over files (iterate()). As a consequence, the returned objects are of type pysam.GTFProxy().
The class defined in this module, Entry, is useful for re-formatting records.
Apart from basic iteration, this module provides the following utilities:
Additional iterators for grouping/modifying GTF formatted files: track_iterator(), chunk_iterator(), iterator_contigs(), transcript_iterator(), joined_iterator(), gene_iterator(), flat_gene_iterator(), merged_gene_iterator(), iterator_filtered(), iterator_sorted_chunks(), iterator_min_feature_length(), iterator_sorted(), iterator_overlapping_genes(), iterator_transcripts2genes(), iterator_overlaps()
Compare intervals: Identity(), HalfIdentity(), Overlap()
Read GTF formatted files and optionally index them: readFromFile(), readAsIntervals(), readAndIndex()
Manipulate lists of GTF records: asRanges(), CombineOverlaps(), SortPerContig(), toIntronIntervals(), toSequence()
- GTF.iterator(infile)
return a simple iterator over all entries in a file.
- GTF.track_iterator(infile)
a simple iterator over all entries in a file.
- GTF.chunk_iterator(gff_iterator)
iterate over the contents of a gff file.
return entries as single-element lists.
- GTF.iterator_contigs(gffs)
iterate over contigs.
TODO: implement as coroutines
- GTF.transcript_iterator(gff_iterator, strict=True)
iterate over the contents of a gtf file.
return a list of entries with the same transcript id.
Any features without a transcript_id will be ignored.
The entries for the same transcript have to be consecutive in the file. If strict is set, an AssertionError will be raised if that is not true.
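The grouping behaviour described above can be sketched with `itertools.groupby`. This is a minimal illustration and not the module's implementation: records are assumed to be plain dictionaries with a `transcript_id` key, and `group_by_transcript` is a hypothetical name.

```python
from itertools import groupby


def group_by_transcript(records, strict=True):
    """Yield lists of consecutive records sharing a transcript_id.

    `records` is assumed to be an iterable of dicts; this is an
    illustrative stand-in for GTF entries, not the module's own type.
    """
    seen = set()
    # Features without a transcript_id are skipped.
    with_id = (r for r in records if r.get("transcript_id"))
    for transcript_id, group in groupby(with_id, key=lambda r: r["transcript_id"]):
        if strict:
            # A repeated id means the transcript's entries were not consecutive.
            assert transcript_id not in seen, (
                "entries for %s are not consecutive" % transcript_id)
        seen.add(transcript_id)
        yield list(group)


records = [
    {"transcript_id": "t1", "start": 0, "end": 10},
    {"transcript_id": "t1", "start": 20, "end": 30},
    {"transcript_id": None, "start": 35, "end": 38},
    {"transcript_id": "t2", "start": 40, "end": 50},
]
chunks = list(group_by_transcript(records))
print([len(chunk) for chunk in chunks])
```

Because the check only remembers ids that have already been completed, it works in a single pass without loading the whole file.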
- GTF.joined_iterator(gff_iterator, group_field=None)
iterate over the contents of a gff file.
return a list of entries with the same group id. Note: the entries have to be consecutive in the file.
- GTF.gene_iterator(gff_iterator, strict=True)
iterate over the contents of a gtf file.
return a list of transcripts with the same gene id.
Note: the entries have to be consecutive in the file, i.e., first sorted by transcript and then by gene id.
Genes with the same name on different contigs are resolved separately when strict = False.
- GTF.flat_gene_iterator(gff_iterator, strict=True)
iterate over the contents of a gtf file.
return a list of entries with the same gene id.
Note: the entries have to be consecutive in the file, i.e., sorted by gene_id.
Genes with the same name on different contigs are resolved separately when strict = False.
- GTF.merged_gene_iterator(gff_iterator)
iterate over the contents of a gtf file.
Each gene is merged into a single entry spanning the whole stretch that a gene covers.
Note: the entries have to be consecutive in the file, i.e., sorted by gene_id.
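To make the idea of merging a gene into one spanning entry concrete, here is a small sketch under stated assumptions: records are plain dictionaries sorted by `gene_id`, and `merge_genes` is a hypothetical helper rather than the function documented above.

```python
from itertools import groupby


def merge_genes(records):
    """Yield one spanning record per gene from gene_id-sorted records."""
    for gene_id, group in groupby(records, key=lambda r: r["gene_id"]):
        entries = list(group)
        yield {
            "gene_id": gene_id,
            # Contig and strand are taken from the first entry of the gene.
            "contig": entries[0]["contig"],
            "strand": entries[0]["strand"],
            # The merged entry spans from the leftmost start to the rightmost end.
            "start": min(e["start"] for e in entries),
            "end": max(e["end"] for e in entries),
        }


exons = [
    {"gene_id": "g1", "contig": "chr1", "strand": "+", "start": 100, "end": 200},
    {"gene_id": "g1", "contig": "chr1", "strand": "+", "start": 300, "end": 450},
    {"gene_id": "g2", "contig": "chr1", "strand": "-", "start": 900, "end": 950},
]
merged = list(merge_genes(exons))
print([(m["gene_id"], m["start"], m["end"]) for m in merged])
```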
- GTF.iterator_filtered(gff_iterator, feature=None, source=None, contig=None, interval=None, strand=None)
iterate over the contents of a gff file.
yield only entries for a given feature
- GTF.iterator_sorted_chunks(gff_iterator, sort_by='contig-start')
iterate over chunks in a sorted order
sort_by can be
- contig-start: sort by position ignoring the strand
- contig-strand-start: sort by position taking the strand into account
- contig-strand-start-end: intervals with the same start position will be sorted by end position
returns the chunks.
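The three sort orders listed above can be expressed as sort keys. The sketch below assumes that each chunk is a list of dictionaries and that a chunk is ordered by its first entry; both are assumptions made for illustration, and `sorted_chunks` and `SORT_KEYS` are hypothetical names.

```python
# Each key function looks only at the first record of a chunk (an assumption).
SORT_KEYS = {
    "contig-start": lambda c: (c[0]["contig"], c[0]["start"]),
    "contig-strand-start": lambda c: (c[0]["contig"], c[0]["strand"], c[0]["start"]),
    "contig-strand-start-end": lambda c: (
        c[0]["contig"], c[0]["strand"], c[0]["start"], c[0]["end"]),
}


def sorted_chunks(chunks, sort_by="contig-start"):
    """Return the chunks as a new list sorted by the requested order."""
    return sorted(chunks, key=SORT_KEYS[sort_by])


chunks = [
    [{"contig": "chr1", "strand": "-", "start": 50, "end": 80}],
    [{"contig": "chr1", "strand": "+", "start": 100, "end": 200}],
    [{"contig": "chr1", "strand": "+", "start": 50, "end": 90}],
]
by_position = sorted_chunks(chunks, "contig-start")
by_strand = sorted_chunks(chunks, "contig-strand-start")
print([c[0]["start"] for c in by_position])
print([(c[0]["strand"], c[0]["start"]) for c in by_strand])
```

Sorting requires all chunks to be held in memory at once, which is worth keeping in mind for large files.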
- GTF.iterator_min_feature_length(gff_iterator, min_length, feature='exon')
select only those genes with a minimum length of a given feature.
- GTF.iterator_sorted(gff_iterator, sort_order='gene')
sort input and yield sorted output.
- GTF.iterator_overlapping_genes(gtf_iterator, min_overlap=0)
return overlapping genes.
- GTF.iterator_transcripts2genes(gtf_iterator, min_overlap=0)
cluster transcripts by exon overlap.
The gene id is set to the first transcript encountered of a gene. If a gene stretches over several contigs, a number is appended to subsequent copies.
- GTF.iterator_overlaps(gff_iterator, min_overlap=0)
iterate over gff file and return a list of features that are overlapping.
The input should be sorted by contig and start.
- GTF.Overlap(entry1, entry2, min_overlap=0)
returns true, if entry1 and entry2 overlap by a minimum number of residues.
- GTF.Identity(entry1, entry2, max_slippage=0)
returns true, if entry1 and entry2 are (almost) identical, allowing a small amount of slippage at either end.
- GTF.HalfIdentity(entry1, entry2, max_slippage=0)
returns true, if entry1 and entry2 overlap and at least one end is within max_slippage residues.
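The three comparisons above can be sketched on 0-based, half-open intervals as follows. This is one possible reading of the descriptions, not the module's code: the `Interval` type and the lower-case function names are hypothetical, and requiring the same contig and strand, as well as the exact boundary handling (`>=` and `<=`), are assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in for a GTF entry.
Interval = namedtuple("Interval", ["contig", "strand", "start", "end"])


def _same_location(a, b):
    # Assumption: entries are only comparable on the same contig and strand.
    return a.contig == b.contig and a.strand == b.strand


def overlap(a, b, min_overlap=0):
    """True if a and b share at least one and at least min_overlap residues."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    return _same_location(a, b) and shared > 0 and shared >= min_overlap


def identity(a, b, max_slippage=0):
    """True if both ends agree to within max_slippage residues."""
    return (_same_location(a, b)
            and abs(a.start - b.start) <= max_slippage
            and abs(a.end - b.end) <= max_slippage)


def half_identity(a, b, max_slippage=0):
    """True if a and b overlap and at least one end agrees within max_slippage."""
    return overlap(a, b) and (abs(a.start - b.start) <= max_slippage
                              or abs(a.end - b.end) <= max_slippage)


a = Interval("chr1", "+", 100, 200)
b = Interval("chr1", "+", 150, 250)
c = Interval("chr1", "+", 102, 199)
print(overlap(a, b), identity(a, c), identity(a, c, max_slippage=2))
```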
- GTF.asRanges(gffs, feature=None)
return ranges within a set of gffs.
Overlapping intervals are merged.
The returned intervals are sorted.
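Merging overlapping intervals into a sorted list of ranges is commonly done with a single sweep over the sorted input. The sketch below shows that general technique on `(start, end)` tuples; `merge_intervals` is a hypothetical name, and treating directly adjacent intervals as mergeable is a choice made here, not something the description above specifies.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) tuples and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_intervals([(40, 50), (10, 20), (15, 30)]))
```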
- GTF.CombineOverlaps(old_gff, method='combine')
combine overlapping entries for a list of gffs.
method can be any of combine|longest|shortest; only the first letter is important.
- GTF.SortPerContig(gff)
sort gff entries per contig and return a dictionary mapping a contig to the beginning of the list.
- GTF.toIntronIntervals(chunk)
convert a set of gtf elements within a transcript to intron coordinates.
Will use the first transcript_id found.
Note that coordinates will still be forward strand coordinates.
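Intron coordinates are the gaps between consecutive exons of a transcript. The following sketch derives them from 0-based, half-open exon intervals in forward strand coordinates; `intron_intervals` is a hypothetical helper and works on plain tuples rather than GTF entries.

```python
def intron_intervals(exons):
    """Return the gaps between sorted (start, end) exon intervals."""
    exons = sorted(exons)
    if not exons:
        return []
    introns = []
    last_end = exons[0][1]
    for start, end in exons[1:]:
        if start > last_end:
            # The intron runs from the end of one exon to the start of the next.
            introns.append((last_end, start))
        last_end = max(last_end, end)
    return introns


print(intron_intervals([(500, 600), (100, 200), (300, 400)]))
```

The result is the same regardless of strand, because the intervals are kept in forward strand coordinates.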
- GTF.toSequence(chunk, fasta)
convert a list of gff attributes to a single sequence.
This function ensures correct in-order concatenation on positive/negative strand. Overlapping regions are merged.
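The strand-aware concatenation described above can be illustrated with a plain string standing in for the contig sequence. This is a sketch under assumptions: the real function takes an indexed fasta object, whereas the hypothetical `exons_to_sequence` below takes the contig sequence directly and treats the negative strand by reverse-complementing the concatenated forward sequence.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def exons_to_sequence(exons, contig_sequence, strand="+"):
    """Concatenate exon sequences in transcript order.

    `exons` are 0-based, half-open (start, end) tuples on the forward strand.
    """
    # Merge overlapping regions so that no base is emitted twice.
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    sequence = "".join(contig_sequence[s:e] for s, e in merged)
    if strand == "-":
        # Reverse complement for features on the negative strand.
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence


contig = "ACGTACGTAC"
print(exons_to_sequence([(6, 9), (0, 3)], contig))
print(exons_to_sequence([(6, 9), (0, 3)], contig, strand="-"))
```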
- GTF.readFromFile(infile)
read records from file and return as list.
- GTF.readAsIntervals(gff_iterator, with_values=False, with_records=False, merge_genes=False, with_gene_id=False, with_transcript_id=False, use_strand=False)
read tuples of (start, end) from a GTF file.
This method ignores everything else but the coordinates.
The with_values, with_gene_id and with_records options are mutually exclusive.
- Parameters
gff_iterator (iterator) – Iterator yielding GTF records.
with_values – If True, the content of the score field is added to the tuples.
with_records – If True, the entire record is added to the tuples.
merge_genes – If True, the GTF records are passed through the merged_gene_iterator() iterator first.
with_gene_id – If True, the gene_id is added to the tuples.
with_transcript_id – If True, the transcript_ids are added to the tuples.
use_strand – If True, intervals will be grouped by contig and strand. The default is to group by contig only.
- Returns
a dictionary of intervals by contig.
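The shape of the returned dictionary can be sketched as follows. This is an illustration only: records are assumed to be plain dictionaries, `intervals_by_contig` is a hypothetical name, only two of the options are modelled, and using a `(contig, strand)` tuple as the key when grouping by strand is an assumption rather than documented behaviour.

```python
from collections import defaultdict


def intervals_by_contig(records, with_gene_id=False, use_strand=False):
    """Collect (start, end[, gene_id]) tuples per contig."""
    result = defaultdict(list)
    for record in records:
        # Assumption: strand-aware grouping uses a (contig, strand) key.
        if use_strand:
            key = (record["contig"], record["strand"])
        else:
            key = record["contig"]
        if with_gene_id:
            result[key].append((record["start"], record["end"], record["gene_id"]))
        else:
            result[key].append((record["start"], record["end"]))
    return dict(result)


records = [
    {"contig": "chr1", "strand": "+", "start": 0, "end": 10, "gene_id": "g1"},
    {"contig": "chr1", "strand": "-", "start": 20, "end": 30, "gene_id": "g2"},
    {"contig": "chr2", "strand": "+", "start": 5, "end": 15, "gene_id": "g3"},
]
print(intervals_by_contig(records))
```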
- GTF.readAndIndex(iterator, with_value=True)
read from gtf stream and index.
- Returns
index: an object of type IndexedGenome.IndexedGenome
- exception GTF.Error
Bases: Exception
Base class for exceptions in this module.
- exception GTF.ParsingError(message)
Bases: GTF.Error
Exception raised for errors in the input.
message -- explanation of the error
GTF.
toDot
(v)¶ convert value to ‘.’ if None
-
GTF.
quote
(v)¶ return a quoted attribute.
- class GTF.Entry
Bases: object
representation of a GTF formatted entry.
- contig (string): Chromosome/contig
- source (string): The GTF source field
- feature (string): The GTF feature field
- frame (string): The frame
- strand (string): Strand of feature
- read(line)
read gff entry from line in GTF/GFF format.
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
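To show how a line in the column layout above maps onto fields, here is a simplified parser. It is a sketch and not the Entry.read() implementation: `parse_gtf_line` is a hypothetical function, trailing comments are not handled, and attribute values containing a semicolon would be split incorrectly.

```python
def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    contig, source, feature, start, end, score, strand, frame = fields[:8]

    attributes = {}
    if len(fields) > 8:
        # Attributes look like: gene_id "g1"; transcript_id "t1";
        for item in fields[8].split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition(" ")
            attributes[key] = value.strip().strip('"')

    return {
        "contig": contig,
        "source": source,
        "feature": feature,
        # File coordinates are 1-based inclusive; store them 0-based, half-open.
        "start": int(start) - 1,
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


line = 'chr1\tprotein_coding\texon\t11\t20\t.\t+\t.\tgene_id "g1"; transcript_id "t1";'
entry = parse_gtf_line(line)
print(entry["start"], entry["end"], entry["attributes"])
```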
- parseInfo(attributes, line)
parse attributes.
This method will set the gene_id and transcript_id attributes if present.
- invert(lcontig)
invert genomic coordinates from forward to reverse coordinates and back.
- Parameters
lcontig (int) – Length of the chromosome that the feature resides on.
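For 0-based, half-open intervals, switching between forward and reverse coordinates amounts to mirroring both ends around the contig length. The sketch below illustrates that arithmetic with a hypothetical stand-alone function `invert_interval`; unlike the method above, it works on plain numbers instead of modifying an entry.

```python
def invert_interval(start, end, lcontig):
    """Mirror a 0-based, half-open interval on a contig of length lcontig."""
    # The old end becomes the new start, so the interval keeps its length.
    return lcontig - end, lcontig - start


forward = (10, 20)
reverse = invert_interval(*forward, lcontig=100)
print(reverse)
# Applying the inversion twice returns the original coordinates.
print(invert_interval(*reverse, lcontig=100))
```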
- fromGTF(other, gene_id=None, transcript_id=None)
fill record from other GFF/GTF entry.
The optional attributes are not copied.
- fromBed(other, **kwargs)
fill record from a bed entry.
- asDict()
return attributes as a dictionary.
- hasOverlap(other, min_overlap=0)
returns true, if overlap with other entry.
- isIdentical(other, max_slippage=0)
returns true, if self and other overlap completely.
- isHalfIdentical(other, max_slippage=0)
returns true, if self and other overlap.