GTF.py - Classes and methods for dealing with GTF/GFF formatted files

The coordinates are kept internally in Python coordinates (0-based, half-open), but are output as inclusive 1-based coordinates according to http://www.sanger.ac.uk/Software/formats/GFF/.
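The conversion between the two coordinate systems can be sketched as follows. The helper names `to_gff_coordinates` and `from_gff_coordinates` are hypothetical and not part of this module; they only illustrate the arithmetic.

```python
def to_gff_coordinates(start, end):
    """Convert a 0-based, half-open interval to 1-based, inclusive."""
    # Only the start shifts: the half-open end already equals the
    # 1-based inclusive end.
    return start + 1, end


def from_gff_coordinates(start, end):
    """Convert a 1-based, inclusive interval to 0-based, half-open."""
    return start - 1, end


# The first ten bases of a contig are [0, 10) internally and 1..10 on output.
print(to_gff_coordinates(0, 10))
print(from_gff_coordinates(1, 10))
```

In both systems the interval length is preserved: `end - start` internally, `end - start + 1` in the file.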

The default GTF version is 2.2.

This module uses pysam to provide the principal engine for iterating over files (iterate()). As a consequence, the returned objects are of type pysam.GTFProxy().

The Entry class defined in this module is useful for re-formatting records.

Apart from basic iteration, this module provides the following utilities:

GTF.iterator(infile)

return a simple iterator over all entries in a file.

GTF.track_iterator(infile)

a simple iterator over all entries in a file.

GTF.chunk_iterator(gff_iterator)

iterate over the contents of a gff file.

return entries as single element lists

GTF.iterator_contigs(gffs)

iterate over contigs.

TODO: implement as coroutines

GTF.transcript_iterator(gff_iterator, strict=True)

iterate over the contents of a gtf file.

return a list of entries with the same transcript id.

Any features without a transcript_id will be ignored.

The entries for the same transcript have to be consecutive in the file. If strict is set an AssertionError will be raised if that is not true.
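The grouping behaviour described above can be sketched with `itertools.groupby`. This is not the module's implementation; `Record` and `group_by_transcript` are hypothetical names, and the check for non-consecutive transcripts is an assumption about how strict mode might work.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a GTF record; only the fields used here.
Record = namedtuple("Record", ["feature", "start", "end", "transcript_id"])


def group_by_transcript(records, strict=True):
    """Yield lists of consecutive records sharing a transcript_id."""
    seen = set()
    # Records without a transcript_id are skipped.
    with_id = (r for r in records if r.transcript_id is not None)
    for transcript_id, group in groupby(with_id, key=lambda r: r.transcript_id):
        if strict:
            # A repeated id means the transcript was split across the input.
            assert transcript_id not in seen, (
                "transcript %s is not consecutive" % transcript_id)
        seen.add(transcript_id)
        yield list(group)


records = [
    Record("exon", 0, 100, "T1"),
    Record("exon", 200, 300, "T1"),
    Record("gene", 0, 900, None),
    Record("exon", 500, 600, "T2"),
]
for chunk in group_by_transcript(records):
    print(chunk[0].transcript_id, len(chunk))
```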

GTF.joined_iterator(gff_iterator, group_field=None)

iterate over the contents of a gff file.

return a list of entries with the same group id. Note: the entries have to be consecutive in the file.

GTF.gene_iterator(gff_iterator, strict=True)

iterate over the contents of a gtf file.

return a list of transcripts with the same gene id.

Note: the entries have to be consecutive in the file, i.e., sorted first by transcript and then by gene id.

Genes with the same name on different contigs are resolved separately if strict is False.

GTF.flat_gene_iterator(gff_iterator, strict=True)

iterate over the contents of a gtf file.

return a list of entries with the same gene id.

Note: the entries have to be consecutive in the file, i.e., sorted by gene_id.

Genes with the same name on different contigs are resolved separately if strict is False.

GTF.merged_gene_iterator(gff_iterator)

iterate over the contents of a gtf file.

Each gene is merged into a single entry spanning the whole stretch that a gene covers.

Note: the entries have to be consecutive in the file, i.e., sorted by gene_id.
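Merging a gene into a single spanning entry amounts to taking the smallest start and the largest end of its records. A minimal sketch, assuming all records belong to one gene on one contig and strand; `Record` and `merge_gene` are hypothetical names, not part of this module.

```python
from collections import namedtuple

# Hypothetical stand-in for a GTF record; only the fields used here.
Record = namedtuple("Record", ["contig", "strand", "start", "end", "gene_id"])


def merge_gene(records):
    """Return one record spanning all records of a single gene."""
    first = records[0]
    return first._replace(
        start=min(r.start for r in records),
        end=max(r.end for r in records),
    )


gene = [
    Record("chr1", "+", 300, 400, "G1"),
    Record("chr1", "+", 100, 200, "G1"),
    Record("chr1", "+", 700, 900, "G1"),
]
print(merge_gene(gene))
```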

GTF.iterator_filtered(gff_iterator, feature=None, source=None, contig=None, interval=None, strand=None)

iterate over the contents of a gff file.

yield only entries for a given feature

GTF.iterator_sorted_chunks(gff_iterator, sort_by='contig-start')

iterate over chunks in a sorted order

sort_by can be

contig-start

sort by position ignoring the strand

contig-strand-start

sort by position taking the strand into account

contig-strand-start-end

intervals with the same start position will be sorted by end position

returns the chunks.
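The three sort orders can be pictured as different sort keys. The sketch below is not the module's code: it assumes that a chunk is a non-empty list of records on one contig and strand, and that a chunk is positioned by its smallest start and largest end. `Record`, `chunk_key` and `sorted_chunks` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for a GTF record; only the fields used here.
Record = namedtuple("Record", ["contig", "strand", "start", "end"])


def chunk_key(chunk, sort_by):
    """Build the sort key of a chunk for the given sort order."""
    first = chunk[0]
    start = min(r.start for r in chunk)
    end = max(r.end for r in chunk)
    if sort_by == "contig-start":
        return (first.contig, start)
    if sort_by == "contig-strand-start":
        return (first.contig, first.strand, start)
    if sort_by == "contig-strand-start-end":
        return (first.contig, first.strand, start, end)
    raise ValueError("unknown sort order: %s" % sort_by)


def sorted_chunks(chunks, sort_by="contig-start"):
    """Return the chunks ordered by the selected key."""
    return sorted(chunks, key=lambda chunk: chunk_key(chunk, sort_by))


a = [Record("chr1", "-", 500, 600)]
b = [Record("chr1", "+", 800, 900)]
c = [Record("chr1", "+", 100, 200), Record("chr1", "+", 300, 400)]
print(sorted_chunks([a, b, c]) == [c, a, b])
```

With a strand-aware order, chunks are grouped by strand before position, so the relative order of `a` and `b` changes.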

GTF.iterator_min_feature_length(gff_iterator, min_length, feature='exon')

select only those genes with a minimum length of a given feature.

GTF.iterator_sorted(gff_iterator, sort_order='gene')

sort input and yield sorted output.

GTF.iterator_overlapping_genes(gtf_iterator, min_overlap=0)

return overlapping genes.

GTF.iterator_transcripts2genes(gtf_iterator, min_overlap=0)

cluster transcripts by exon overlap.

The gene id is set to the id of the first transcript encountered for a gene. If a gene stretches over several contigs, a number is appended to the ids of subsequent copies.

GTF.iterator_overlaps(gff_iterator, min_overlap=0)

iterate over gff file and return a list of features that are overlapping.

The input should be sorted by contig,start

GTF.Overlap(entry1, entry2, min_overlap=0)

returns true, if entry1 and entry2 overlap by a minimum number of residues.

GTF.Identity(entry1, entry2, max_slippage=0)

returns true, if entry1 and entry2 are (almost) identical, allowing a small amount of slippage at either end.

GTF.HalfIdentity(entry1, entry2, max_slippage=0)

returns true, if entry1 and entry2 overlap and at least one end is within max_slippage residues.
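The three comparisons above can be sketched on plain intervals. The exact rules used by the module are not stated here, so this sketch makes assumptions: entries must share contig and strand, overlap must exceed min_overlap, and slippage is compared inclusively. All names are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for a GTF record; only the fields used here.
Record = namedtuple("Record", ["contig", "strand", "start", "end"])


def same_location(a, b):
    """True if both records lie on the same contig and strand."""
    return a.contig == b.contig and a.strand == b.strand


def overlap(a, b, min_overlap=0):
    """True if a and b share more than min_overlap bases."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    return same_location(a, b) and shared > min_overlap


def identity(a, b, max_slippage=0):
    """True if both ends agree to within max_slippage bases."""
    return (same_location(a, b)
            and abs(a.start - b.start) <= max_slippage
            and abs(a.end - b.end) <= max_slippage)


def half_identity(a, b, max_slippage=0):
    """True if a and b overlap and at least one end agrees."""
    return (overlap(a, b)
            and (abs(a.start - b.start) <= max_slippage
                 or abs(a.end - b.end) <= max_slippage))


a = Record("chr1", "+", 100, 200)
b = Record("chr1", "+", 150, 250)
print(overlap(a, b), identity(a, b), half_identity(a, b))
```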

GTF.asRanges(gffs, feature=None)

return ranges within a set of gffs.

Overlapping intervals are merged.

The returned intervals are sorted.
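Merging overlapping intervals into sorted ranges is a standard sweep over the sorted input. A minimal sketch on (start, end) tuples in 0-based, half-open coordinates; `merge_intervals` is a hypothetical name, and treating directly adjacent intervals as mergeable is a choice of this sketch.

```python
def merge_intervals(intervals):
    """Return sorted, non-overlapping intervals covering the input."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_intervals([(40, 50), (10, 20), (15, 30)]))
```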

GTF.CombineOverlaps(old_gff, method='combine')

combine overlapping entries for a list of gffs.

method can be any of combine|longest|shortest; only the first letter is significant.

GTF.SortPerContig(gff)

sort gff entries per contig and return a dictionary mapping each contig to the position in the list where its entries begin.

GTF.toIntronIntervals(chunk)

convert a set of gtf elements within a transcript to intron coordinates.

Will use first transcript_id found.

Note that coordinates will still be forward strand coordinates
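Intron coordinates are the gaps between consecutive exons of a transcript. A minimal sketch on (start, end) tuples in forward-strand, 0-based, half-open coordinates; `exons_to_introns` is a hypothetical name and not part of this module.

```python
def exons_to_introns(exons):
    """Return the gaps between exons as (start, end) tuples."""
    introns = []
    last_end = None
    for start, end in sorted(exons):
        if last_end is not None and start > last_end:
            # The gap between the previous exon end and this exon start.
            introns.append((last_end, start))
        # Track the furthest end so nested exons do not create gaps.
        last_end = end if last_end is None else max(last_end, end)
    return introns


print(exons_to_introns([(500, 600), (100, 200), (300, 400)]))
```

The result is in forward-strand order regardless of the transcript's strand, matching the note above.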

GTF.toSequence(chunk, fasta)

convert a list of gff attributes to a single sequence.

This function ensures correct in-order concatenation on positive/negative strand. Overlapping regions are merged.
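The strand-aware concatenation can be sketched without a fasta file. Here the contig sequence is passed as a plain string, which is an assumption of this sketch; the fasta object expected by the module is not described here. `intervals_to_sequence` is a hypothetical name.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def intervals_to_sequence(intervals, contig_sequence, strand):
    """Concatenate interval sequences in transcript order."""
    # Merge overlapping intervals so no base is emitted twice.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    sequence = "".join(contig_sequence[s:e] for s, e in merged)
    if strand == "-":
        # Reverse-complement for features on the negative strand.
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence


contig = "AAACCCGGGTTT"
exons = [(6, 9), (0, 3), (1, 3)]
print(intervals_to_sequence(exons, contig, "+"))
print(intervals_to_sequence(exons, contig, "-"))
```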

GTF.readFromFile(infile)

read records from file and return as list.

GTF.readAsIntervals(gff_iterator, with_values=False, with_records=False, merge_genes=False, with_gene_id=False, with_transcript_id=False, use_strand=False)

read tuples of (start, end) from a GTF file.

This method ignores everything else but the coordinates.

The with_values, with_gene_id and with_records options are exclusive.

Parameters
  • gff_iterator (iterator) – Iterator yielding GTF records.

  • with_values – If True, the content of the score field is added to the tuples.

  • with_records – If True, the entire record is added to the tuples.

  • merge_genes – If True, the GTF records are first passed through merged_gene_iterator().

  • with_gene_id – If True, the gene_id is added to the tuples.

  • with_transcript_id – If True, the transcript_ids are added to the tuples.

  • use_strand – If True, intervals will be grouped by contig and strand. The default is to group by contig only.

Returns
  a dictionary of intervals by contig.
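The shape of the result can be illustrated with a small sketch that collects (start, end) tuples per contig. It is not the module's implementation: `Record` and `intervals_by_contig` are hypothetical names, and using a (contig, strand) tuple as the key when grouping by strand is an assumption of this sketch.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a GTF record; only the fields used here.
Record = namedtuple("Record", ["contig", "strand", "start", "end"])


def intervals_by_contig(records, use_strand=False):
    """Collect (start, end) tuples, grouped by contig or contig and strand."""
    result = defaultdict(list)
    for record in records:
        key = (record.contig, record.strand) if use_strand else record.contig
        result[key].append((record.start, record.end))
    return dict(result)


records = [
    Record("chr1", "+", 0, 10),
    Record("chr1", "-", 20, 30),
    Record("chr2", "+", 5, 15),
]
print(intervals_by_contig(records))
print(intervals_by_contig(records, use_strand=True))
```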

GTF.readAndIndex(iterator, with_value=True)

read from gtf stream and index.

Returns

an object of type IndexedGenome.IndexedGenome

Return type

index

exception GTF.Error

Bases: Exception

Base class for exceptions in this module.

exception GTF.ParsingError(message)

Bases: GTF.Error

Exception raised for errors in the input.

message -- explanation of the error

GTF.toDot(v)

convert value to ‘.’ if None

GTF.quote(v)

return a quoted attribute.
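These two helpers format values for output. A minimal sketch of what they might look like, assuming that only string values are wrapped in double quotes and other values are written as plain text; `to_dot` and `quote_value` are hypothetical names standing in for the module's functions.

```python
def to_dot(value):
    """Return '.' for a missing value, otherwise its string form."""
    return "." if value is None else str(value)


def quote_value(value):
    """Wrap string values in double quotes; leave other values unquoted."""
    if isinstance(value, str):
        return '"%s"' % value
    return str(value)


print(to_dot(None), to_dot(0.5))
print(quote_value("ENSG0001"), quote_value(3))
```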

class GTF.Entry

Bases: object

representation of a GTF formatted entry.

contig

Chromosome/contig

Type

string

source

The GTF source field

Type

string

feature

The GTF feature field

Type

string

frame

The frame

Type

string

start

Start coordinate (0-based, half-open)

Type

int

end

End coordinate (0-based, half-open)

Type

int

score

Score associated with feature

Type

float

strand

Strand of feature

Type

string

gene_id

Gene identifier of feature. Not present for GFF formatted data.

Type

string

transcript_id

Transcript identifier of feature. Not present for GFF formatted data.

Type

string

attributes

Dictionary of additional attributes in the GFF/GTF record (last column)

Type

dict

read(line)

read gff entry from line in GTF/GFF format.

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

parseInfo(attributes, line)

parse attributes.

This method will set the gene_id and transcript_id attributes if present.
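Parsing a line in the layout shown above can be sketched as follows. This is a simplified illustration, not the method's implementation: it assumes tab-separated columns, ignores trailing comments, and does not handle semicolons inside quoted attribute values. `parse_attributes` and `parse_gtf_line` are hypothetical names.

```python
def parse_attributes(text):
    """Parse 'key "value"; key "value";' into a dictionary."""
    attributes = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line):
    """Parse one GTF line into a dictionary with 0-based coordinates."""
    fields = line.rstrip("\n").split("\t")
    contig, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = parse_attributes(fields[8]) if len(fields) > 8 else {}
    return {
        "contig": contig,
        "source": source,
        "feature": feature,
        # File coordinates are 1-based inclusive; store 0-based half-open.
        "start": int(start) - 1,
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": frame,
        "gene_id": attributes.pop("gene_id", None),
        "transcript_id": attributes.pop("transcript_id", None),
        "attributes": attributes,
    }


line = ('chr1\tprotein_coding\texon\t11\t20\t.\t+\t.\t'
        'gene_id "G1"; transcript_id "T1"; exon_number "2";')
entry = parse_gtf_line(line)
print(entry["start"], entry["end"], entry["gene_id"], entry["attributes"])
```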

invert(lcontig)

invert genomic coordinates from forward to reverse coordinates and back.

Parameters

lcontig (int) – Length of the chromosome that the feature resides on.
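Inverting a 0-based, half-open interval mirrors it around the contig: the new start is the contig length minus the old end, and the new end is the contig length minus the old start. A minimal sketch; `invert_interval` is a hypothetical name, and whether the method restricts the operation to particular strands is not stated here.

```python
def invert_interval(start, end, lcontig):
    """Mirror a 0-based, half-open interval on a contig of length lcontig."""
    return lcontig - end, lcontig - start


# An interval 10 bases from the left end lands 10 bases from the right end.
print(invert_interval(10, 30, 100))
# Applying the operation twice restores the original coordinates.
print(invert_interval(*invert_interval(10, 30, 100), 100))
```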

fromGTF(other, gene_id=None, transcript_id=None)

fill record from other GFF/GTF entry.

The optional attributes are not copied.

fromBed(other, **kwargs)

fill record from a bed entry.

copy(other)

fill from other entry.

This method works if other is GTF.Entry or pysam.GTFProxy.

asDict()

return attributes as a dictionary.

hasOverlap(other, min_overlap=0)

returns true, if overlap with other entry.

isIdentical(other, max_slippage=0)

returns true, if self and other overlap completely.

isHalfIdentical(other, max_slippage=0)

returns true, if self and other overlap.