IndexedGenome.py - Random access to interval lists

This module provides a consistent front-end to various interval containers.

Two implementations are available:

NCL

Nested containment lists as described in http://bioinformatics.oxfordjournals.org/content/23/11/1386.short. The implementation was taken from pygr.

quicksect

The quicksect algorithm used in Galaxy. This requires bx-python to be installed. The benefit of quicksect is that it also allows quick retrieval of the intervals that are closest before or after a query.

The principal class is IndexedGenome, which uses NCL and stores a value associated with each interval. Quicksect is equivalent to IndexedGenome but uses quicksect. Simple is a light-weight version of IndexedGenome that does not store a value and thus saves space.

The basic usage is:

from IndexedGenome import IndexedGenome

# intervals: any iterable of (contig, start, end, value) tuples
index = IndexedGenome()
for contig, start, end, value in intervals:
    index.add(contig, start, end, value)

print(index.contains("chr1", 1000, 2000))
print(index.get("chr1", 10000, 20000))

The index is built in memory.
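
To make the interface above concrete without requiring the module itself, here is a minimal pure-Python sketch of the same add/contains/get pattern. ToyIndexedGenome is a hypothetical stand-in, not the module's implementation: it keeps a plain list per contig and scans it linearly instead of using NCL, and it assumes half-open [start, end) coordinates and (start, end, value) result tuples, neither of which is stated in this documentation.

```python
from collections import defaultdict


class ToyIndexedGenome:
    """Hypothetical stand-in: one interval list per contig, linear scan."""

    def __init__(self):
        # Maps contig name -> list of (start, end, value) tuples.
        self._index = defaultdict(list)

    def add(self, contig, start, end, value):
        self._index[contig].append((start, end, value))

    def get(self, contig, start, end):
        # Assumption: half-open [start, end) coordinates, so two intervals
        # overlap when each one starts before the other ends.
        return [iv for iv in self._index.get(contig, [])
                if iv[0] < end and iv[1] > start]

    def contains(self, contig, start, end):
        return len(self.get(contig, start, end)) > 0


# Made-up example data for illustration only.
intervals = [
    ("chr1", 1000, 2000, "geneA"),
    ("chr1", 1500, 3000, "geneB"),
    ("chr2", 500, 800, "geneC"),
]

index = ToyIndexedGenome()
for contig, start, end, value in intervals:
    index.add(contig, start, end, value)

print(index.contains("chr1", 1900, 2100))
print(index.get("chr1", 2500, 2600))
print(index.contains("chr3", 0, 100))
```

A linear scan is only reasonable for small interval sets; the containers described above exist to avoid exactly this cost on genome-scale data.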

Reference

class IndexedGenome.IndexedGenome

Bases: object

Genome with indexed intervals.

index_factory

alias of cgat.NCL.NCL

get(contig, start, end)

Return the intervals on contig that overlap the range from start to end.

class IndexedGenome.Simple(*args, **kwargs)

Bases: IndexedGenome.IndexedGenome

Index intervals without storing a value.

index_factory

alias of cgat.NCL.NCLSimple

class IndexedGenome.Quicksect(*args, **kwargs)

Bases: IndexedGenome.IndexedGenome

Index intervals using quicksect.

Permits finding closest interval in case there is no overlap.

get(contig, start, end)

Return the intervals on contig that overlap the range from start to end.

before(contig, start, end, num_intervals=1, max_dist=2500)

Get the closest interval before start.

after(contig, start, end, num_intervals=1, max_dist=2500)

Get the closest interval after end.
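
The following sketch illustrates one plausible reading of the before and after queries and their num_intervals and max_dist parameters. It is not the quicksect implementation: it works on a plain list of (start, end) tuples for a single contig, and it assumes that "before" means an interval ending at or before the query start, that "after" means an interval starting at or after the query end, and that max_dist is an inclusive cut-off on the gap. The actual semantics in bx-python may differ.

```python
def before(intervals, start, num_intervals=1, max_dist=2500):
    """Closest intervals ending at or before `start`, nearest first."""
    candidates = [iv for iv in intervals
                  if iv[1] <= start and start - iv[1] <= max_dist]
    # Smallest gap between interval end and query start comes first.
    candidates.sort(key=lambda iv: start - iv[1])
    return candidates[:num_intervals]


def after(intervals, end, num_intervals=1, max_dist=2500):
    """Closest intervals starting at or after `end`, nearest first."""
    candidates = [iv for iv in intervals
                  if iv[0] >= end and iv[0] - end <= max_dist]
    # Smallest gap between query end and interval start comes first.
    candidates.sort(key=lambda iv: iv[0] - end)
    return candidates[:num_intervals]


# Made-up intervals on one contig; the query 3000-4000 overlaps none of them.
chr1 = [(100, 200), (900, 1000), (5000, 6000), (9000, 9500)]
query_start, query_end = 3000, 4000

upstream = before(chr1, query_start)
downstream = after(chr1, query_end)
print(upstream)
print(downstream)

# Widening max_dist and asking for two intervals returns the farther ones too.
print(before(chr1, query_start, num_intervals=2, max_dist=10000))
print(after(chr1, query_end, num_intervals=2, max_dist=10000))
```

With the default max_dist of 2500, intervals farther than that from the query are ignored, so a query can return an empty list even when the contig has intervals on that side.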