gff2histogram.py - compute histograms from intervals in gff or bed format

Tags

Genomics Intervals GFF Summary

Purpose

This script computes distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals in gff or bed format.

The output is written to separate files. Filenames are given by --output-filename-pattern.

Available methods are:

hist

Output a histogram of interval sizes and distances between intervals in nucleotides.

stats

Output summary statistics of interval sizes and distances between intervals.

values

Output distances, sizes, and overlap values to separate files.

overlaps

Output features that overlap other features to a separate file.

all

All of the above.

Usage

For example, a small gff file such as this (note that intervals need to be sorted by position):

chr19   processed_transcript    exon    60105   60162   .       -       .
chr19   processed_transcript    exon    60521   60747   .       -       .
chr19   processed_transcript    exon    65822   66133   .       -       .
chr19   processed_transcript    exon    66346   66416   .       -       .
chr19   processed_transcript    exon    66346   66509   .       -       .

will give when called as:

cgat gff2histogram < in.gff

the following output files:

hist

Histogram of feature sizes and distances between adjacent features

residues  size  distance
58.0      1     na
71.0      1     na
164.0     1     na
212.0     na    1
227.0     1     na
312.0     1     na
358.0     na    1
5074.0    na    1
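
The histogram above can be reproduced by hand from the example file. The sketch below is not the script's implementation. It assumes, based only on the example output, that a feature's size is `end - start + 1` (GFF coordinates are 1-based and inclusive), that the distance is the number of bases strictly between two neighbouring features on the same contig, and that overlapping neighbours contribute no distance. Whether a gap of zero counts as a distance is also an assumption. The names `GFF_TEXT`, `parse_gff` and `sizes_and_distances` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# The five example features from the gff file above (whitespace separated here).
GFF_TEXT = """\
chr19 processed_transcript exon 60105 60162 . - .
chr19 processed_transcript exon 60521 60747 . - .
chr19 processed_transcript exon 65822 66133 . - .
chr19 processed_transcript exon 66346 66416 . - .
chr19 processed_transcript exon 66346 66509 . - .
"""


def parse_gff(text):
    """Return (contig, start, end) tuples, skipping blank and comment lines."""
    intervals = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        intervals.append((fields[0], int(fields[3]), int(fields[4])))
    return intervals


def sizes_and_distances(intervals):
    """Compute sizes and gaps; assumes input is sorted by contig and start."""
    # Assumption: 1-based inclusive coordinates, so size = end - start + 1.
    sizes = [end - start + 1 for _, start, end in intervals]
    distances = []
    for (contig1, _, end1), (contig2, start2, _) in zip(intervals, intervals[1:]):
        gap = start2 - end1 - 1  # bases strictly between the two features
        # Assumption: a negative gap means overlap and yields no distance.
        if contig1 == contig2 and gap >= 0:
            distances.append(gap)
    return sizes, distances


intervals = parse_gff(GFF_TEXT)
sizes, distances = sizes_and_distances(intervals)
size_counts, distance_counts = Counter(sizes), Counter(distances)

# One row per observed value; "na" marks a column with no observation.
print("residues\tsize\tdistance")
for value in sorted(set(size_counts) | set(distance_counts)):
    size = size_counts.get(value, "na")
    distance = distance_counts.get(value, "na")
    print(f"{float(value)}\t{size}\t{distance}")
```

The last two features overlap, so only three distances are recorded for five features, which matches the `nval` of the distance row in the stats output.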

stats

Summary statistics of the distribution of feature size and distance between adjacent features.

data      nval  min       max        mean       median    stddev     sum        q1        q3
size      5     58.0000   312.0000   166.4000   164.0000  95.6339    832.0000   71.0000   227.0000
distance  3     212.0000  5074.0000  1881.3333  358.0000  2258.3430  5644.0000  212.0000  5074.0000
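
As a cross-check, the summary values can be recomputed from the five sizes and three distances of the example. This is a sketch under assumptions, not the script's code: the standard deviation shown is consistent with the population formula (`statistics.pstdev`), and the quartiles are consistent with a simple index rule, `sorted_values[int(n * q)]`. Both are inferred from the numbers in the example, and the helper name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Summary statistics under the assumptions stated above."""
    ordered = sorted(values)
    n = len(ordered)
    return {
        "nval": n,
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stddev": statistics.pstdev(ordered),  # population standard deviation
        "sum": sum(ordered),
        # Assumption: quartiles picked by index, without interpolation.
        "q1": ordered[int(n * 0.25)],
        "q3": ordered[int(n * 0.75)],
    }


sizes = [58, 227, 312, 71, 164]
distances = [358, 5074, 212]

columns = ["min", "max", "mean", "median", "stddev", "sum", "q1", "q3"]
print("\t".join(["data", "nval"] + columns))
for name, values in (("size", sizes), ("distance", distances)):
    stats = summarize(values)
    row = [name, str(stats["nval"])]
    row += [f"{float(stats[key]):.4f}" for key in columns]
    print("\t".join(row))
```

An interpolating quartile method would give different `q1` and `q3` values for such small samples, so the index rule is used here only because it agrees with the example.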

overlaps

A file with features that overlap other features, here:

chr19   processed_transcript    exon    66346   66416   .       -       .       chr19   processed_transcript    exon    66346   66509   .       -       .
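
The overlapping pair reported above can be found with a single pass over sorted features. The following is a minimal sketch, assuming that only neighbouring features in the sorted input are compared and that two features overlap when they share at least one base in 1-based inclusive coordinates. The function name `adjacent_overlaps` is hypothetical, and the script itself may detect overlaps differently.

```python
def adjacent_overlaps(intervals):
    """Return pairs of neighbouring (contig, start, end) tuples that overlap.

    Assumes the input is sorted by contig and start position.
    """
    pairs = []
    for previous, current in zip(intervals, intervals[1:]):
        same_contig = previous[0] == current[0]
        # With inclusive coordinates, start <= previous end means a shared base.
        if same_contig and current[1] <= previous[2]:
            pairs.append((previous, current))
    return pairs


# The five example exons as (contig, start, end).
intervals = [
    ("chr19", 60105, 60162),
    ("chr19", 60521, 60747),
    ("chr19", 65822, 66133),
    ("chr19", 66346, 66416),
    ("chr19", 66346, 66509),
]

for previous, current in adjacent_overlaps(intervals):
    print(previous, current)
```

Comparing only neighbours keeps the pass linear, but it would miss a long feature that overlaps several later ones. Tracking the largest end position seen so far would be one way to cover that case.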

Type:

python gff2histogram.py --help

for command line help.

Command line options

usage: gff2histogram [-h] [--version] [-b BIN_SIZE] [--min-value MIN_VALUE]
                     [--max-value MAX_VALUE] [--no-empty-bins]
                     [--with-empty-bins] [--ignore-out-of-range]
                     [--missing-value MISSING_VALUE] [--use-dynamic-bins]
                     [--format {gff,gtf,bed}]
                     [--method {all,hist,stats,overlaps,values}]
                     [--output-section {all,size,distance}]
                     [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                     [--timeit-header] [--random-seed RANDOM_SEED]
                     [-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
                     [--tracing {function}] [-? ?]
                     [-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN] [-L STDLOG]
                     [-E STDERR] [-S STDOUT]