gff2histogram.py - compute histograms from intervals in gff or bed format
- Tags
Genomics Intervals GFF Summary
Purpose
This script computes distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals in gff or bed format.
The output is written to separate files. Filenames are given by --output-filename-pattern.
Available methods are:
- hist
Output a histogram of interval sizes and distances between intervals in nucleotides.
- stats
Output summary statistics of interval sizes and distances between intervals.
- values
Output distances, sizes, and overlap values to separate files.
- all
All of the above.
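The quantities named above can be sketched in a few lines of Python. This is a minimal illustration and not the script's own implementation: the function name interval_values, the tuple layout and the toy coordinates are all assumptions, as are the conventions that coordinates are 1-based and inclusive (as in GFF) and that an interval overlapping its predecessor contributes an overlap value instead of a distance.

```python
from collections import Counter


def interval_values(intervals):
    """Return (sizes, distances, overlaps) for position-sorted intervals.

    `intervals` is a list of (contig, start, end) tuples with 1-based,
    inclusive coordinates. Hypothetical helper, for illustration only.
    """
    sizes, distances, overlaps = [], [], []
    previous = None
    for contig, start, end in intervals:
        sizes.append(end - start + 1)
        # Only compare neighbours that lie on the same contig.
        if previous is not None and previous[0] == contig:
            gap = start - previous[2] - 1  # nucleotides strictly between the two
            if gap >= 0:
                distances.append(gap)
            else:
                # Number of nucleotides shared with the previous interval.
                overlaps.append(min(end, previous[2]) - start + 1)
        previous = (contig, start, end)
    return sizes, distances, overlaps


# Toy input (made up for this sketch), already sorted by position.
toy = [("chr1", 1, 10), ("chr1", 21, 30), ("chr1", 25, 40), ("chr2", 5, 9)]
sizes, distances, overlaps = interval_values(toy)

print(sizes)           # [10, 10, 16, 5]
print(distances)       # [10]
print(overlaps)        # [6]
print(Counter(sizes))  # histogram of sizes: 10 occurs twice, 16 and 5 once
```

The sketch compares each interval only with the one directly before it, which is why the input has to be sorted by position.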
Usage
For example, a small gff file such as this (note that intervals need to be sorted by position):
chr19 processed_transcript exon 60105 60162 . - .
chr19 processed_transcript exon 60521 60747 . - .
chr19 processed_transcript exon 65822 66133 . - .
chr19 processed_transcript exon 66346 66416 . - .
chr19 processed_transcript exon 66346 66509 . - .
produces, when called as:
cgat gff2histogram < in.gff
the following output files:
- hist
Histogram of feature sizes and distances between adjacent features
residues  size  distance
58.0      1     na
71.0      1     na
164.0     1     na
212.0     na    1
227.0     1     na
312.0     1     na
358.0     na    1
5074.0    na    1
- stats
Summary statistics of the distribution of feature size and distance between adjacent features.
data      nval  min       max        mean       median    stddev     sum        q1        q3
size      5     58.0000   312.0000   166.4000   164.0000  95.6339    832.0000   71.0000   227.0000
distance  3     212.0000  5074.0000  1881.3333  358.0000  2258.3430  5644.0000  212.0000  5074.0000
- overlaps
A file with features that overlap other features, here:
chr19 processed_transcript exon 66346 66416 . - . chr19 processed_transcript exon 66346 66509 . - .
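As a cross-check, the summary statistics can be recomputed from the sizes and distances listed in the histogram. The sketch below is not taken from the script: the helper name summarise is hypothetical, the standard deviation is assumed to be the population form (dividing by n), and the quartile rule (the element at index int(n * q) of the sorted values) is a guess that happens to reproduce the q1 and q3 columns shown here; the script's actual method may differ.

```python
import statistics


def summarise(values):
    """Summary statistics for a list of numbers (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    return {
        "nval": n,
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stddev": statistics.pstdev(ordered),  # population form, divides by n
        "sum": sum(ordered),
        # Assumed quartile rule: element at index int(n * q).
        "q1": ordered[int(n * 0.25)],
        "q3": ordered[int(n * 0.75)],
    }


# Values read off the histogram in this section.
sizes = [58, 71, 164, 227, 312]
distances = [212, 358, 5074]

for name, values in (("size", sizes), ("distance", distances)):
    row = summarise(values)
    cells = [str(row["nval"])] + [f"{row[k]:.4f}" for k in row if k != "nval"]
    print(name, *cells)
```

Under these assumptions the printed rows agree with the size and distance rows of the stats output.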
Type:
python gff2histogram.py --help
for command line help.
Command line options
usage: gff2histogram [-h] [--version] [-b BIN_SIZE] [--min-value MIN_VALUE]
                     [--max-value MAX_VALUE] [--no-empty-bins]
                     [--with-empty-bins] [--ignore-out-of-range]
                     [--missing-value MISSING_VALUE] [--use-dynamic-bins]
                     [--format {gff,gtf,bed}]
                     [--method {all,hist,stats,overlaps,values}]
                     [--output-section {all,size,distance}]
                     [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                     [--timeit-header] [--random-seed RANDOM_SEED]
                     [-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
                     [--tracing {function}] [-? ?]
                     [-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN] [-L STDLOG]
                     [-E STDERR] [-S STDOUT]
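The --format option accepts gff, gtf and bed. These formats use different coordinate conventions: GFF and GTF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open. The sketch below shows one way to bring both onto a common footing so that interval sizes agree. The helper name to_half_open is hypothetical, and whether gff2histogram normalises coordinates this way internally is not stated in this document.

```python
def to_half_open(start, end, fmt):
    """Convert file coordinates to a 0-based, half-open (start, end) pair.

    `fmt` is one of "gff", "gtf" or "bed". Hypothetical helper.
    """
    if fmt in ("gff", "gtf"):
        return start - 1, end  # 1-based inclusive -> 0-based half-open
    if fmt == "bed":
        return start, end      # BED is already 0-based half-open
    raise ValueError(f"unknown format: {fmt}")


# The first exon of the usage example, written in both conventions.
gff_start, gff_end = to_half_open(60105, 60162, "gff")
bed_start, bed_end = to_half_open(60104, 60162, "bed")

print((gff_start, gff_end) == (bed_start, bed_end))  # True
print(gff_end - gff_start)  # 58, the size reported for this exon
```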