gff2coverage.py - compute genomic coverage of gff intervals¶
- Tags
Genomics Intervals Summary GFF
Purpose¶
This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.
Usage¶
You can use two methods to compute the coverage: genomic and histogram.
Let us explain their usage with this small.gtf
file:
19 processed_transcript exon 16 16 . - . gene_id
19 processed_transcript exon 27 27 . - . gene_id
19 processed_transcript exon 8 8 . - . gene_id
19 processed_transcript exon 19 19 . - . gene_id
19 processed_transcript exon 5 5 . - . gene_id
and this toy example (small.fasta
) of an indexed fasta file:
>chr19
GCCGGCCTCTACCTGCAGCAGATGCCCTAT
Both files (small.gtf
and small.fasta
) are included
in the GitHub repository.
genomic method¶
The genomic method computes the coverage of intervals accross the genome file given as input. Let us see how to apply the genomic method to the small examples above:
python gff2coverage.py --method=genomic --genome-file=small < small.gtf
The output (wrapped to fit here) will be:
contig source feature intervals bases p_coverage total_p_coverage
19 trans. exon 5 5 16.666667 16.666667
As you can see the information displayed is the following: contig name, source, feature name, number of intervals within the contig, number of bases, percentage of coverage in the contig, and percentage of coverage in the genome file.
histogram method¶
On the contrary, if you want to compute the coverage of intervals within the gff file itself summarized as an histogram and grouped by contig name, please use the histogram method.
To use the histogram method with the input files above, please type:
python gff2coverage.py --method=histogram --window=5 --features=exon --output-filename-pattern=%s.hist < small.gtf
In this case the output (written to file 19.hist
) is:
abs_pos rel_pos abs_exon rel_exon
0 0.0000 1 0.2000
5 0.1852 2 0.4000
10 0.3704 2 0.4000
15 0.5556 4 0.8000
20 0.7407 4 0.8000
25 0.9259 5 1.0000
The output is given as a pair of columns. The first pair of columns always
appears and lists the cumulative numbers of nucleotides in each window or
bin –absolute and relative values in the former and latter columns,
respectively. The subsequent pair of columns depends on the values given to
the --features
option. In this example there is an extra column for the
exon
feature but you could especify as many of them as you wanted among
those features listed in your gff file.
On the other hand, the --num-bins
option can be used instead of
--window
along with --genome-file
to define the number of bins for the
resultant histogram. This parameter is used by default (with value: 1000)
when using the histogram method.
Please note the following:
you need to specify the feature name explicitly (with the
--feature
option) to compute the genomic coverage of that feature. You can also usea comma-separated list of feature names.the output of the histogram method goes to a file (in the current workingdirectory) which is named as the contig name by default. To change thisbehaviour, please use the
--output-filename-pattern
option where%s
will be substituted by the contig name.
Command line options¶
usage: gff2coverage [-h] [--version] [-g GENOME_FILE] [-f FEATURES]
[-w WINDOW_SIZE] [-b NUM_BINS] [-m {genomic,histogram}]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED]
[-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
[-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]
gff2coverage: error: argument -?: expected one argument