bam2peakshape.py - compute peak shape features from a bam-file¶
- Tags
Genomics NGS Intervals BAM BED Summary
Purpose¶
This script takes a bed formatted file with regions of interest, for example binding intervals from a ChIP-Seq experiment. Using a collection of aligned reads is a bam formatted file or bigwig formatted file, the script outputs a collection of features describing the peak shape.
This script is designed with a slight emphasis on ChIP-Seq datasets. The main reason that this script is better suited for ChIP-Seq is that(1) it is able to center the counting window at the summit of every individual peak; (2) it is also able to use the control ChIP-Seq library to enable side-by-side comparison of treatment vs control;(3) it can randomly shift the set of input regions to generate a artificial set of regions, in the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.
For example, given the peaks regions defined by analyzing some ChIP-Seq dataset (e.g. by using MACS), and without the need to use any additional genomic annotations (e.g. ENSEMBL, refseq), we can visualise the binding profiles of transcriptionfactors ChIP-Seq data relative to the center of each peak regions.
The script outputs a tab-separated table on stdout containing features for each interval. A peak is defined as the location of the highest density in an interval. The width of the peak (peak_width) is defined as the region around the peak in which the density does not drop below a threshold of peak_heigt * 90%.
Usage¶
Detailed usage example¶
The following command will generate the peak shape plot for the peak
regions defined in onepeak.bed
, using the reads stored in
small.bam
. The command will also create a profile for the
control library. The control library in this example is re-using the
same reads file small.bam
, however, in your actual experiment,
it should be a different library (the input library for this ChIP-Seq
experiment).:
python ./scripts/bam2peakshape.py ./tests/bam2peakshape.py/small.bam ./tests/bam2peakshape.py/onepeak.bed --control-bam-file=./tests/bam2peakshape.py/small.bam --use-interval --normalize-transcript
Output files¶
Among the features output are:
Column |
Content |
peak_height |
number of reads at peak |
peak_median |
median coverage compared to peak height |
interval_width |
width of interval |
peak_width |
width of peak |
bins |
bins for a histogram of densities within the interval. |
npeaks |
number of density peaks in interval. |
peak_center |
point of highest density in interval |
peak_relative_pos |
point of highest density in interval coordinates |
counts |
counts for a histogram of densities within the interval |
furthest_half_heigh |
Distance of peak center to furthest half-height position |
closest_half_height |
Distance of peak center to closest half-height position |
Additionally, the script outputs a set of matrixes with densities over
intervals that can be used for plotting. The default filenames are
(matrix|control)_<sortorder>.tsv.gz
, The names can be controlled
with the --output-filename-pattern
option.
Type:
python bam2peakshape.py --help
for command line help.
Options¶
Option: Shift¶
shift the each read by a certain distance, because in a ChIP-Seq experment, the read is always at the edge of an sonicated fragment, the actual binding site is usually L/2 distance away from the read, where L is the length of sonicated fragment (determined either experimentally or computationally).
This option is used only if the input reads are in bam formatted file. If input reads are bigwig formatted file, this option is ignored.
Option: Random shift¶
randomly shift the set of input regions to generate a artificial set of regions. In the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.
Option: Centring method¶
“reads” will output in the way that the summit of the peaks are aligned. “middle” will output in the way that the middle of the input bed intervals are aligned.
Option: Only interval¶
Only count reads that are in the interval as defined by the input bed file.
Option: normalization=sum¶
normalize counts such that the sum of all counts in all features are exactly 1000000.
The detail normalization algorithm as follows: norm = sum(all counts in all features)/1000000.0 normalized count = normalized count / norm
Todo
paired-endedness is not fully implemented.
Command line options¶
usage: bam2peakshape [-h] [--version] [-f {bam,bigwig}] [-o] [-w WINDOW_SIZE]
[-b BIN_SIZE] [--smooth-method {none,sum,sg}]
[-s {peak-height,peak-width,unsorted,interval-width,interval-score}]
[-c CONTROL_FILES] [-r] [-e {reads,middle}]
[-n {none,sum}] [--use-strand] [-i SHIFT]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED]
[-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?]
[-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]
bam2peakshape: error: argument -?: expected one argument