bam2peakshape.py - compute peak shape features from a bam-file¶

Tags: Genomics NGS Intervals BAM BED Summary

Purpose¶

This script takes a bed formatted file with regions of interest, for example binding intervals from a ChIP-Seq experiment. Using a collection of aligned reads is a bam formatted file or bigwig formatted file, the script outputs a collection of features describing the peak shape.

This script is designed with a slight emphasis on ChIP-Seq datasets. The main reason that this script is better suited for ChIP-Seq is that(1) it is able to center the counting window at the summit of every individual peak; (2) it is also able to use the control ChIP-Seq library to enable side-by-side comparison of treatment vs control;(3) it can randomly shift the set of input regions to generate a artificial set of regions, in the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.

For example, given the peaks regions defined by analyzing some ChIP-Seq dataset (e.g. by using MACS), and without the need to use any additional genomic annotations (e.g. ENSEMBL, refseq), we can visualise the binding profiles of transcriptionfactors ChIP-Seq data relative to the center of each peak regions.

The script outputs a tab-separated table on stdout containing features for each interval. A peak is defined as the location of the highest density in an interval. The width of the peak (peak_width) is defined as the region around the peak in which the density does not drop below a threshold of peak_heigt * 90%.

Usage¶

Detailed usage example¶

The following command will generate the peak shape plot for the peak regions defined in onepeak.bed, using the reads stored in small.bam. The command will also create a profile for the control library. The control library in this example is re-using the same reads file small.bam, however, in your actual experiment, it should be a different library (the input library for this ChIP-Seq experiment).:

python ./scripts/bam2peakshape.py         ./tests/bam2peakshape.py/small.bam         ./tests/bam2peakshape.py/onepeak.bed         --control-bam-file=./tests/bam2peakshape.py/small.bam         --use-interval         --normalize-transcript

Output files¶

Among the features output are:

Column	Content
peak_height	number of reads at peak
peak_median	median coverage compared to peak height
interval_width	width of interval
peak_width	width of peak
bins	bins for a histogram of densities within the interval.
npeaks	number of density peaks in interval.
peak_center	point of highest density in interval
peak_relative_pos	point of highest density in interval coordinates
counts	counts for a histogram of densities within the interval
furthest_half_heigh	Distance of peak center to furthest half-height position
closest_half_height	Distance of peak center to closest half-height position

Additionally, the script outputs a set of matrixes with densities over intervals that can be used for plotting. The default filenames are (matrix|control)_<sortorder>.tsv.gz, The names can be controlled with the --output-filename-pattern option.

Type:

python bam2peakshape.py --help

for command line help.

Options¶

Option: Shift¶

shift the each read by a certain distance, because in a ChIP-Seq experment, the read is always at the edge of an sonicated fragment, the actual binding site is usually L/2 distance away from the read, where L is the length of sonicated fragment (determined either experimentally or computationally).

This option is used only if the input reads are in bam formatted file. If input reads are bigwig formatted file, this option is ignored.

Option: Random shift¶

randomly shift the set of input regions to generate a artificial set of regions. In the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.

Option: Centring method¶

“reads” will output in the way that the summit of the peaks are aligned. “middle” will output in the way that the middle of the input bed intervals are aligned.

Option: Only interval¶

Only count reads that are in the interval as defined by the input bed file.

Option: normalization=sum¶

normalize counts such that the sum of all counts in all features are exactly 1000000.

The detail normalization algorithm as follows: norm = sum(all counts in all features)/1000000.0 normalized count = normalized count / norm

Todo

paired-endedness is not fully implemented.

Command line options¶

usage: bam2peakshape [-h] [--version] [-f {bam,bigwig}] [-o] [-w WINDOW_SIZE]
                     [-b BIN_SIZE] [--smooth-method {none,sum,sg}]
                     [-s {peak-height,peak-width,unsorted,interval-width,interval-score}]
                     [-c CONTROL_FILES] [-r] [-e {reads,middle}]
                     [-n {none,sum}] [--use-strand] [-i SHIFT]
                     [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                     [--timeit-header] [--random-seed RANDOM_SEED]
                     [-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
                     [--tracing {function}] [-? ?]
                     [-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN] [-L STDLOG]
                     [-E STDERR] [-S STDOUT]
bam2peakshape: error: argument -?: expected one argument