bam2peakshape.py - compute peak shape features from a bam-file

Tags

Genomics NGS Intervals BAM BED Summary

Purpose

This script takes a bed formatted file with regions of interest, for example binding intervals from a ChIP-Seq experiment. Using a collection of aligned reads is a bam formatted file or bigwig formatted file, the script outputs a collection of features describing the peak shape.

This script is designed with a slight emphasis on ChIP-Seq datasets. The main reason that this script is better suited for ChIP-Seq is that(1) it is able to center the counting window at the summit of every individual peak; (2) it is also able to use the control ChIP-Seq library to enable side-by-side comparison of treatment vs control;(3) it can randomly shift the set of input regions to generate a artificial set of regions, in the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.

For example, given the peaks regions defined by analyzing some ChIP-Seq dataset (e.g. by using MACS), and without the need to use any additional genomic annotations (e.g. ENSEMBL, refseq), we can visualise the binding profiles of transcriptionfactors ChIP-Seq data relative to the center of each peak regions.

The script outputs a tab-separated table on stdout containing features for each interval. A peak is defined as the location of the highest density in an interval. The width of the peak (peak_width) is defined as the region around the peak in which the density does not drop below a threshold of peak_heigt * 90%.

Usage

Detailed usage example

The following command will generate the peak shape plot for the peak regions defined in onepeak.bed, using the reads stored in small.bam. The command will also create a profile for the control library. The control library in this example is re-using the same reads file small.bam, however, in your actual experiment, it should be a different library (the input library for this ChIP-Seq experiment).:

python ./scripts/bam2peakshape.py         ./tests/bam2peakshape.py/small.bam         ./tests/bam2peakshape.py/onepeak.bed         --control-bam-file=./tests/bam2peakshape.py/small.bam         --use-interval         --normalize-transcript

Output files

Among the features output are:

Column

Content

peak_height

number of reads at peak

peak_median

median coverage compared to peak height

interval_width

width of interval

peak_width

width of peak

bins

bins for a histogram of densities within the interval.

npeaks

number of density peaks in interval.

peak_center

point of highest density in interval

peak_relative_pos

point of highest density in interval coordinates

counts

counts for a histogram of densities within the interval

furthest_half_heigh

Distance of peak center to furthest half-height position

closest_half_height

Distance of peak center to closest half-height position

Additionally, the script outputs a set of matrixes with densities over intervals that can be used for plotting. The default filenames are (matrix|control)_<sortorder>.tsv.gz, The names can be controlled with the --output-filename-pattern option.

Type:

python bam2peakshape.py --help

for command line help.

Options

Option: Shift

shift the each read by a certain distance, because in a ChIP-Seq experment, the read is always at the edge of an sonicated fragment, the actual binding site is usually L/2 distance away from the read, where L is the length of sonicated fragment (determined either experimentally or computationally).

This option is used only if the input reads are in bam formatted file. If input reads are bigwig formatted file, this option is ignored.

Option: Random shift

randomly shift the set of input regions to generate a artificial set of regions. In the absence of real ChIP-Seq control library, the random regions can provide a peaks profile that can be used as the control.

Option: Centring method

“reads” will output in the way that the summit of the peaks are aligned. “middle” will output in the way that the middle of the input bed intervals are aligned.

Option: Only interval

Only count reads that are in the interval as defined by the input bed file.

Option: normalization=sum

normalize counts such that the sum of all counts in all features are exactly 1000000.

The detail normalization algorithm as follows: norm = sum(all counts in all features)/1000000.0 normalized count = normalized count / norm

Todo

paired-endedness is not fully implemented.

Command line options

usage: bam2peakshape [-h] [--version] [-f {bam,bigwig}] [-o] [-w WINDOW_SIZE]
                     [-b BIN_SIZE] [--smooth-method {none,sum,sg}]
                     [-s {peak-height,peak-width,unsorted,interval-width,interval-score}]
                     [-c CONTROL_FILES] [-r] [-e {reads,middle}]
                     [-n {none,sum}] [--use-strand] [-i SHIFT]
                     [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                     [--timeit-header] [--random-seed RANDOM_SEED]
                     [-v LOGLEVEL] [--log-config-filename LOG_CONFIG_FILENAME]
                     [--tracing {function}] [-? ?]
                     [-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN] [-L STDLOG]
                     [-E STDERR] [-S STDOUT]
bam2peakshape: error: argument -?: expected one argument