bam_vs_bed.py - count context that reads map to

Tags

Genomics NGS Intervals BAM BED Counting

Purpose

This script takes as input a BAM file from an RNA-seq or similar experiment and a bed-formatted file. The bed file needs at least four columns; the fourth (name) column is used to group counts.

The script counts the number of alignments in the first input file that overlap each feature in the second file. Annotations in the bed file may overlap one another; each is counted independently.
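The counting rule can be illustrated with a minimal Python sketch. This is not the script's implementation (the script relies on bedtools); the function name count_by_name and the tuple layouts are hypothetical, chosen only for illustration, and coordinates are assumed to be half-open as in the bed format.

```python
from collections import Counter


def count_by_name(reads, features):
    """Count reads per feature name.

    reads:    list of (chrom, start, end) tuples (hypothetical layout)
    features: list of (chrom, start, end, name) tuples (hypothetical layout)
    """
    counts = Counter()
    for r_chrom, r_start, r_end in reads:
        for f_chrom, f_start, f_end, name in features:
            # Half-open intervals overlap when each starts before the other ends.
            if r_chrom == f_chrom and r_start < f_end and f_start < r_end:
                # Every overlapping feature is counted independently, so a
                # read spanning two annotations contributes to both names.
                counts[name] += 1
    return counts


reads = [("chr1", 100, 150), ("chr1", 400, 450)]
features = [("chr1", 0, 200, "exon"), ("chr1", 120, 500, "intron")]
counts = count_by_name(reads, features)
```

Here the first read overlaps both annotations and is therefore counted once under each name, while the second read overlaps only the second annotation.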

Note that duplicate intervals will be counted multiple times. This situation can easily arise when building a set of genomic annotations based on a geneset with alternative transcripts. For example:

chr1     10000     20000     protein_coding            # gene1, transcript1
chr1     10000     20000     protein_coding            # gene1, transcript2

Any reads overlapping the interval chr1:10000-20000 will be counted twice into the protein_coding bin by bedtools. To avoid this, remove any duplicates from the bed file:

zcat input_with_duplicates.bed.gz | cgat bed2bed --merge-by-name | bgzip > input_without_duplicates.bed.gz
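To make the double-counting problem concrete, the following Python sketch drops records that repeat the same chromosome, start, end and name. It is an illustration only and is not claimed to reproduce what cgat bed2bed --merge-by-name does; the function name drop_duplicate_intervals is hypothetical, and the input is assumed to be tab-separated bed lines.

```python
def drop_duplicate_intervals(bed_lines):
    """Keep the first occurrence of each (chrom, start, end, name) record."""
    seen = set()
    unique = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Only the first four bed columns identify a duplicate in this sketch.
        key = tuple(fields[:4])
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


bed_lines = [
    "chr1\t10000\t20000\tprotein_coding\n",  # gene1, transcript1
    "chr1\t10000\t20000\tprotein_coding\n",  # gene1, transcript2
    "chr1\t30000\t40000\tlincRNA\n",
]
unique = drop_duplicate_intervals(bed_lines)
```

After filtering, the interval chr1:10000-20000 appears only once, so a read overlapping it would be added to the protein_coding bin a single time.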

This script requires bedtools to be installed.

Options

-a, --bam-file / -b, --bed-file

These are the input files. They can also be provided as positional arguments, with the bam file first and the (gzipped or uncompressed) bed file second.

-m, --min-overlap

With this option, a read is counted only if it overlaps a bed entry by at least the given minimum fraction of the read.
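The minimum-overlap criterion can be sketched in Python as follows. This is a hedged illustration rather than the script's own code: the function names overlap_fraction and passes_min_overlap are hypothetical, coordinates are assumed to be half-open, and treating the threshold as inclusive is an assumption.

```python
def overlap_fraction(read_start, read_end, feat_start, feat_end):
    """Fraction of the read covered by the feature (half-open coordinates)."""
    read_length = read_end - read_start
    if read_length <= 0:
        return 0.0
    overlap = min(read_end, feat_end) - max(read_start, feat_start)
    # A negative value means the intervals do not overlap at all.
    return max(overlap, 0) / read_length


def passes_min_overlap(read, feature, min_overlap):
    """Return True if the read overlaps the feature by at least min_overlap."""
    # Assumption: the threshold is inclusive.
    return overlap_fraction(read[0], read[1], feature[0], feature[1]) >= min_overlap


read = (100, 200)      # 100 bases long
feature = (150, 300)   # covers the last 50 bases of the read
fraction = overlap_fraction(*read, *feature)
```

In this example half of the read lies inside the feature, so the read would be counted with a minimum overlap of 0.5 but not with 0.75.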

Example

For example:

python bam_vs_bed.py in.bam in.bed.gz

Usage

Type:

cgat bam_vs_bed BAM BED [OPTIONS]
cgat bam_vs_bed --bam-file=BAM --bed-file=BED [OPTIONS]

where BAM is either a bam or bed file and BED is a bed file.

Type:

cgat bam_vs_bed --help

for command line help.

Command line options

usage: bam-vs-bed [-h] [--version] [-m MIN_OVERLAP] [-a bam] [-b bed] [-s]
                  [--assume-sorted] [--split-intervals] [--timeit TIMEIT_FILE]
                  [--timeit-name TIMEIT_NAME] [--timeit-header]
                  [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                  [--log-config-filename LOG_CONFIG_FILENAME]
                  [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                  [-E STDERR] [-S STDOUT]