bam2geneprofile.py - build meta-gene profile for a set of transcripts/genes

Tags

Genomics NGS Genesets Intervals GTF BAM Summary

Purpose

This script takes a gtf formatted file, a short reads bam formatted file and computes meta-gene profiles over various annotations derived from the gtf file.

A meta-gene profile is an abstract genomic entity over which reads stored in a bam formatted file have been counted. A meta-gene might be an idealized eukaryotic gene (upstream, exonic sequence, downstream) or any other genomic landmark of interest such as transcription start sites.

The script can be used to visualize binding profiles of a chromatin mark in gene bodies, binding of transcription factors in promotors or sequencing bias (e.g. 3’ bias) in RNA-Seq data.

This script is designed with a slight emphasis on RNA-Seq datasets. For example, it takes care of spliced reads, by using the CIGAR string in the BAM file to accurately define aligned bases (if the –base-accurate is specified, currently this feature is turned off by default for speed reasons).

Alternatively, for the purpose of visualizing binding profiles of transcription factor ChIP-Seq without the need to use any genomic annotations (ENSEMBL, or refseq), you may also consider using bam2peakshape.py - compute peak shape features from a bam-file, which is designed with a slight emphasis on Chip-Seq datasets. For example, bam2peakshape.py - compute peak shape features from a bam-file is able to center the counting window to the summit of every individual peak. bam2peakshape.py - compute peak shape features from a bam-file is also able to: (1) plot the control ChIP-Seq library to enable side-by-side comparison; (2) randomize the given regions to provide a semi-control.

Usage

Quick start examples

The following command will generate the gene profile plot similar to Fig 1(a) in the published cgat paper, but using a test dataset that is much smaller and simpler than the dataset used for publishing the cgat paper.

python ./scripts/bam2geneprofile.py
    --bam-file=./tests/bam2geneprofile.py/multipleReadsSplicedOutAllIntronsAndSecondExon.bam
    --gtf-file=./tests/bam2geneprofile.py/onegeneWithoutAnyCDS.gtf.gz
    --method=geneprofile
    --reporter=gene

In the following, a slightly more involved example will use more features of this script. The following command generate the gene profile showing base accuracy of upstream (500bp), exons, introns and downstream(500bp) of a gene model from some user supplied RNA-Seq data and geneset.

python ./scripts/bam2geneprofile.py
    --bam-file=./rnaseq.bam
    --gtf-file=./geneset.gtf.gz
    --method=geneprofilewithintrons
    --reporter=gene
    --extension-upstream=500
    --resolution-upstream=500
    --extension-downstream=500
    --resolution-downstream=500

The output will contain read coverage over genes. The profile will contain four separate segments:

  1. the upstream region of a gene ( set to be 500bp ), (--extension-upstream=500).

  2. the transcribed region of a gene. The transcribed region of every gene will be scaled to 1000 bp (default), shrinking longer transcripts and expanding shorter transcripts.

  3. the intronic regions of a gene. These will be scaled to 1000b (default).

  4. the downstream region of a gene (set to be 500bp), (--extension-downstream=500).

Detailed explaination

The bam2geneprofile.py script reads in a set of transcripts from a gtf formatted file. For each transcript, overlapping reads from the provided bam file are collected. The counts within the transcript are then mapped onto the meta-gene structure and counts are aggregated over all transcripts in the gtf file.

Bam files need to be sorted by coordinate and indexed.

A meta-gene structure has two components - regions of variable size, such as exons, introns, etc, which nevertheless have a fixed start and end coordinate in a transcript. The other component are regions of fixed width, such a regions of a certain size upstream or downstream of a landmark such as a transcription start site.

The size of the former class, regions of variable size, can be varied with --resolution options. For example, the option --resolution-upstream-utr=1000 will create a meta-gene with a 1000bp upstream UTR region. UTRs that are larger will be compressed, and UTRs that are smaller, will be stretched to fit the 1000bp meta-gene UTR region.

The size of fixed-width regions can be set with --extension options. For example, the options --extension-upstream will set the size of the uptsream extension region to 1000bp. Note that no scaling is required when counting reads towards the fixed-width meta-gene profile.

Type:

python bam2geneprofile.py --help

for command line help.

Options

The script provides a variety of different meta-gene structures i.e. geneprofiles, selectable via using the option: (--method).

Profiles

Different profiles are accessible through the --method option. Multiple methods can be applied at the same time. While upstream and downstream typically have a fixed size, the other regions such as CDS, UTR will be scaled to a common size.

utrprofile

UPSTREAM - UTR5 - CDS - UTR3 - DOWNSTREAM gene models with UTR. Separate the coding section from the non-coding part.

geneprofile

UPSTREAM - EXON - DOWNSTREAM simple exonic gene models

geneprofilewithintrons

UPSTREAM - EXON - INTRON - DOWNSTREAM

gene models containing also intronic sequence, only correct if used with --use-base-accuracy option.

separateexonprofile

UPSTREAM - FIRST EXON - EXON - LAST EXON - DOWNSTREAM

gene models with the first and last exons separated out from all other exons. Only applicable to gene models with > 1 exons.

separateexonprofilewithintrons

UPSTREAM - FIRST EXON - EXON - INTRON - LAST EXON - DOWNSTREAM

gene models with first and last exons separated out, and includes all introns together. Excludes genes with < 2 exons and no introns.

geneprofileabsolutedistancefromthreeprimeend

UPSTREAM - EXON (absolute distance, see below) - INTRON (absolute distance, see below) - DOWNSTREAM (the downstream of the exons) region, the script counts over the mRNA transcript only, skipping introns. Designed to visualize the 3 prime bias in RNASeq data, only correct if used together with --use-base-accuracy option.

absolute distance: In order to to visualize the 3 prime bias, genes are not supposed to be streched to equal length as it did in all other counting methods. In this counting method, we first set a fix length using --extension-exons-absolute-distance-topolya, the script will discard genes shorter than this fixed length. For genes (when all the exons stitched together) longer than this fixed length, the script will only count over this fixed length ( a absolute distance ) from three prime end, instead of compress the longer genes. Same goes for absolute distance intron counting.

tssprofile

UPSTREAM - DOWNSTREAM transcription start/stop sites

intervalprofile

UPSTREAM - INTERVAL - DOWNSTREAM Similar to geneprofile, but count over the complete span of the gene (including introns).

midpointprofile

UPSTREAM - DOWNSTREAM aggregate over midpoint of gene model

Normalization

Normalization can be applied in two stages of the computation.

Count vector normalization

Before adding counts to the meta-gene profile, the profile for the individual transcript can be normalized. Without normalization, highly expressed genes will contribute more to the meta-gene profile than lowly expressed genes. Normalization can assure that each gene contributes an equal amount.

Normalization is applied to the vector of read counts that is computed for each transcript. Normalization can be applied for the whole transcript (total) or on a per segment basis depending on the counter. For example, in the gene counter, exons, upstream and downstream segments can be normalized independently.

Counts can be normalized either by the maximum or the sum of all counts in a segment or across the whole transcript. Normalization is controlled with the command line option --normalize-trancript. Its arguments are:

  • none: no normalization

  • sum: sum of counts within a region (exons, upstream, …). The area under the curve will sum to 1 for each region.

  • max: maximum count within a region (exons,upstream, …).

  • total-sum: sum of counts across all regions. The area under the curve will sum to 1 for the complete transcript.

  • total-max: maximum count across all regions.

The options above control the contribution of individual transcripts to a meta-gene profile and are thus suited for example for RNA-Seq data.

The options above do not control for different read-depths or any local biases. To compare meta-gene profiles between samples, additional normalization is required.

Meta-gene profile normalization

To enable comparison between experiments, the meta-gene profile itself can be normalized. Normalization a profile can help comparing the shapes of profiles between different experiments independent of the number of reads or transcripts used in the construction of the meta-gene profile.

Meta-gene profile normalization is controlled via the --normalize-profile option. Possible normalization are:

  • none: no normalization

  • area: normalize such that the area under the meta-gene profile is 1.

  • counts: normalize by number of features (genes,tss) that have been counted.

  • background: normalize with background (see below).

A special normalization is activated with the background option. Here, the counts at the left and right most regions are used to estimate a background level for each transcript. The counts are then divided by this background-level. The assumption is that the meta-gene model is computed over a large enough area to include genomic background.

Genes versus transcripts

The default is to collect reads on a per-transcript level. Alternatively, the script can merge all transcripts of a gene into a single virtual transcript. Note that this virtual transcript might not be a biologically plausible transcript. It is usually better to provide bam2geneprofile.py with a set of representative transcripts per gene in order to avoid up-weighting genes with multiple transcripts.

Control

If control files (chip-seq input tracks) are supplied, counts in the control file can be used to compute a fold-change.

Bed and wiggle files

The densities can be computed from bed or wiggle formatted files. If a bed formatted file is supplied, it must be compressed with and indexed with tabix.

Note

Paired-endedness is ignored. Both ends of a paired-ended read are treated individually.

Command line options

usage: bam2geneprofile [-h] [--version]
                       [-m {geneprofile,tssprofile,utrprofile,intervalprofile,midpointprofile,geneprofilewithintrons,geneprofileabsolutedistancefromthreeprimeend,separateexonprofile,separateexonprofilewithintrons}]
                       [-b BAM] [-c BAM] [-g GTF]
                       [--normalize-transcript {none,max,sum,total-max,total-sum}]
                       [--normalize-profile {all,none,area,counts,background}]
                       [-r {gene,transcript}] [-i SHIFTS] [-a] [-u]
                       [-e EXTENDS]
                       [--resolution-upstream RESOLUTION_UPSTREAM]
                       [--resolution-downstream RESOLUTION_DOWNSTREAM]
                       [--resolution-upstream-utr RESOLUTION_UPSTREAM_UTR]
                       [--resolution-downstream-utr RESOLUTION_DOWNSTREAM_UTR]
                       [--resolution-cds RESOLUTION_CDS]
                       [--resolution-first-exon RESOLUTION_FIRST]
                       [--resolution-last-exon RESOLUTION_LAST]
                       [--resolution-introns RESOLUTION_INTRONS]
                       [--resolution-exons-absolute-distance-topolya RESOLUTION_EXONS_ABSOLUTE_DISTANCE_TOPOLYA]
                       [--resolution-introns-absolute-distance-topolya RESOLUTION_INTRONS_ABSOLUTE_DISTANCE_TOPOLYA]
                       [--extension-exons-absolute-distance-topolya EXTENSION_EXONS_ABSOLUTE_DISTANCE_TOPOLYA]
                       [--extension-introns-absolute-distance-topolya EXTENSION_INTRONS_ABSOLUTE_DISTANCE_TOPOLYA]
                       [--extension-upstream EXTENSION_UPSTREAM]
                       [--extension-downstream EXTENSION_DOWNSTREAM]
                       [--extension-inward EXTENSION_INWARD]
                       [--extension-outward EXTENSION_OUTWARD]
                       [--scale-flank-length SCALE_FLANKS]
                       [--control-factor CONTROL_FACTOR]
                       [--output-all-profiles]
                       [--counts-tsv-file INPUT_FILENAME_COUNTS]
                       [--background-region-bins BACKGROUND_REGION_BINS]
                       [--output-res RESOLUTION_IMAGES]
                       [--image-format IMAGE_FORMAT] [--timeit TIMEIT_FILE]
                       [--timeit-name TIMEIT_NAME] [--timeit-header]
                       [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                       [--log-config-filename LOG_CONFIG_FILENAME]
                       [--tracing {function}] [-? ?]
                       [-P OUTPUT_FILENAME_PATTERN] [-F] [-I STDIN]
                       [-L STDLOG] [-E STDERR] [-S STDOUT]
bam2geneprofile: error: argument -?: expected one argument