compute stats from a bam-file

Purpose

This script takes a bam file as input and computes a few metrics by iterating over the file. The metrics output are:

Category

Content

total

total number of alignments in bam file

alignments_mapped

alignments mapped to a chromosome (bam flag)

alignments_unmapped

alignments unmapped (bam flag)

qc_fail

alignments failing QC (bam flag)

mate_unmapped

alignments in which the mate is unmapped (bam flag)

reverse

alignments in which read maps to reverse strand (bam flag)

mate_reverse

alignments in which mate maps to reverse strand (bam flag)

proper_pair

alignments in which both pairs have been mapped properly (according to the mapper) (bam flag)

read1

alignments for 1st read of pair (bam flag)

paired

alignments of reads that are paired (bam flag)

duplicate

read is PCR or optical duplicate (bam flag)

read2

alignment is for 2nd read of pair (bam flag)

secondary

alignment is not primary alignment

alignments_duplicates

number of alignments mapping to the same location

alignments_unique

number of alignments mapping to unique locations

reads_total

number of reads in file. Either given via –num-reads or deduc ed as the sum of mappend and unmapped reads

reads_mapped

number of reads mapping in file. Derived from the total number o f alignments and removing counts for multiple matches. Requires the NH flag to be set correctly.

reads_unmapped

number of reads unmapped in file. Assumes that there is only one entry per unmapped read.

reads_missing

number of reads missing, if number of reads given by –input-rea ds. Otherwise 0.

pairs_total

number of total pairs - this is the number of reads_total divided by two. If there were no pairs, pairs_total will be 0.

pairs_mapped

number of mapped pairs - this is the same as the number of proper pairs.

Additionally, the script outputs histograms for the following tags and scores.

  • NM: number of mismatches in alignments.

  • NH: number of hits of reads.

  • mapq: mapping quality of alignments.

Supplying a fastq file

If a fastq file is supplied (--fastq-file), the script will compute some additional summary statistics. However, as it builds a dictionary of all sequences, it will also require a good amount of memory. The additional metrics output are:

Category

Content

pairs_total

total number of pairs in input data

pairs_mapped

pairs in which both reads map

pairs_unmapped

pairs in which neither read maps

pairs_proper_unique

pairs which are proper and map uniquely.

pairs_incomplete_unique

pairs in which one of the reads maps uniquely, but the other does not map.

pairs_incomplete_multimapping

pairs in which one of the reads maps uniquely, but the other maps to multiple locations.

pairs_proper_duplicate

pairs which are proper and unique, but marked as duplicates.

pairs_proper_multimapping

pairs which are proper, but map to multiple locations.

pairs_not_proper_unique

pairs mapping uniquely, but not flagged as proper

pairs_other

pairs not in any of the above categories

Note that for paired-end data, any  or /2 suffixes will be removed from the read name in the assumption that these have been removed in the bam file as well.

Usage

Example:

python bam2stats.py in.bam

This command will generate various statistics based on the supplied BAM file, such as percentage reads mapped and percentage reads mapped in pairs. The output looks like this:

category

counts

percent

of

alignments_total

32018

100.00

alignments_total

alignments_mapped

32018

100.00

alignments_total

alignments_unmapped

0

0.00

alignments_total

alignments_qc_fail

0

0.00

alignments_mapped

alignments_mate_unmapped

241

0.75

alignments_mapped

alignments_reverse

16016

50.02

alignments_mapped

alignments_mate_reverse

15893

49.64

alignments_mapped

alignments_proper_pair

30865

96.40

alignments_mapped

alignments_read1

16057

50.15

alignments_mapped

alignments_paired

32018

100.00

alignments_mapped

alignments_duplicate

0

0.00

alignments_mapped

alignments_read2

15961

49.85

alignments_mapped

alignments_secondary

0

0.00

alignments_mapped

alignments_filtered

31950

99.79

alignments_mapped

reads_total

34250

100.00

reads_total

reads_unmapped

0

0.00

reads_total

reads_mapped

32018

93.48

reads_total

reads_missing

2232

6.52

reads_total

reads_mapped_unique

32018

100.00

reads_mapped

reads_multimapping

0

0.00

reads_mapped

pairs_total

17125

100.00

pairs_total

pairs_mapped

17125

100.00

pairs_total

pairs_unmapped

0

0.00

pairs_total

pairs_proper_unique

14880

86.89

pairs_total

pairs_incomplete_unique

2232

13.03

pairs_total

pairs_incomplete_multimapping

0

0.00

pairs_total

pairs_proper_duplicate

0

0.00

pairs_total

pairs_proper_multimapping

0

0.00

pairs_total

pairs_not_proper_unique

13

0.08

pairs_total

pairs_other

0

0.00

pairs_total

read1_total

17125

100.00

read1_total

read1_unmapped

0

0.00

read1_total

read1_mapped

16057

93.76

read1_total

read1_mapped_unique

16057

100.00

read1_mapped

reads_multimapping

0

0.00

read1_mapped

read1_missing

1068

6.65

read1_total

read2_total

17125

100.00

read2_total

read2_unmapped

0

0.00

read2_total

read2_mapped

15961

93.20

read2_total

read2_mapped_unique

15961

100.00

read2_mapped

reads_multimapping

0

0.00

read2_mapped

read2_missing

1164

7.29

read2_total

The first column contains the caterogy, the second the number of counts and the third a percentage. The fourth column denotes the denomiminator that was used to compute the percentage. In the table above, wee see that 16,057 first reads in a pair map and 15,961 second reads in pair map, resulting in 14,880 proper uniquely mapped pairs.

Type:

cgat bam2stats --help

for command line help.

Bam2stats can read from standard input:

cat in.bam | python bam2stats.py -

Documentation

Reads are not counted via read name, but making use of NH and HI flags when present. To recap, NH is the number of reported alignments that contain the query in the current record, while HI is the hit index and ranges from 0 to NH-1.

Unfortunately, not all aligners follow this convention. For example, gsnap seems to set NH to the number of reportable alignments, while the actual number of reported alignments in the file is less. Thus, if the HI flag is present, the maximum HI is used to correct the NH flag. The assumption is, that the same reporting threshold has been used for all alignments.

If no NH flag is present, it is assumed that all reads have only been reported once.

Multi-matching counts after filtering are really guesswork. Basically, the assumption is that filtering is consistent and will tend to remove all alignments of a query.

The error rates are computed using the following key:

substitution_rate

Number of mismatches divided by number of aligned bases. This is the same as the samtools stats error rate.

insertion_rate

Number of deletions in the read/insertions in the reference divided by the number of aligned bases.

deletion_rate

Number of insertions in the read/deletions in the reference divided by the number of aligned bases.

error_rate

Number of mismatches and deletions in the read divided by the number of aligned bases.

coverage

Percentage of bases aligned divided by read length.

The following graphic illustrates the computation. A . signifies a position that is included in the metric with X being an error:

AAAAACAAAA AAAAAAAA   Reference
 AAAAAAAAAAAA AAA     Read
 ....X.... ......     substitution_rate NM / (CMATCH + CINS)
 .........X.. ...     insertion_rate CINS / (CMATCH + CINS)
 ......... ..X...     deletion_rate CDEL / (CMATCH + CDEL)
 ....X.... ..X...     error_rate NM / (CMATCH + CINS) (corresponds to samtools stats)
 .........X.. ...     match_rate CMATCH / (CMATCH + CINS)
 ....X.... .. ...     mismatch_rate NM / (CMATCH) (1 - percent_identity/100)

With CINS: Insertion into the reference (consumes read, but not reference) and CDEL=Deletion from the reference (consumes reference, but not read).

Command line options

usage: bam2stats [-h] [--version] [-r GFF] [-f] [-i INPUT_READS] [-d]
                 [--output-readmap] [--add-alignment-details]
                 [-q FILENAME_FASTQ] [--basic-counts] [--timeit TIMEIT_FILE]
                 [--timeit-name TIMEIT_NAME] [--timeit-header]
                 [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                 [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]
bam2stats: error: argument -?: expected one argument