diff_bam.py - compare multiple bam files against each other

Tags

Genomics NGS BAM Comparison

Purpose

Compare reads in multiple BAM files against each other.

Note

BAM files need to be sorted by read name. samtools sort does NOT work as it uses a custom comparison function (strnum_cmp) that is incompatible with the standard lexicographical order in python. See the example below on how to get sorted files.

This script is for validation purposes. It might take a while for large BAM files.

Usage

If you have two sorted sam or bam formatted files, type:

cgat diff_bam a.bam b.bam > out

If they are not sorted, you can use samtools sort to do an inplace sort:

cgat diff_bam <(samtools view -h a.bam | hsort 0 -k1,1)
               <(samtools view -h b.bam | hsort 0 -k1,1)

The samtools -h option outputs the header, and the hsort command sorts without disturbing the header.

An example output looks like this:

read

nlocations

nmatched

file1_nh

file2_nh

file1_loc

file2_loc

42YKVAAXX_HWI-EAS229_1:1:11:1659:174

1

2

2

2

0,0

0,0

42YKVAAXX_HWI-EAS229_1:1:11:166:1768

1

2

1

1

0

0

612UOAAXX_HWI-EAS229_1:1:97:147:1248

2

2

2

2

0,1

0,1

This reports for each read the number of locations that the read maps to in all files, the number of files that have matches found for the read. Then, for each file, it reports the number of matches and the locations it maps to (coded as integers, 0 the first location, 1 the second, …).

In the example above, the first read maps twice to 1 location in both files. This is a read occuring twice in the input file. The second read maps to the same one location in both files, while the third read maps to the two same locations in both input files.

Type:

python diff_bam.py --help

for command line help.

Documentation

For read counts to be correct the NH flag to be set correctly.

Command line options

usage: diff-bam [-h] [--version] [--header-names HEADERS]
                [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                [--log-config-filename LOG_CONFIG_FILENAME]
                [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]
diff-bam: error: argument -?: expected one argument