bam2bam.py - modify bam files¶
Purpose¶
This script reads a bam formatted file from stdin, performs an action (see methods below) then outputs a modified bam formatted file on stdout.
Note
You need to redirect logging information to a file (via -L) or turn it off via -v 0 in order to get a valid sam/bam file.
Documentation¶
The script implements the following methods:
set-nh
set the NH flag. Some tools (bowtie, bwa) do not set the NH flag. If set, this option will set the NH flag (for mapped reads). This option requires the bam/sam file to be sorted by read name.
unset-unmapped_mapq
some tools set the mapping quality of unmapped reads. This causes a violation in the Picard tools.
filter
remove alignments based on a variety of flags. The filtering method is determined by the
--filter-method
option. These may beunique
,non-unique
,mapped
,NM
orCM
. Ifunique
is set, only uniquely mapping reads will be output. Ifnon-unique
is set then only multi-mapping reads will be output. This method first checks for the NH flag - if set, a unique match should have at most NH=1 hits. If not set, the method checks for BWA flags. Currently it checks if X0 is set (X0=Number of best hits found by BWA). Ifmapped
is given, unmapped reads will be removed. IfNM
orCM
is set, the alignment of reads in two sam files (input and reference) is compared and only reads with a lower number of mismatches in the input compared to the reference sam file will be kept. IfCM
is set, the colourspace mismatch tag (for ABI Solid reads) will be used to count differences to the reference sam file. By default, theNM
(number of mismatches) tag is used. The tag that is used needs to present in both input sam file and the reference sam file. Ifunique
is given this wil NOT remove any unmapped reads. This can be achieved by providing thefilter
option twice, once each withmapped
andunique
.Note
The filter methods can’t currently combined with any of the other methods - this is work in progress.
strip-sequence
remove the sequence from all reads in a bam-file. Note that stripping the sequence will also remove the quality scores. Stripping is not reversible if the read names are not unique.
strip-quality
remove the quality scores from all reads in a bam-file. Stripping is not reversible if the read names are not unique.
set-sequence
set the sequence and quality scores in the bam file to some dummy values (‘A’ for sequence, ‘F’ for quality which is a valid score in most fastq encodings. Necessary for some tools that can not work with bam-files without sequence.
unstrip
add sequence and quality scores back to a bam file. Requires a fastq formatted file with the sequences and quality scores to insert.
unset-unmapped-mapq
sets the mapping quality of unmapped reads to 0.
keep-first-base
keep only the first base of reads so that read counting tools will only consider the first base in the counts
downsample-single
downsample-paired
add-sequence-error
add a certain amount of random error to read sequences. This method picks a certain proportion of positions within a read’s sequence and alters the nucleotide to a randomly chosen alternative. The model is naive and applies uniform probabilities for positions and nucleotides. The method does not update base qualities, the alignment and the NM flag. As a result, error rates that are computed via the NM flag will be unaffected. The error rate is set by –error-rate.
By default, the script works from stdin and outputs to stdout.
Usage¶
For example:
cgat bam2bam --method=filter --filter-method=mapped < in.bam > out.bam
will remove all unmapped reads from the bam-file.
Example for running downsample:
cgat bam2bam –method=downsample-paired –downsample=30000 –randomseed=1 -L out.log < Paired.bam > out.bam
Type:
cgat bam2bam --help
for command line help.
Command line options¶
usage: bam2bam [-h] [--version]
[-m {filter,keep-first-base,set-nh,set-sequence,strip-sequence,strip-quality,unstrip,unset-unmapped-mapq,downsample-single,downsample-paired,add-sequence-error}]
[--strip-method {all,match}]
[--filter-method {NM,CM,mapped,unique,non-unique,remove-list,keep-list,error-rate,min-read-length,min-average-base-quality}]
[--reference-bam-file REFERENCE_BAM] [--force-output]
[--output-sam] [--first-fastq-file FASTQ_PAIR1]
[--second-fastq-file FASTQ_PAIR2] [--downsample DOWNSAMPLE]
[--filename-read-list FILENAME_READ_LIST]
[--error-rate ERROR_RATE]
[--minimum-read-length MINIMUM_READ_LENGTH]
[--minimum-average-base-quality MINIMUM_AVERAGE_BASE_QUALITY]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]
bam2bam: error: argument -?: expected one argument