fastq2fastq.py - manipulate fastq files

Tags

Genomics NGS Sequences FASTQ Manipulation

Purpose

This script manipulates fastq formatted files. For example, it can change the quality score format or sample a subset of reads.

The script is used predominantly to manipulate single fastq files. However, the sample and sort methods also accept paired data through the --pair-fastq-file and --output-filename-pattern options.

Usage

Example::

In this example we randomly sample 50% of reads from paired data provided in two fastq files.

head in.fastq.1

@SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
=@A@9@BAB@;@BABA?=;@@BB<A@9@;@2>@;??
@SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCC
+SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
=B@9@0>A<B=B=AAA?;*(@A>(@<=*9=9@BA>7
@SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
AGTGAGCAGGGAAACAATGTCTGTCTAAGAATTTGA

head in.fastq.2

+SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
<?@BA?;A=@BA>;@@7###################
@SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
ATTAGTATTATCCATTTATATAATCAATAAAAATGT
+SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
?ABBA2CCBBB2?=BB@C>=AAC@A=CBB#######
@SRR111956.5 HWUSI-EAS618:7:1:39:1305 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR111956.5 HWUSI-EAS618:7:1:39:1305 length=36
AA>5;A>*91?=AAA@@BBA<B=?ABA>2>?A<BB@

command-line::
cat in.fastq.1 | python fastq2fastq.py --method=sample --sample-size 0.5 --pair-fastq-file in.fastq.2 --output-filename-pattern out.fastq.2 > out.fastq.1

head out.fastq.1

@SRR111956.1 HWUSI-EAS618:7:1:27:1582 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
=@A@9@BAB@;@BABA?=;@@BB<A@9@;@2>@;??
@SRR111956.2 HWUSI-EAS618:7:1:29:1664 length=36
CCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCC
+
=B@9@0>A<B=B=AAA?;*(@A>(@<=*9=9@BA>7
@SRR111956.3 HWUSI-EAS618:7:1:38:878 length=36
AGTGAGCAGGGAAACAATGTCTGTCTAAGAATTTGA
+
<?@BA?;A=@BA>;@@7###################
@SRR111956.4 HWUSI-EAS618:7:1:38:1783 length=36
ATTAGTATTATCCATTTATATAATCAATAAAAATGT
+
?ABBA2CCBBB2?=BB@C>=AAC@A=CBB#######
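The idea behind sampling paired data can be sketched in a few lines of Python. This is an illustration only, not the script's own code: the helper names `read_fastq` and `sample_pairs` are hypothetical, and the sketch assumes that the sample size is a proportion and that each pair is kept or dropped with a single random draw so that mates stay in sync. Whether fastq2fastq.py selects reads in exactly this way is not stated here.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of fastq lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            # End of input (or an incomplete trailing record).
            return
        header, sequence, _separator, quality = record
        yield header[1:], sequence, quality


def sample_pairs(reads1, reads2, proportion, seed=None):
    """Keep each read pair with probability `proportion` (assumed semantics).

    One random draw is made per pair, so both mates are kept or
    dropped together and the two outputs stay in the same order.
    """
    rng = random.Random(seed)
    for first, second in zip(reads1, reads2):
        if rng.random() < proportion:
            yield first, second
```

Passing a fixed `seed` makes the selection reproducible, which is useful when the same subset has to be regenerated later.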

Options

The following methods are implemented (--method).

change-format

Change the quality format to the format given by --target-format. Options are sanger, solexa, phred64, integer and illumina-1.8.

sample

Sub-sample a fastq file. The size of the sample is set by --sample-size.

unique

Remove duplicate reads based on read name

trim3

Trim a fixed number of nucleotides from the 3' end of reads (see --num-bases). Note that there are better tools for trimming.

trim5

Trim a fixed number of nucleotides from the 5' end of reads (see --num-bases). Note that there are better tools for trimming.

sort

Sort the fastq file by read name.

renumber-reads

Rename the reads based on the pattern given in --pattern-identifier, e.g. --pattern-identifier="read_%010i".
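To make the methods above concrete, the following Python sketch shows what several of them amount to on in-memory records. It is a minimal illustration under stated assumptions, not the script's implementation: all function names are hypothetical, records are assumed to be (name, sequence, quality) tuples, duplicates are assumed to be resolved by keeping the first occurrence, numbering is assumed to start at 1, and only the phred64 to sanger conversion is shown because it is a plain offset change.

```python
def phred64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (sanger)."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality)


def trim3(sequence, quality, nbases):
    """Remove `nbases` positions from the 3' end of a read."""
    end = max(len(sequence) - nbases, 0)
    return sequence[:end], quality[:end]


def trim5(sequence, quality, nbases):
    """Remove `nbases` positions from the 5' end of a read."""
    return sequence[nbases:], quality[nbases:]


def unique_by_name(records):
    """Drop reads whose name was already seen (keeps the first occurrence)."""
    seen = set()
    for record in records:
        name = record[0]
        if name not in seen:
            seen.add(name)
            yield record


def sort_by_name(records):
    """Return the records sorted by read name."""
    return sorted(records, key=lambda record: record[0])


def renumber(records, pattern="read_%010i"):
    """Rename reads with a printf-style pattern (counter assumed to start at 1)."""
    for number, (_name, sequence, quality) in enumerate(records, start=1):
        yield pattern % number, sequence, quality
```

With the pattern `read_%010i`, the counter is zero-padded to ten digits, so names sort consistently as text.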

Type:

python fastq2fastq.py --help

for command line help.

Command line options

usage: fastq2fastq [-h] [--version] [-i INPUT_FASTQ_FILE]
                   [--output-removed-tsv OUTPUT_REMOVED_TSV]
                   [--output-stats-tsv OUTPUT_STATS_TSV]
                   [--output-removed-fastq OUTPUT_REMOVED_FASTQ]
                   [-m {filter-N,filter-identifier,filter-ONT,offset-quality,apply,change-format,renumber-reads,sample,sort,trim3,trim5,unique,reverse-complement,grep}]
                   [--set-prefix SET_PREFIX]
                   [--input-filter-tsv INPUT_FILTER_TSV]
                   [--min-average-quality MIN_AVERAGE_QUALITY]
                   [--min-sequence-length MIN_SEQUENCE_LENGTH]
                   [--quality-offset QUALITY_OFFSET]
                   [--target-format {sanger,solexa,phred64,integer,illumina-1.8}]
                   [--guess-format {sanger,solexa,phred64,integer,illumina-1.8}]
                   [--sample-size SAMPLE_SIZE] [--pair-fastq-file PAIR]
                   [--map-tsv-file MAP_TSV_FILE] [--num-bases NBASES]
                   [--seed SEED] [--pattern-identifier RENUMBER_PATTERN]
                   [--grep-pattern GREP_PATTERN] [--timeit TIMEIT_FILE]
                   [--timeit-name TIMEIT_NAME] [--timeit-header]
                   [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                   [--log-config-filename LOG_CONFIG_FILENAME]
                   [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                   [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]