Fastq.py - methods for dealing with fastq files

This module provides an iterator over fastq-formatted files (iterate()). Additional iterators guess the quality score format (iterate_guess()) or convert the quality scores to another format (iterate_convert()) while iterating through a file.

guessFormat() inspects a fastq file to guess the quality score format. getOffset() returns the numeric offset used to convert quality scores for a particular quality score format.

Note

Another way to access the information in fastq formatted files is through pysam.

Reference

class Fastq.Record(identifier, seq, quals, format=None)

Bases: object

A record representing a fastq formatted record.

identifier

Sequence identifier

Type

string

seq

Sequence

Type

string

quals

String representation of quality scores.

Type

string

format

Quality score format. Can be one of sanger, illumina-1.8, solexa or phred64.

Type

string

guessFormat()

return quality score format - might return several if ambiguous.

guessDataType()

return the datatype, determined by inspecting the sequence for base calls or colourspace integers.

trim(trim3, trim5=0)

remove nucleotides/quality scores from the 3’ and 5’ ends.

trim5(trim5=0)

remove nucleotides/quality scores from the 5’ end.
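The trimming described above can be pictured as slicing the sequence and quality strings in step. The following is a minimal sketch, not the module's code: trim_read is a hypothetical helper that works on plain strings rather than on a Record.

```python
def trim_read(seq, quals, trim3, trim5=0):
    """Drop trim5 characters from the start and trim3 from the end of both strings."""
    # Compute the end index explicitly: seq[:-0] would return an empty string.
    end = max(len(seq) - trim3, 0)
    return seq[trim5:end], quals[trim5:end]


seq, quals = trim_read("ACGTACGT", "IIIIHHHH", trim3=2, trim5=1)
print(seq, quals)  # CGTAC IIIHH
```

Slicing both strings with the same indices keeps each base aligned with its quality score.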

toPhred()

return qualities as a list of phred-scores.

fromPhred(quals, format)

set qualities from a list of phred-scores.
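To illustrate what converting between a quality string and a list of phred scores involves, here is a small sketch. It assumes a Phred-scaled encoding with a fixed ASCII offset (33 and 64 are the conventional values); to_phred and from_phred are hypothetical free functions, not the methods above, and the sketch ignores the different scale used by solexa scores.

```python
def to_phred(quals, offset=33):
    """Decode a quality string into a list of integer scores."""
    return [ord(char) - offset for char in quals]


def from_phred(scores, offset=33):
    """Encode a list of integer scores as a quality string."""
    return "".join(chr(score + offset) for score in scores)


scores = to_phred("I#5")
print(scores)                         # [40, 2, 20]
print(from_phred(scores, offset=64))  # hBT
```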

Fastq.iterate(infile)

iterate over contents of fastq file.
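For readers unfamiliar with the file layout, iterating over a fastq file amounts to reading four lines per record. The sketch below is an assumption-laden stand-in rather than the module's implementation: read_fastq and FastqEntry are hypothetical names, and records are assumed to be unwrapped (exactly one sequence line and one quality line each).

```python
import io
from collections import namedtuple

# Hypothetical stand-in for the Record class documented above.
FastqEntry = namedtuple("FastqEntry", "identifier seq quals")


def read_fastq(handle):
    """Yield one FastqEntry per four-line record from a file-like object."""
    while True:
        header = handle.readline()
        if not header:           # end of file
            return
        if not header.strip():   # skip blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record: %r" % header)
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        quals = handle.readline().rstrip("\r\n")
        if not plus.startswith("+"):
            raise ValueError("expected '+' separator line: %r" % plus)
        if len(seq) != len(quals):
            raise ValueError("sequence and quality lengths differ")
        yield FastqEntry(header[1:].rstrip("\r\n"), seq, quals)


for entry in read_fastq(io.StringIO("@r1\nACGT\n+\nIIII\n")):
    print(entry.identifier, entry.seq, entry.quals)  # r1 ACGT IIII
```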

Fastq.iterate_guess(infile, max_tries=10000, guess=None)

iterate over contents of fastq file.

The quality score format is guessed from the first max_tries entries and is then set on each entry.

Parameters
  • infile (File) – File or file-like object to iterate over

  • max_tries (int) – Number of records to examine for guessing the quality score format.

  • guess (string) – Default format. This format will be chosen if the quality score format is ambiguous. The method checks if the guess is compatible with the records read so far.

Yields

fastq – An object of type Record.

Raises

ValueError – If the ranges of the fastq records are not compatible, are incompatible with guess or are ambiguous.
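One way to picture the guess-then-iterate behaviour is to buffer the first max_tries records, settle on a format, and then replay the buffer before streaming the rest. The sketch below shows that pattern only and is not the module's code: iterate_tagged, resolve_format and the candidates_for callback are hypothetical, and raising when guess conflicts with a single remaining candidate is a design choice of this sketch.

```python
from itertools import chain, islice


def resolve_format(candidates, guess=None):
    """Pick one format from the compatible candidates, using guess to break ties."""
    if not candidates:
        raise ValueError("no quality score format is compatible with the records")
    if guess is not None and guess not in candidates:
        raise ValueError("guess %r is incompatible with the records" % guess)
    if len(candidates) > 1 and guess is None:
        raise ValueError("ambiguous quality score format: %s" % ", ".join(candidates))
    return guess if len(candidates) > 1 else candidates[0]


def iterate_tagged(records, candidates_for, max_tries=10000, guess=None):
    """Yield (identifier, seq, quals, format) tuples after guessing the format once."""
    records = iter(records)
    head = list(islice(records, max_tries))  # records used for guessing
    fmt = resolve_format(candidates_for(head), guess)
    for identifier, seq, quals in chain(head, records):  # replay head, then the rest
        yield identifier, seq, quals, fmt


# Toy stand-in for a real guesser: it always reports two compatible formats.
def toy_candidates(head):
    return ["solexa", "phred64"]


reads = [("r1", "ACGT", "hhhh"), ("r2", "ACGT", "hhhh")]
for item in iterate_tagged(reads, toy_candidates, max_tries=1, guess="phred64"):
    print(item)
```

Buffering means no record is lost to the guessing step, at the cost of holding up to max_tries records in memory.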

Fastq.iterate_convert(infile, format, max_tries=10000, guess=None)

iterate over contents of fastq file.

The quality score format is guessed and all subsequent records are converted to format.

Parameters
  • infile (File) – File or file-like object to iterate over

  • format (string) – Quality score format to convert all records into.

  • max_tries (int) – Number of records to examine for guessing the quality score format.

  • guess (string) – Default format. This format will be chosen if the quality score format is ambiguous. The method checks if the guess is compatible with the records read so far.

Yields

fastq – An object of type Record.

Raises

ValueError – If the ranges of the fastq records are not compatible, are incompatible with guess or are ambiguous.
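As an illustration of what converting a quality string between formats can involve, the sketch below re-encodes phred64 and solexa strings as sanger. It is a sketch under stated assumptions, not the module's conversion code: to_sanger and solexa_to_phred are hypothetical names, the offsets 64 and 33 are the conventional ones, and solexa scores are mapped with the standard formula Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).

```python
import math


def solexa_to_phred(score):
    """Map a Solexa-scaled score to the nearest Phred-scaled integer."""
    return int(round(10 * math.log10(10 ** (score / 10.0) + 1)))


def to_sanger(quals, fmt):
    """Re-encode a phred64 or solexa quality string with the sanger offset (33)."""
    if fmt == "phred64":
        scores = [ord(char) - 64 for char in quals]
    elif fmt == "solexa":
        scores = [solexa_to_phred(ord(char) - 64) for char in quals]
    else:
        raise ValueError("unsupported source format: %r" % fmt)
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not valid for format %r" % fmt)
    return "".join(chr(score + 33) for score in scores)


print(to_sanger("hh@", "phred64"))  # II!
```

For Phred-scaled formats the conversion is a constant shift of character codes. Only solexa needs the nonlinear mapping, because its scores can be negative.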

Fastq.guessFormat(infile, max_lines=10000, raises=True)

guess format of FASTQ File.

Parameters
  • infile (File) – File or file-like object to iterate over

  • max_lines (int) – Number of lines to examine for guessing the quality score format.

  • raises (bool) – Raise ValueError if format is ambiguous

Returns

formats – list of quality score formats compatible with the file

Return type

list

Raises

ValueError – If the ranges of the fastq records are not compatible.
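Guessing a format from quality score ranges can be sketched as follows: track the lowest and highest character codes seen on the quality lines and keep every format whose range covers both. The ASCII ranges in ASSUMED_RANGES are commonly quoted values and are an assumption here, not taken from the module. guess_formats is a hypothetical name, and the sketch assumes unwrapped four-line records with no blank lines.

```python
import io

# Assumed inclusive ASCII ranges per format; the module's own table may differ.
ASSUMED_RANGES = {
    "sanger": (33, 73),
    "illumina-1.8": (33, 74),
    "solexa": (59, 104),
    "phred64": (64, 104),
}


def guess_formats(handle, max_lines=10000, raises=True):
    """Return the formats whose assumed range covers every quality character seen."""
    low, high = float("inf"), float("-inf")
    for lineno, line in enumerate(handle):
        if lineno >= max_lines:
            break
        if lineno % 4 != 3:  # the quality string is the fourth line of each record
            continue
        codes = [ord(char) for char in line.rstrip("\r\n")]
        if codes:
            low, high = min(low, min(codes)), max(high, max(codes))
    formats = [name for name, (lo, hi) in ASSUMED_RANGES.items()
               if lo <= low and high <= hi]
    if not formats:
        raise ValueError("quality scores are not compatible with any known format")
    if raises and len(formats) > 1:
        raise ValueError("ambiguous format: %s" % ", ".join(formats))
    return formats


print(guess_formats(io.StringIO("@r1\nACGT\n+\n#5IJ\n")))  # ['illumina-1.8']
```

High-quality reads often fall in the overlap between several ranges, which is why ambiguity has to be handled explicitly.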

Fastq.guessDataType(infile, max_lines=10000, raises=True)

guess datatype of FASTQ File from [colourspace, basecalls]

Parameters
  • infile (File) – File or file-like object to iterate over

  • max_lines (int) – Number of lines to examine for guessing the datatype

  • raises (bool) – Raise ValueError if format is ambiguous

Returns

formats – list of datatypes compatible with the file (should only ever be one!)

Return type

list

Raises

ValueError – If the ranges of the fastq records are not compatible.
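The distinction between the two datatypes can be sketched for a single sequence. This is an illustration under assumptions rather than the module's logic: colourspace reads are assumed to be an optional leading primer base followed by the digits 0 to 3 (with '.' for missing calls), base calls are assumed to use only A, C, G, T and N, and guess_datatype is a hypothetical name.

```python
def guess_datatype(seq):
    """Classify one read as 'colourspace' or 'basecalls' from its characters."""
    if not seq:
        raise ValueError("empty sequence")
    # Colourspace reads usually start with a primer base, then colour digits.
    body = seq[1:] if seq[:1].isalpha() else seq
    if body and set(body) <= set("0123."):
        return "colourspace"
    if set(seq.upper()) <= set("ACGTN"):
        return "basecalls"
    raise ValueError("sequence matches neither datatype: %r" % seq)


print(guess_datatype("T0120.3"))  # colourspace
print(guess_datatype("ACGTN"))    # basecalls
```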

Fastq.getOffset(format, raises=True)

returns the ASCII offset for a certain format.

If raises is set, a ValueError is raised when there is not a single offset. If raises is not set, the minimum offset is returned.

Returns

offset – The quality score offset

Return type

int
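The behaviour of the raises flag can be sketched with a small lookup table. The offsets below are the conventional ones for each format and are an assumption, as is the choice to accept either one format name or a list of names; get_offset is a hypothetical name, not the function documented above.

```python
# Assumed ASCII offsets per format (conventional values).
ASSUMED_OFFSETS = {
    "sanger": 33,
    "illumina-1.8": 33,
    "solexa": 64,
    "phred64": 64,
}


def get_offset(formats, raises=True):
    """Return the offset shared by the given formats, or the minimum if they differ."""
    if isinstance(formats, str):
        formats = [formats]
    offsets = {ASSUMED_OFFSETS[name] for name in formats}
    if len(offsets) > 1 and raises:
        raise ValueError("formats do not share a single offset: %s" % ", ".join(formats))
    return min(offsets)


print(get_offset(["sanger", "illumina-1.8"]))           # 33
print(get_offset(["sanger", "phred64"], raises=False))  # 33
```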

Fastq.getReadLength(filename)

return the read length from a fastq file.

Only the first read is inspected. If there are different read lengths in the file, the result will be inaccurate.

Returns

read_length

Return type

int
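Reading the length from the first record only can be sketched as below, which also shows why mixed read lengths give a misleading answer. first_read_length is a hypothetical helper that assumes an uncompressed, unwrapped fastq file; it is not the module's implementation.

```python
import os
import tempfile


def first_read_length(filename):
    """Return the length of the first sequence line in a fastq file."""
    with open(filename) as handle:
        handle.readline()  # skip the '@' header line
        return len(handle.readline().strip())


# Write a small file whose two reads have different lengths.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@r1\nACGTAC\n+\nIIIIII\n@r2\nACG\n+\nIII\n")

print(first_read_length(tmp.name))  # 6, although the second read has length 3
os.remove(tmp.name)
```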