Sra.py - Methods for dealing with short read archive files

Utility functions for dealing with SRA formatted files from the Short Read Archive.

Requirements:

  • fastq-dump >= 2.1.7

Code

Sra.peek(sra, outdir=None)

Return the full file names of all files that will be extracted.

Parameters

outdir (path) – Perform the extraction in outdir. If outdir is None, the extraction takes place in a temporary directory, which is deleted afterwards.

Returns

  • files (list) – A list of fastq formatted files that are contained in the archive.

  • format (string) – The quality score format in the fastq formatted files.
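As an illustration of what a quality score format check can look like, the sketch below guesses the encoding offset from the ASCII range of the quality characters. It is not taken from Sra.py: the function name guess_quality_format and the labels 'phred33', 'phred64' and 'ambiguous' are assumptions, and the thresholds are a common heuristic rather than a guarantee.

```python
def guess_quality_format(quality_lines):
    """Guess the fastq quality encoding from the ASCII range of the quality lines.

    Hypothetical helper, not part of Sra.py. Returns 'phred33', 'phred64'
    or 'ambiguous' (labels chosen for this sketch only).
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")

    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) do not occur in offset-64 encodings.
        return "phred33"
    if highest > 74:
        # Heuristic: values above 'J' (ASCII 74) are unusual for offset 33.
        return "phred64"
    # The observed range fits both offsets, so more reads are needed.
    return "ambiguous"


if __name__ == "__main__":
    print(guess_quality_format(["IIII#5"]))
```

Passing only a handful of quality lines can easily give 'ambiguous'; sampling more reads narrows the range.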

Sra.extract(sra, outdir, tool='fastq-dump')

Return a statement for extracting the SRA file into outdir. Possible tools are fastq-dump and abi-dump. Use abi-dump for colorspace data.
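A minimal sketch of how such a statement could be assembled as a shell command string. The helper build_extract_statement and the option choices (--split-files and --gzip for fastq-dump, none for abi-dump) are assumptions made for illustration; the statement that Sra.extract actually returns may differ.

```python
import shlex

# Assumed options per tool; not taken from Sra.py.
TOOL_OPTIONS = {
    "fastq-dump": ["--split-files", "--gzip"],
    "abi-dump": [],
}


def build_extract_statement(sra, outdir, tool="fastq-dump"):
    """Return a shell command string that extracts `sra` into `outdir`.

    Hypothetical helper: it only builds the string and runs nothing.
    """
    if tool not in TOOL_OPTIONS:
        raise ValueError("unsupported tool: %s" % tool)
    parts = [tool] + TOOL_OPTIONS[tool] + ["--outdir", outdir, sra]
    # Quote every part so that paths containing spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    print(build_extract_statement("SRR000001.sra", "out"))
```

Returning a string rather than running the tool leaves execution to the caller, for example a pipeline that submits the statement to a cluster.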

Sra.prefetch(sra)

Use prefetch from the SRA toolkit to download the SRA file to the local cache.

Sra.clean_cache(sra)

Remove the specified SRA file from the cache.

Sra.fetch_ENA(dl_path, outdir, protocol='ascp')

Fetch fastq files from ENA for the given accession.

Sra.fetch_ENA_files(accession)

Get the names of the files matching the ENA accession.

Sra.fetch_TCGA_fastq(acc, filename, token=None, outdir='.')

Get a fastq file from the TCGA repository. Because of the nature of the TCGA repository, it assumes certain things:

  1. That data is paired-end fastq

  2. That the files end in _1.fastq or _2.fastq
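The two assumptions above can be checked before any downstream processing. The sketch below groups file names into mate pairs by their _1.fastq and _2.fastq suffixes; pair_fastq_files is a hypothetical helper, not a function in Sra.py.

```python
import re

# Matches names such as "sample_1.fastq" and "sample_2.fastq".
MATE_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<mate>[12])\.fastq$")


def pair_fastq_files(filenames):
    """Group file names into (mate 1, mate 2) tuples, sorted by shared stem.

    Hypothetical helper. Raises ValueError for a name that does not follow
    the naming convention or for a stem that lacks one of its mates.
    """
    mates = {}
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            raise ValueError("unexpected file name: %s" % name)
        mates.setdefault(match.group("stem"), {})[match.group("mate")] = name

    pairs = []
    for stem in sorted(mates):
        found = mates[stem]
        if set(found) != {"1", "2"}:
            raise ValueError("incomplete pair for: %s" % stem)
        pairs.append((found["1"], found["2"]))
    return pairs


if __name__ == "__main__":
    print(pair_fastq_files(["a_2.fastq", "a_1.fastq"]))
```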

Sra.fetch_TCGA_BAM(acc, token, outdir='.', filter_bed=None)

Get a BAM file from the TCGA repository based on its UUID. Returns the statement and the path/filename of the downloaded file. A BED file may be provided to filter out contigs that are not present in the reference genome.
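To see which contigs a filter BED file would retain, the contig names can be read from its first column. This is a sketch under assumptions: contigs_in_bed is a hypothetical helper, and how Sra.fetch_TCGA_BAM applies filter_bed internally is not shown here.

```python
def contigs_in_bed(lines):
    """Return the distinct contig names in BED-formatted lines, in first-seen order.

    Hypothetical helper, not part of Sra.py. Blank lines, comments and
    "track"/"browser" header lines are skipped.
    """
    contigs = []
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] in ("track", "browser"):
            continue
        contig = fields[0]
        if contig not in seen:
            seen.add(contig)
            contigs.append(contig)
    return contigs


if __name__ == "__main__":
    example = ["track name=example", "chr1\t0\t100", "chr2\t5\t50"]
    print(contigs_in_bed(example))
```

With a file on disk, the same function accepts an open file handle, since it only iterates over lines.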

Sra.process_remote_BAM(infile, token=None, outdir='.', filter_bed=None)

Generate a statement from a .remote file.