diff_fasta.py - compare contents of two fasta files

Tags

Genomics Sequences FASTA Comparison

Purpose

This script takes two sets of fasta sequences and matches the identifiers. It then compares the sequences with the same identifiers and, depending on the output options selected, outputs

  • which sequences are missing

  • which sequences are identical

  • which sequences are prefixes/suffixes of each other

An explanatory field is appended to output sequence identifiers. An explanation of the different field values is provided in the log.

Options

-s, --correct-gap-shift

This option will correct shifts in alignment gaps between two sequences being compared

-1, --pattern1

regular expression pattern to extract identifier from in sequence 1

-2, --pattern2

regular expression pattern to extract identifier from in sequence 2

Depending on the option --output-section the following are output:

diff

identifiers of sequences that are different

seqdiff

identifiers of sequences that are different plus sequence

missed

identifiers of seqences that are missing from one set or the other

This script is of specialized interest and has been used in the past to check if ENSEMBL gene models had been correctly mapped into a database schema.

Usage

Example:

cat a.fasta | head

>ENSACAP00000004922
MRSRNQGGESSSSGKFSKSKPIINTGENQNLQEDAKKKNKSSRKEE ...
>ENSACAP00000005213
EEEEDESNNSYLYQPLNQDPDQGPAAVEETAPSTEPALDINERLQA ...
>ENSACAP00000018122
LIRSSSMFHIMKHGHYISRFGSKPGLKCIGMHENGIIFNNNPALWK ...

python diff_fasta.py --output-section=missed --output-section=seqdiff a.fasta b.fasta

cat diff.out

# Legend:
# seqs1:          number of sequences in set 1
# seqs2:          number of sequences in set 2
# same:           number of identical sequences
# diff:           number of sequences with differences
# nmissed1:       sequences in set 1 that are not found in set 2
# nmissed2:       sequences in set 2 that are not found in set 1
# Type of sequence differences
# first:          only the first residue is different
# last:           only the last residue is different
# prefix:         one sequence is prefix of the other
# selenocysteine: difference due to selenocysteines
# masked:         difference due to masked residues
# fixed:          fixed differences
# other:          other differences

Type:

python diff_fasta.py --help

for command line help.

Command line options

usage: diff-fasta [-h] [--version] [-s] [-1 PATTERN1] [-2 PATTERN2]
                  [-o {diff,missed,seqdiff}] [--timeit TIMEIT_FILE]
                  [--timeit-name TIMEIT_NAME] [--timeit-header]
                  [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                  [--log-config-filename LOG_CONFIG_FILENAME]
                  [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                  [-E STDERR] [-S STDOUT]
diff-fasta: error: argument -?: expected one argument