diff_fasta.py - compare contents of two fasta files¶

Tags: Genomics Sequences FASTA Comparison

Purpose¶

This script takes two sets of fasta sequences and matches the identifiers. It then compares the sequences with the same identifiers and, depending on the output options selected, outputs

which sequences are missing

which sequences are identical

which sequences are prefixes/suffixes of each other

An explanatory field is appended to output sequence identifiers. An explanation of the different field values is provided in the log.

Options¶

-s, --correct-gap-shift: This option will correct shifts in alignment gaps between two sequences being compared
-1, --pattern1: regular expression pattern to extract identifier from in sequence 1
-2, --pattern2: regular expression pattern to extract identifier from in sequence 2

Depending on the option --output-section the following are output:

diff
identifiers of sequences that are different

seqdiff
identifiers of sequences that are different plus sequence

missed
identifiers of seqences that are missing from one set or the other

This script is of specialized interest and has been used in the past to check if ENSEMBL gene models had been correctly mapped into a database schema.

Usage¶

Example:

cat a.fasta | head

>ENSACAP00000004922
MRSRNQGGESSSSGKFSKSKPIINTGENQNLQEDAKKKNKSSRKEE ...
>ENSACAP00000005213
EEEEDESNNSYLYQPLNQDPDQGPAAVEETAPSTEPALDINERLQA ...
>ENSACAP00000018122
LIRSSSMFHIMKHGHYISRFGSKPGLKCIGMHENGIIFNNNPALWK ...

python diff_fasta.py --output-section=missed --output-section=seqdiff a.fasta b.fasta

cat diff.out

# Legend:
# seqs1:          number of sequences in set 1
# seqs2:          number of sequences in set 2
# same:           number of identical sequences
# diff:           number of sequences with differences
# nmissed1:       sequences in set 1 that are not found in set 2
# nmissed2:       sequences in set 2 that are not found in set 1
# Type of sequence differences
# first:          only the first residue is different
# last:           only the last residue is different
# prefix:         one sequence is prefix of the other
# selenocysteine: difference due to selenocysteines
# masked:         difference due to masked residues
# fixed:          fixed differences
# other:          other differences

Type:

python diff_fasta.py --help

for command line help.

Command line options¶

usage: diff-fasta [-h] [--version] [-s] [-1 PATTERN1] [-2 PATTERN2]
                  [-o {diff,missed,seqdiff}] [--timeit TIMEIT_FILE]
                  [--timeit-name TIMEIT_NAME] [--timeit-header]
                  [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                  [--log-config-filename LOG_CONFIG_FILENAME]
                  [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                  [-E STDERR] [-S STDOUT]
diff-fasta: error: argument -?: expected one argument