bed2bed - manipulate bed files


This script provides various methods for merging (by position, by name or by score), filtering and moving bed-formatted intervals, and outputs the results as a bed file.

This script provides several methods, each with a set of options
to control behaviour:

merge

Merge together overlapping or adjacent intervals. The basic
functionality is similar to bedtools merge, but with some additions:

* Merging by name: specifying the --merge-by-name option means that
  only overlapping (or adjacent) intervals with the same value in the
  4th column of the bed file will be merged.

* Removing overlapping intervals with inconsistent names: set the
  --remove-inconsistent-names option.

.. caution::

   Intervals of the same name will only be merged if they are
   consecutive in the bed file.

* Only output merged intervals: by specifying the
  --merge-min-intervals=n option, only those intervals that were
  created by merging at least n intervals together will be output.

Intervals that are close but not overlapping can be merged by setting
--merge-distance to a non-zero value.
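
The merging rules described above can be illustrated with a minimal Python sketch. This is not the script's actual implementation: the function name `merge_intervals`, the tuple layout and the choice to keep the first interval's name are assumptions made for illustration. The input is assumed to be sorted by position, and only consecutive records are compared, which mirrors the caution about same-name intervals.

```python
from typing import List, Tuple

# Hypothetical record layout: contig, start, end, name (bed half-open coordinates)
Interval = Tuple[str, int, int, str]


def merge_intervals(intervals: List[Interval],
                    merge_distance: int = 0,
                    by_name: bool = False) -> List[Interval]:
    """Merge consecutive overlapping or adjacent intervals from a sorted list."""
    merged: List[Interval] = []
    for contig, start, end, name in intervals:
        if merged:
            last_contig, last_start, last_end, last_name = merged[-1]
            # start == last_end means the intervals are adjacent in bed coordinates
            close_enough = (contig == last_contig
                            and start <= last_end + merge_distance)
            same_name = (not by_name) or name == last_name
            if close_enough and same_name:
                # Assumption: the merged interval keeps the first name seen
                merged[-1] = (last_contig, last_start,
                              max(last_end, end), last_name)
                continue
        merged.append((contig, start, end, name))
    return merged


peaks = [
    ("chr1", 100, 200, "a"),
    ("chr1", 150, 300, "a"),
    ("chr1", 300, 400, "b"),
    ("chr1", 450, 500, "b"),
]
print(merge_intervals(peaks))                     # two intervals remain
print(merge_intervals(peaks, by_name=True))       # three intervals remain
print(merge_intervals(peaks, merge_distance=50))  # one interval remains
```

With `by_name=True` the adjacent "a" and "b" intervals stay separate, while a `merge_distance` of 50 bridges the gap between positions 400 and 450.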

bins

Merges together overlapping or adjacent intervals only if they have
"similar" scores. Score similarity is assessed by creating a number of
score bins and assigning each interval to a bin. If two adjacent
intervals are in the same bin, the intervals are merged. Note that in
contrast to merge-by-name above, two intervals do not need to be
overlapping or within a certain distance to be merged.
There are several methods to create the bins:

* equal-bases: Bins are created so that they contain the same number
  of bases. Specified by passing “equal-bases” to --binning-method.
  This is the default.

* equal-intervals: Score bins are created so that each bin contains
  the same number of intervals. Specified by passing
  “equal-intervals” to --binning-method.

* equal-range: Score bins are created so that each bin covers the
  same fraction of the total range of scores. Specified by passing
  “equal-range” to --binning-method.

* bin-edges: Score bins can be specified manually by passing a
  comma-separated list of bin edges to --bin-edges.

The number of bins is specified by the --num-bins option, and the
default is 5.
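
As a rough illustration of score binning, the sketch below computes equal-range bin edges and merges consecutive intervals that fall into the same bin. The names `equal_range_edges` and `merge_by_bin` are hypothetical, and reporting the bin index in place of the score is an assumption; the script's own output may differ. A manually chosen edge list, as with --bin-edges, could be passed to `merge_by_bin` in the same way.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical record layout: contig, start, end, score
ScoredInterval = Tuple[str, int, int, float]


def equal_range_edges(scores: List[float], num_bins: int) -> List[float]:
    """Interior edges that split [min, max] into num_bins equal-width bins."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(1, num_bins)]


def merge_by_bin(intervals: List[ScoredInterval],
                 edges: List[float]) -> List[Tuple[str, int, int, int]]:
    """Merge consecutive intervals on the same contig that share a score bin."""
    merged: List[Tuple[str, int, int, int]] = []
    for contig, start, end, score in intervals:
        bin_index = bisect_right(edges, score)
        if merged and merged[-1][0] == contig and merged[-1][3] == bin_index:
            # Same bin as the previous interval: extend it, whatever the gap
            c, s, e, b = merged[-1]
            merged[-1] = (c, s, max(e, end), b)
        else:
            merged.append((contig, start, end, bin_index))
    return merged


peaks = [
    ("chr1", 0, 10, 1.0),
    ("chr1", 50, 60, 2.0),
    ("chr1", 100, 110, 9.0),
    ("chr1", 120, 130, 10.0),
]
edges = equal_range_edges([p[3] for p in peaks], num_bins=2)
print(edges)
print(merge_by_bin(peaks, edges))
```

Note that the first two intervals are merged even though they are 40 bases apart, which matches the statement that distance is not considered in this method.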

block

Creates blocked bed12 output from a bed6, where intervals with the
same name are merged together to create a single bed12 entry.

.. caution::

   Input must be sorted so that entries of the same name are together.
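
The conversion from one-line-per-exon bed6 records to bed12 can be sketched as follows. This is an illustrative sketch only: `to_bed12` is a hypothetical name, and setting thickStart/thickEnd to the full span and itemRgb to "0" are assumptions rather than the script's documented behaviour. Because `itertools.groupby` only groups consecutive records, the sketch has the same sorting requirement as the caution above.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

# Hypothetical record layout: contig, start, end, name, score, strand
Bed6 = Tuple[str, int, int, str, str, str]


def to_bed12(entries: Iterable[Bed6]) -> List[str]:
    """Collapse consecutive bed6 records sharing a name into bed12 lines."""
    lines = []
    for name, group in groupby(entries, key=lambda e: e[3]):
        blocks = sorted(group, key=lambda e: e[1])
        contig, score, strand = blocks[0][0], blocks[0][4], blocks[0][5]
        start = blocks[0][1]
        end = max(b[2] for b in blocks)
        # Block sizes and block starts relative to the start of the entry
        sizes = ",".join(str(b[2] - b[1]) for b in blocks)
        starts = ",".join(str(b[1] - start) for b in blocks)
        fields = [contig, start, end, name, score, strand,
                  start, end, "0", len(blocks), sizes, starts]
        lines.append("\t".join(map(str, fields)))
    return lines


exons = [
    ("chr1", 100, 200, "tx1", "0", "+"),
    ("chr1", 300, 400, "tx1", "0", "+"),
    ("chr1", 1000, 1100, "tx2", "0", "-"),
]
for line in to_bed12(exons):
    print(line)
```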

filter-genome

Removes intervals that are on unknown contigs or extend off the 3' or
5' end of the contig. Requires a tab-separated input file to -g which
lists the contigs in the genome, plus their lengths.

sanitize-genome

As above, but instead of removing intervals overlapping the ends of
contigs, truncates them. Also removes empty intervals.
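
The difference between removing and truncating out-of-range intervals can be shown with a short sketch. The function names and the `contig_sizes` dictionary are hypothetical stand-ins for the genome information supplied via -g, and dropping intervals on unknown contigs when truncating is an assumption based on the "as above" wording.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # contig, start, end


def filter_genome(intervals: List[Interval],
                  contig_sizes: Dict[str, int]) -> List[Interval]:
    """Keep only intervals that lie entirely within a known contig."""
    return [(c, s, e) for c, s, e in intervals
            if c in contig_sizes and s >= 0 and e <= contig_sizes[c]]


def sanitize_genome(intervals: List[Interval],
                    contig_sizes: Dict[str, int]) -> List[Interval]:
    """Truncate intervals to contig limits and drop those left empty."""
    kept = []
    for c, s, e in intervals:
        if c not in contig_sizes:
            continue  # assumption: unknown contigs are removed here too
        s, e = max(s, 0), min(e, contig_sizes[c])
        if s < e:
            kept.append((c, s, e))
    return kept


sizes = {"chr1": 1000}
intervals = [
    ("chr1", 900, 1100),   # runs off the end of chr1
    ("chr1", 10, 20),      # fully inside chr1
    ("chrUn", 0, 5),       # unknown contig
    ("chr1", 1000, 1200),  # empty once truncated
]
print(filter_genome(intervals, sizes))
print(sanitize_genome(intervals, sizes))
```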

filter-names

Outputs intervals whose names are in a list of desired names. Names are
supplied as a file with one name on each line.
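
A minimal sketch of name-based filtering, assuming tab-separated bed lines with the name in the 4th column; `filter_names` is a hypothetical helper, not part of the script.

```python
from typing import Iterable, List


def filter_names(bed_lines: Iterable[str], names: Iterable[str]) -> List[str]:
    """Return the bed lines whose 4th column appears in the list of names."""
    wanted = {n.strip() for n in names if n.strip()}
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] in wanted:
            kept.append(line)
    return kept


bed = [
    "chr1\t0\t10\tgeneA",
    "chr1\t20\t30\tgeneB",
    "chr2\t5\t9\tgeneC",
]
# One name per line, as read from a names file
print(filter_names(bed, ["geneA\n", "geneC\n"]))
```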

shift

Moves intervals by the specified amount, but will not allow them to be
shifted off the end of contigs. Thus if a shift would move the start
or end of an interval past the end of the contig, the interval is only
moved as far as is possible without doing this.
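
The clamping behaviour can be sketched as below. `shift_interval` is a hypothetical helper that assumes the interval already lies within the contig and that the contig length is known; it illustrates the idea rather than the script's own code.

```python
from typing import Tuple


def shift_interval(start: int, end: int, offset: int,
                   contig_size: int) -> Tuple[int, int]:
    """Shift an interval by offset, stopping at either end of the contig."""
    max_left = -start              # most negative shift that keeps start >= 0
    max_right = contig_size - end  # most positive shift that keeps end in range
    offset = max(max_left, min(offset, max_right))
    return start + offset, end + offset


print(shift_interval(100, 200, 25, 1000))   # (125, 225): moved in full
print(shift_interval(900, 950, 100, 1000))  # (950, 1000): stops at the contig end
print(shift_interval(30, 80, -100, 1000))   # (0, 50): stops at the contig start
```

The interval length is preserved in every case, because the offset is reduced instead of truncating the interval.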

rename-chr

Renames chromosome names. Source and target names are supplied as a file
with two columns. Examples are available at:
Note that unmapped chromosomes are dropped from the output file.
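
Renaming with a two-column mapping might look like the following sketch. The helper names are hypothetical, and treating the mapping file as tab-separated, with an empty second column meaning "unmapped", is an assumption.

```python
from typing import Dict, Iterable, List


def load_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Read source/target chromosome names from two-column lines."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_chromosomes(bed_lines: Iterable[str],
                       mapping: Dict[str, str]) -> List[str]:
    """Rename the first column; lines on unmapped chromosomes are dropped."""
    renamed = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in mapping:
            fields[0] = mapping[fields[0]]
            renamed.append("\t".join(fields))
    return renamed


mapping = load_mapping(["chr1\t1", "chrM\tMT", "chrUn_gl000220\t"])
bed = ["chr1\t10\t20", "chrUn_gl000220\t5\t9", "chrM\t0\t100"]
print(rename_chromosomes(bed, mapping))
```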
Other options
-g/--genome-file, -b/--bam-file:

The filter-genome, sanitize-genome and shift methods require a genome in order to ensure they are not placing intervals outside the limits of contigs. This genome can be supplied either as a samtools or cgat indexed genome, or extracted from the header of a BAM file.


Examples

Merge overlapping or adjacent peaks from a ChIP-seq experiment where the intervals have the same name:

cat chip-peaks.bed | cgat bed2bed --method=merge --merge-by-name > chip-peaks-merged.bed

Merge adjacent ChIP-seq peaks if their scores are in the same quartile of all scores:

cat chip-peaks.bed | cgat bed2bed --method=bins --binning-method=equal-intervals --num-bins=4

Remove intervals that overlap the ends of a contig and those that are on a non-standard contig. Take the input intervals from a file rather than stdin. Note that hg19.fasta has been indexed with index_genome:

cgat bed2bed --method=filter-genome --genome-file=hg19.fasta -I chip-peaks.bed -O chip-peaks-sanitized.bed

Convert a bed file containing gene structures with one line per exon to a bed12 with linked blocks representing the gene structure. Note the transparent use of compressed input and output files:

cgat bed2bed --method=block -I transcripts.bed.gz -O transcripts.blocked.bed.gz

Rename UCSC chromosomes to ENSEMBL:

cat ucsc.bed | cgat bed2bed --method=rename-chr --rename-chr-file=ucsc2ensembl.txt > ensembl.bed


Usage

cgat bed2bed --method=[METHOD] [OPTIONS]

Will read a bed file from stdin and apply the specified method.

Command line options

usage: bed2bed [-h]
               [-m {merge,filter-genome,bins,block,sanitize-genome,shift,extend,filter-names,rename-chr}]
               [--num-bins NUM_BINS] [--bin-edges BIN_EDGES]
               [--binning-method {equal-bases,equal-intervals,equal-range}]
               [--merge-distance MERGE_DISTANCE]
               [--merge-min-intervals MERGE_MIN_INTERVALS] [--merge-by-name]
               [--merge-and-resolve-blocks] [--merge-stranded]
               [--remove-inconsistent-names] [--offset OFFSET]
               [-g GENOME_FILE] [-b BAM_FILE] [--filter-names-file NAMES]
               [--rename-chr-file RENAME_CHR_FILE] [--timeit TIMEIT_FILE]
               [--timeit-name TIMEIT_NAME] [--timeit-header]
               [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]
