Contributing to CGAT code¶
We encourage everyone who uses parts of the CGAT code collection to contribute. Contributions can take many forms: bugreports, bugfixes, new scripts and pipelines, documentation, tests, etc. All contributions are welcome.
Checklist for new scripts/modules¶
Before adding a new scripts to the repository, please check if the following are true:
The script performs a non-trivial task. If a one-line command line entry using standard unix commands can give the same effect, avoid adding a script to the repository.
The script has a clear purpose. Scripts should follow the unix philosophy. They should concentrate on one task and do it well. Ideally, the major input and output can be read from and written to standard input and standard output, respectively.
The script follows the naming convention of cgat.tools.
The scripts follows the Style Guide.
The script implements the
-h/--helpoptions. Ideally, the script has been derived from
The script can be imported. Ideally, it imports without performing any actions or writing output.
The script is well documented and the documentation has been added to the CGAT documentation. There should be an entry in
doc/scripts.rstand a file
The script has at least one test case added to
tests- and the test works (see Testing).
To add an extension, the following needs to be in place:
The main script (
scripts/bam2stats.py). The important lines in this script are:
try: import pyximport pyximport.install() import _bam2stats except ImportError: import CGAT._bam2stats as _bam2stats
The snippet first attempts to build and import the extension by setting up pyximport and then importing the cython module as
_bam2stats. In case this fails, as is the case for an installed code, it looks for a pre-built extension (by
setup.py) in the CGAT pacakge.
The cython implementation
_bam2stats.pyx. This script imports the pysam API via:
from csamtools cimport *
This statement imports, amongst others,
AlignedReadinto the namespace. Speed can be gained from declaring variables. For example, to efficiently iterate over a file, an
AlignedReadobject is declared:
# loop over samfile cdef AlignedRead read for read in samfile: ...
pyxbldproviding pyximport with build information. Required are the locations of the samtools and pysam header libraries of a source installation of pysam plus the
csamtools.soshared library. For example:
def make_ext(modname, pyxfilename): from distutils.extension import Extension import pysam, os dirname = os.path.dirname( pysam.__file__ )[:-len("pysam")] return Extension(name = modname, sources=[pyxfilename], extra_link_args=[ os.path.join( dirname, "csamtools.so")], include_dirs = pysam.get_include(), define_macros = pysam.get_defines() )
If the script
bam2stats.py is called the first time,
pyximport will compile the cython extension
and make it available to the script. Compilation requires a working
compiler and cython installation. Each time
is modified, a new compilation will take place.
Recipes are short use cases demonstrating the use of one or more CGAT utilities to address a specific problem.
Recipes should be written as ipython notebooks. The recipe notebooks
are stored in the
recipes directory in the repository. Each
recipe is within its individual directory. This minimizes
interference between each document, but also means that currently each
notebook needs a separate notebook server to be developped.
To build all recipes, type:
cd recipes make html make clean
This will build html files that are deposited in the docs directory.
The last cleaning up step is important in order to remove large files created during the notebook execution.
The commands above require the runipy python module. To install, type:
pip install runipy
Data for recipes can be made available in www.cgat.org/downloads/public/cgat/recipes. Ideally, recipes should make use of publicly available data sets such as ENCODE.
Attempt to add a plot to the end of a recipe, using R commands to create the plot within the notebook.
Best practice for CGAT pipelines:
All non-trivial code should be extracted to modules or scripts.
Modules should not access PARAMS dictionary directly, but parameters should be passed to the function.
Important processing steps where different external tools could potentially be employed the design of the module classes should be carefully considered to ensure consistent input and output file formats for different tools. PipelineMapping provides a good example for this.
All production pipelines should include tests for consistency which can be run automatically.
Where appropriate pipelines should include a small test dataset with published results for comparison. This dataset can be run on each pipeline run and included in the pipeline report where it can be used as a pipeline control.
Periodic code review meetings where interested parties can agree of major changes to production pipelines and associated modules – to be arranged as required.
The best way to manage pipeline improvements is by individuals using pipelines taking responsibility for incremental improvement. As best practice fellows should announce plans to modify particular pipelines and modules on the CGAT members list to avoid duplication of effort. Fellows should log the changes that they make in a change log and document both modules and pipelines in detail.
Add a section with Requirements to all pipeline scripts and tools. Only add them in files where the actual dependency arises, see <no title>.