CBioPortal.py - Interface with the Sloan-Kettering cBioPortal webservice

The Sloan Kettering cBioPortal webservice provides access to a database of results of genomics experiments on various cancers. The database is organised into studies. Each study contains a number of case lists, each holding the ids of a set of patients, and a number of genetic profiles, each representing an assay conducted on the patients in a case list as part of the study.

The main class here is the CBioPortal class, which represents a connection to the cBioPortal database. Queries are represented as methods of the class. Study ids, study names or case lists can be provided to the constructor, set via the setDefaultStudy and setDefaultCaseList methods, or passed to the individual query methods. Wherever possible the validity of parameters is checked before the query is executed.

Whenever a query requires a genetic profile id or a list of such ids, but none are given, the list of all profiles for which the show_in_analysis flag is set will be used.
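The default-profile rule can be pictured as a simple filter. The snippet below is a minimal sketch, not the module's own code: it assumes each profile is a dictionary with a genetic_profile_id key and a show_in_analysis key holding the text "true" or "false" (or a bool), and the sample profiles are made up for illustration.

```python
def default_profile_ids(profiles):
    """Return the ids of profiles whose show_in_analysis flag is set.

    Assumption: the flag arrives as text ("true"/"false") or as a bool.
    """
    selected = []
    for profile in profiles:
        flag = str(profile.get("show_in_analysis", "")).strip().lower()
        if flag == "true":
            selected.append(profile["genetic_profile_id"])
    return selected


# Hypothetical profile records, for illustration only.
profiles = [
    {"genetic_profile_id": "study_x_mrna", "show_in_analysis": "true"},
    {"genetic_profile_id": "study_x_cna", "show_in_analysis": "true"},
    {"genetic_profile_id": "study_x_methylation", "show_in_analysis": "false"},
]

print(default_profile_ids(profiles))
```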

All of the commands provided by the webservice are implemented here and, as far as possible, the name, syntax and parameter names of each query are identical to the raw webservice commands. These queries are:

  • getCancerStudies,

  • getCaseLists,

  • getProfileData,

  • getMutationData,

  • getClinicalData,

  • getProteinArrayInfo,

  • getProteinArrayData,

  • getLink,

  • getOncoprintHTML.

In addition, two new queries are implemented that are not part of the webservice:

  • getPercentAltered and

  • getTotalAltered

These emulate the function of the website: getPercentAltered returns, for each gene, the percent of cases that show any alteration in the given profiles, and getTotalAltered returns the percent of cases that show an alteration in any of the genes.

examples:

gene_list = ["TP53", "BCL2", "MYC"]
portal = CBioPortal()
portal.setDefaultStudy(study = "prad_mskcc")
portal.setDefaultCaseList(case_set_id = "prad_all_complete")
portal.getPercentAltered(gene_list = gene_list)

or more tersely:

portal = CBioPortal()
portal.getPercentAltered(study = "prad_mskcc", case_set_id = "prad_all_complete",
                         gene_list = ["TP53","BCL2","MYC"],
                         genetic_profile_id =["prad_mskcc_mrna"])

Any warnings returned by the query are stored in CBioPortal.last_warnings.

Queries that would produce too long a URL are split into smaller queries and the results combined transparently.
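One way to picture this splitting is to grow a chunk of genes until the next gene would push the URL past a limit. The sketch below is only an illustration of that idea: the function name, the 2000-character default limit, the example URL and the request parameters are all assumptions, not values taken from the module.

```python
from urllib.parse import urlencode


def split_gene_list(base_url, params, gene_list, max_url_length=2000):
    """Split gene_list into chunks so that each query URL stays short.

    max_url_length is an assumed limit chosen for this sketch.
    """
    chunks = []
    current = []
    for gene in gene_list:
        candidate = current + [gene]
        query = dict(params, gene_list=",".join(candidate))
        url = base_url + "?" + urlencode(query)
        if current and len(url) > max_url_length:
            # Adding this gene would overflow: close the current chunk.
            chunks.append(current)
            current = [gene]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical endpoint, parameters and gene names.
base_url = "https://example.org/webservice.do"
params = {"cmd": "getProfileData"}
genes = ["GENE%d" % i for i in range(500)]

chunks = split_gene_list(base_url, params, genes, max_url_length=200)

# Concatenating the chunks in order restores the original list, which is
# what makes it possible to combine the partial results afterwards.
recombined = [gene for chunk in chunks for gene in chunk]
print(len(chunks), recombined == genes)
```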

A command-line interface is provided for convenience. Syntax:

python CBioPortal.py [options] command(s)

Reference

class CBioPortal.CBioPortal(url=None, study=None, study_name=None, case_list_id=None)

Bases: object

Connect to the cBioPortal database.

If no url is specified, the default url is used. A list of valid study ids is retrieved from the database. This both confirms that the database is reachable and provides a cache for checking ids supplied later. If a study or study name is provided, it is set as the default study for this session and the details of the available profiles and cases are retrieved. ‘study’ is the study id. If both study and study_name are specified, the study id is used.

getCancerStudies()

Fetches the list of cancer studies currently in the database.

Returns a list of dictionaries with three entries: ‘cancer_study_id’, ‘name’ and ‘description’. The data is also cached to verify the validity of later calls.

getGeneticProfiles(study=None, study_name=None)

Fetches the valid genetic profiles for a particular study.

study is the study id. If both study and study_name are specified, study is used. If neither is specified, the default study is used if one is set; otherwise a ValueError is raised. Returns a list of dictionaries.
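The precedence between study, study_name and the default study can be sketched as a small helper. This is an illustration under stated assumptions rather than the module's code: resolve_study is a hypothetical name, and the study records are invented, using only the three keys that getCancerStudies is documented to return.

```python
def resolve_study(studies, study=None, study_name=None, default_study=None):
    """Pick a study id: explicit id first, then name, then the default."""
    valid_ids = {entry["cancer_study_id"] for entry in studies}
    if study is not None:
        if study not in valid_ids:
            raise ValueError("%s is not a valid study id" % study)
        return study
    if study_name is not None:
        for entry in studies:
            if entry["name"] == study_name:
                return entry["cancer_study_id"]
        raise ValueError("%s is not a valid study name" % study_name)
    if default_study is not None:
        return default_study
    raise ValueError("no study specified and no default study set")


# Hypothetical cached study list.
studies = [
    {"cancer_study_id": "study_a", "name": "Study A",
     "description": "first example study"},
    {"cancer_study_id": "study_b", "name": "Study B",
     "description": "second example study"},
]

# The id wins when both an id and a name are given.
print(resolve_study(studies, study="study_b", study_name="Study A"))
```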

getCaseLists(study=None, study_name=None)

Retrieves meta-data regarding all case lists stored about a specific cancer study.

For example, within a particular study, only some cases may have sequence data, and another subset of cases may have been sequenced and treated with a specific therapeutic protocol. Multiple case lists may be associated with each cancer study, and this method enables you to retrieve meta-data regarding all of these case lists.

Data is returned as a list of dictionaries with the following entries:

  • case_list_id: a unique ID used to identify the case list in subsequent interface calls. This is a human-readable ID. For example, “gbm_all” identifies all cases profiled in the TCGA GBM study.

  • case_list_name: short name for the case list.

  • case_list_description: short description of the case list.

  • cancer_study_id: cancer study ID tied to this case list. Will match the input cancer_study_id.

  • case_ids: space delimited list of all case IDs that make up this case list.
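Because case_ids is a single space-delimited string, it usually needs splitting before use. The helper below is a minimal sketch with a hypothetical name and made-up records; it relies only on the entry names listed above.

```python
def case_ids_for(case_lists, case_list_id):
    """Return the case ids of one case list as a Python list."""
    for entry in case_lists:
        if entry["case_list_id"] == case_list_id:
            # str.split() with no argument splits on any run of whitespace.
            return entry["case_ids"].split()
    raise ValueError("unknown case list: %s" % case_list_id)


# Hypothetical case-list records shaped like the entries described above.
case_lists = [
    {
        "case_list_id": "study_a_all",
        "case_list_name": "All cases",
        "case_list_description": "All cases in the example study",
        "cancer_study_id": "study_a",
        "case_ids": "CASE-01 CASE-02 CASE-03",
    },
]

print(case_ids_for(case_lists, "study_a_all"))
```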

getProfileData(gene_list, case_set_id=None, genetic_profile_id=None, study=None, study_name=None)

Retrieves genomic profile data for one or more genes.

You can specify one gene and many profiles, or one profile and many genes. If you specify no genetic profiles, all genetic profiles for the specified or default study are used, provided the case_set_id is from that study; otherwise a ValueError is raised.

The return value depends on the parameters. If you specify a single genetic profile and multiple genes, a list of ordered dictionaries with the following entries is returned:

gene_id: Entrez Gene ID
common: HUGO Gene Symbol
entries 3 - N: Data for each case

If you specify multiple genetic profiles and a single gene, a list of ordered dictionaries with the following entries is returned:

genetic_profile_id: The Genetic Profile ID.
alteration_type: The Genetic Alteration Type, e.g. MUTATION, MUTATION_EXTENDED, COPY_NUMBER_ALTERATION, or MRNA_EXPRESSION.
gene_id: Entrez Gene ID.
common: HUGO Gene Symbol.
Columns 5 - N: Data for each case.

getMutationData(gene_list, genetic_profile_id, case_set_id=None, study=None, study_name=None)

For data of type EXTENDED_MUTATION, you can request the full set of annotated extended mutation data.

This enables you to, for example, determine which sequencing center sequenced the mutation, the amino acid change that results from the mutation, or gather links to predicted functional consequences of the mutation.

Query Format

  • case_set_id = [case set ID] (required)

  • genetic_profile_id = [a single genetic profile ID] (required)

  • gene_list = [one or more genes, specified as HUGO Gene Symbols or Entrez Gene IDs] (required)

Response Format

A list of dictionaries with the following entries:

  • entrez_gene_id: Entrez Gene ID.

  • gene_symbol: HUGO Gene Symbol.

  • case_id: Case ID.

  • sequencing_center: Sequencing center responsible for identifying this mutation. For example: broad.mit.edu.

  • mutation_status: somatic or germline mutation status. All mutations returned will be of type somatic.

  • mutation_type: mutation type, such as nonsense, missense, or frameshift_ins.

  • validation_status: validation status. Usually valid, invalid, or unknown.

  • amino_acid_change: amino acid change resulting from the mutation.

  • functional_impact_score: predicted functional impact score, as predicted by Mutation Assessor.

  • xvar_link: Link to the Mutation Assessor web site.

  • xvar_link_pdb: Link to the Protein Data Bank (PDB) view within the Mutation Assessor web site.

  • xvar_link_msa: Link to the Multiple Sequence Alignment (MSA) view within the Mutation Assessor web site.

  • chr: chromosome where the mutation occurs.

  • start_position: start position of the mutation.

  • end_position: end position of the mutation.

If a default study is set, a check is performed to see whether the supplied case_set_id is from that study. The study can be overridden using the study and study_name parameters.

getClinicalData(case_set_id=None, study=None, study_name=None)

Retrieves overall survival, disease free survival and age at diagnosis for specified cases.

Due to patient privacy restrictions, no other clinical data is available.

case_set_id = [case set ID] (required)

Returns a list of dictionaries with the following entries:

  • case_id: Unique Case Identifier.

  • overall_survival_months: Overall survival, in months.

  • overall_survival_status: Overall survival status, usually indicated as “LIVING” or “DECEASED”.

  • disease_free_survival_months: Disease free survival, in months.

  • disease_free_survival_status: Disease free survival status, usually indicated as “DiseaseFree” or “Recurred/Progressed”.

  • age_at_diagnosis: Age at diagnosis.

If a study is specified or a default study is set, the case_set_id is checked to confirm that it exists for that study.

getProteinArrayInfo(protein_array_type=None, gene_list=None, study=None, study_name=None)

Retrieves information on antibodies used by reverse-phase protein arrays (RPPA) to measure protein/phosphoprotein levels.

  • cancer_study_id = [cancer study ID] (required)

  • protein_array_type = [protein_level or phosphorylation]

  • gene_list = [one or more genes, specified as HUGO Gene Symbols or Entrez Gene IDs]

Returns a list of dictionaries with the following entries:

  • ARRAY_ID: The protein array ID.

  • ARRAY_TYPE: The protein array antibody type, i.e. protein_level or phosphorylation.

  • GENE: The targeted gene name (HUGO gene symbol).

  • RESIDUE: The targeted residue(s).

If no study is specified the default study is used. If that is not specified an error is raised.

getProteinArrayData(protein_array_id=None, case_set_id=None, array_info=0, study=None, study_name=None)

Retrieves protein and/or phosphoprotein levels measured by reverse-phase protein arrays (RPPA).

  • case_set_id = [case set ID]

  • protein_array_id = [one or more protein array IDs], given as a list

  • array_info = [1 or 0]. If 1, antibody information will also be exported.

If array_info is not specified or is not 1, a list of dictionaries with the following entries is returned:

  • ARRAY_ID: The protein array ID.

  • Columns 2 - N: Data for each case.

If array_info is 1, a list of ordered dictionaries with the following entries is returned:

  • ARRAY_ID: The protein array ID.

  • ARRAY_TYPE: The protein array antibody type, i.e. protein_level or phosphorylation.

  • GENE: The targeted gene name (HUGO gene symbol).

  • RESIDUE: The targeted residue(s).

  • Columns 5 - N: Data for each case.

If the default study is set, the case_set_id will be checked against it. The default study can be overridden using the study or study_name parameters.

getLink returns a permanent link to the cBioPortal report for the gene list.

  • cancer_study_id = [cancer study ID]

  • gene_list = [a comma separated list of HUGO gene symbols] (required)

  • report = [report to display; can be one of: full (default), oncoprint_html]

getOncoprintHTML(gene_list, study=None, study_name=None)

Returns the HTML for the oncoprint report for the specified gene list and study.

setDefaultStudy(study=None, study_name=None)

Sets a new study as the default study. The study id is checked for validity.

setDefaultCaseList(case_set_id, study=None, study_name=None)

Sets the default case list. If study is not specified, the default study is used.

The study is used to check that the case set exists.

getPercentAltered(gene_list, study=None, study_name=None, case_set_id=None, genetic_profile_id=None, threshold=2)

Gets the percent of cases that have one or more of the specified alterations for each gene.

  • study = [cancer_study_id] The study to use.

  • study_name = [cancer_study_name] The name of the study to use. If neither this nor study is specified, the default is used.

  • case_set_id = [case_set_id] The case list to use. If not specified, the default case list is used.

  • gene_list = [one or more genes, specified as HUGO Gene Symbols or Entrez Gene IDs] (required)

  • genetic_profile_id = [one or more genetic profile IDs] If none are specified, all genetic profiles for the specified study are used.

  • threshold = [z_score_threshold] The numeric threshold at which an mRNA expression z-score is said to be significant.

Returns a list of dictionaries with the following entries:

  • gene_id: The Entrez Gene ID.

  • common: The HUGO Gene Symbol.

  • altered_in: The percent of cases in which the gene is altered.

One implementation note: a guess must be made as to whether a returned profile value represents an alteration or not. Currently guesses are only made for copy number variation, mRNA expression and mutation profiles.
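To make the idea of such a guess concrete, here is a minimal sketch. The rules in it are assumptions chosen for illustration and are not taken from the module: any non-empty mutation value counts, copy number calls count when their absolute value is at least 2, and mRNA z-scores count when their absolute value reaches the threshold. The function names and sample values are hypothetical.

```python
def looks_altered(value, alteration_type, threshold=2):
    """Guess whether one profile value represents an alteration.

    Every rule below is an assumption made for this sketch.
    """
    if value is None or str(value).strip() in ("", "NaN", "NA"):
        return False
    if alteration_type in ("MUTATION", "MUTATION_EXTENDED"):
        # Assume any reported mutation string counts as an alteration.
        return True
    if alteration_type == "COPY_NUMBER_ALTERATION":
        # Assume discrete calls where -2 and 2 mark deep loss / high gain.
        return abs(float(value)) >= 2
    if alteration_type == "MRNA_EXPRESSION":
        return abs(float(value)) >= threshold
    # No guess is made for other alteration types.
    return False


def percent_altered(values_by_case, alteration_type, threshold=2):
    """Percent of cases whose value is guessed to be an alteration."""
    if not values_by_case:
        return 0.0
    altered = sum(
        looks_altered(value, alteration_type, threshold)
        for value in values_by_case.values()
    )
    return 100.0 * altered / len(values_by_case)


# Hypothetical mRNA z-scores for one gene across four cases.
mrna = {"case_1": "2.5", "case_2": "-0.3", "case_3": "NaN", "case_4": "-2.1"}
print(percent_altered(mrna, "MRNA_EXPRESSION"))
```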

getTotalAltered(gene_list, study=None, study_name=None, case_set_id=None, genetic_profile_id=None, threshold=2)

Calculates the percent of cases in which any one of the specified genes is altered.
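The difference from the per-gene figure is that a case altered in several genes is counted only once. A minimal sketch of that calculation follows; total_altered is a hypothetical helper and the case ids are invented.

```python
def total_altered(altered_by_gene, all_cases):
    """Percent of cases altered in at least one of the genes.

    altered_by_gene maps each gene to the set of case ids in which
    that gene is altered.
    """
    if not all_cases:
        return 0.0
    altered_cases = set()
    for cases in altered_by_gene.values():
        altered_cases |= set(cases)
    # Only count cases that belong to the case list being analysed.
    altered_cases &= set(all_cases)
    return 100.0 * len(altered_cases) / len(all_cases)


# Hypothetical data: five cases, three genes.
all_cases = ["c1", "c2", "c3", "c4", "c5"]
altered_by_gene = {
    "TP53": {"c1", "c2"},
    "MYC": {"c2", "c3"},
    "BCL2": set(),
}

# c2 is altered in two genes but contributes once: 3 of 5 cases.
print(total_altered(altered_by_gene, all_cases))
```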

exception CBioPortal.CDGSError(error, request)

Bases: Exception

Exception that handles errors returned by queries to the database.
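A caller would typically wrap queries in a try/except block. The sketch below uses a stand-in class so that it runs on its own: only the (error, request) constructor signature comes from the reference above, while the error and request attributes, the run_query helper and the request string are assumptions made for illustration.

```python
class CDGSError(Exception):
    """Stand-in exception with the documented (error, request) signature."""

    def __init__(self, error, request):
        super().__init__(error)
        # Attribute names are assumed for this sketch.
        self.error = error
        self.request = request


def run_query(request):
    """Hypothetical query function that always fails, for illustration."""
    raise CDGSError("Error: unknown study", request)


try:
    run_query("cmd=getCaseLists&cancer_study_id=not_a_study")
    message = None
except CDGSError as exc:
    # Keeping the request alongside the error makes failures easier to trace.
    message = "query failed (%s) for request: %s" % (exc.error, exc.request)

print(message)
```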