metabolike.parser package

Submodules

metabolike.parser.bkms_react module

metabolike.parser.bkms_react.get_bkms_tarball(filepath, extract=True)

The file is available at: https://bkms.brenda-enzymes.org/download.php

The compressed file (Reactions_BKMS.tar.gz) contains the table in tab-separated format (readable in Excel or OpenOffice). The table contains current data from BRENDA (release 2021.2, only reactions with naturally occurring substrates), MetaCyc (version 24.5) and SABIO-RK (07/02/2021), as well as KEGG data downloaded on the 23rd of April 2012. More recent KEGG data cannot be offered because a KEGG license agreement would be necessary.
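The extraction step can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: extract_tarball is a hypothetical helper, it assumes the archive has already been downloaded, and the name of the file inside the archive is not confirmed by this documentation.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into dest and return the member names.

    Hypothetical helper: assumes the archive was already downloaded.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Returning the member names lets the caller locate the extracted table without guessing its file name.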

Parameters
  • filepath (str) – The path to store the downloaded file.

  • extract (bool) – Extract the file.

Return type

None

metabolike.parser.bkms_react.read_bkms(filepath, clean=True)

Read the BKMS-react table and prepare it for further processing.

The table contains stray ^M (carriage return) characters in some rows. These characters won’t break pandas, but they will make the parsed table unexpectedly long. The clean parameter can be used to remove these characters.

BRENDA takes EC numbers as identifiers, so we need entries with non-empty EC_Number and Reaction_ID_MetaCyc columns. There is a Reaction_ID_BRENDA column that’s sometimes non-empty when the EC number is missing, but the field is not documented, and it’s not clear how it can be mapped to an entry in the BRENDA text file.

To be extra conservative, we only keep entries with non-empty Reaction_ID_KEGG columns. Only reactions in MetaCyc/BioCyc that are associated with matching EC numbers and KEGG IDs will be annotated.
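The cleaning and filtering rules above can be sketched as follows. This is an illustration that uses the standard csv module instead of pandas, so it is not the function's actual code. The column names come from the description above; read_table and the sample rows are hypothetical.

```python
import csv
import io

# Columns that must be non-empty for a row to be kept.
REQUIRED = ("EC_Number", "Reaction_ID_MetaCyc", "Reaction_ID_KEGG")


def read_table(text: str) -> list[dict[str, str]]:
    """Drop stray carriage returns, then keep rows with all required IDs."""
    text = text.replace("\r", "")  # remove the ^M characters
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        row
        for row in reader
        if all((row.get(col) or "").strip() for col in REQUIRED)
    ]


# Made-up sample: only the first data row has all three identifiers.
sample = (
    "EC_Number\tReaction_ID_MetaCyc\tReaction_ID_KEGG\n"
    "1.1.1.1\tRXN-1\r\tR00001\n"
    "\tRXN-2\tR00002\n"
    "1.1.1.3\tRXN-3\t\n"
)
rows = read_table(sample)
```

In this sample, the row without an EC number and the row without a KEGG ID are both dropped.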

Parameters
  • filepath (str) – The path to the BKMS-react .tsv file.

  • clean (bool) – Remove the ^M characters.

Return type

A pandas DataFrame

metabolike.parser.brenda module

Reading and parsing the BRENDA text file.

A lot of the code is translated from the brendaDb R package. The BRENDA text file is organised in an EC-number specific format. The information on each EC-number is given in a very short and compact way in a part of the file. The contents of each line are described by a two/three letter acronym at the beginning of the line and start after a TAB. Empty spaces at the beginning of a line indicate a continuation line.

The contents are organised in ~40 information fields as given below. Protein information is included in #...#, literature citations are in <...>, commentaries in (...) and field-special information in {...}.

It’s not officially documented, but some fields also have commentaries wrapped in |...|. These are usually for reaction-related fields, so (...) would be for substrates and |...| for products.

Protein information is given as the combination organism/Uniprot accession number where available. When this information is not given in the original paper only the organism is given.
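The delimiters described above can be illustrated with a few regular expressions. The real parser is built with Lark, so this is only a sketch of the notation. split_entry and the sample entry are hypothetical, and the code assumes that a commentary starts with a #...# protein reference.

```python
import re
from typing import Any


def split_entry(entry: str) -> dict[str, Any]:
    """Split one field entry into protein IDs, value, extras and references."""
    # Leading protein IDs, e.g. "#1,2#".
    proteins: list[int] = []
    head = re.match(r"#(\d+(?:,\d+)*)#\s*", entry)
    if head:
        proteins = [int(p) for p in head.group(1).split(",")]
        entry = entry[head.end():]

    # Trailing literature citations, e.g. "<3,4>".
    refs: list[int] = []
    tail = re.search(r"\s*<(\d+(?:,\d+)*)>\s*$", entry)
    if tail:
        refs = [int(r) for r in tail.group(1).split(",")]
        entry = entry[:tail.start()]

    # Trailing commentary in (...); assumed to start with "#".
    commentary = None
    com = re.search(r"\s*\((#.*)\)\s*$", entry)
    if com:
        commentary = com.group(1)
        entry = entry[:com.start()]

    # Trailing field-special information in {...}.
    specific = None
    spec = re.search(r"\s*\{([^{}]*)\}\s*$", entry)
    if spec:
        specific = spec.group(1)
        entry = entry[:spec.start()]

    return {
        "proteins": proteins,
        "value": entry.strip(),
        "specific": specific,
        "commentary": commentary,
        "refs": refs,
    }


parsed = split_entry("#4# 0.045 {NAD+} (#4# pH 7.0 <15>) <15>")
```

Requiring the commentary to open with "(#" keeps parentheses inside chemical names, such as "(S)-lactate", out of the commentary match.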

/// indicates the end of an EC-number specific part.
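The overall file layout (acronym, TAB, content, continuation lines and the /// terminator) can be sketched as below. This is a simplified illustration rather than the package's reader. split_sections and the sample text are hypothetical, and lines without a TAB, such as section headings, are simply skipped.

```python
def split_sections(text: str) -> list[dict[str, list[str]]]:
    """Group lines into EC-specific sections keyed by field acronym."""
    sections: list[dict[str, list[str]]] = []
    current: dict[str, list[str]] = {}
    last_field = None
    for line in text.splitlines():
        if line.strip() == "///":
            # End of one EC-number specific part.
            if current:
                sections.append(current)
            current, last_field = {}, None
        elif line.startswith(("\t", " ")) and last_field is not None:
            # Leading whitespace marks a continuation of the previous entry.
            current[last_field][-1] += " " + line.strip()
        elif "\t" in line:
            # "<acronym>TAB<content>" starts a new entry.
            field, content = line.split("\t", 1)
            current.setdefault(field, []).append(content.strip())
            last_field = field
    return sections


# Made-up sample with one continuation line.
sample = (
    "ID\t1.1.1.1\n"
    "PR\t#1# Homo sapiens\n"
    "\t<1,2>\n"
    "///\n"
    "ID\t1.1.1.2\n"
    "///\n"
)
sections = split_sections(sample)
```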

metabolike.parser.brenda.parse_brenda(filepath, cache=False, ec_nums=None)

Parse the BRENDA text file into a dict.

This implementation focuses on extracting information from the text file and feeding the data into a Neo4j database. The parser is implemented using Lark. A series of lark.visitors.Transformer classes are used to clean the data and convert it into the format required by Neo4j.

Parameters
  • filepath (Union[str, Path]) – The path to the BRENDA text file.

  • cache (bool) – Whether to cache the parsed data to a parquet file.

  • ec_nums (Optional[Iterable[str]]) – A list of EC numbers to extract.

Return type

Dict[str, Any]

Returns

A dict with the description column from read_brenda() transformed into lists of dicts stored as values, and EC numbers as keys.

metabolike.parser.brenda_transformer module

class metabolike.parser.brenda_transformer.BaseTransformer(visit_tokens=True)

Bases: Transformer

NUM_ID(num)
TOKEN(tok)
protein_id(children)
Return type

Token

ref_id(children)
Return type

Token

class metabolike.parser.brenda_transformer.CommentaryOnlyTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

description(children)
Return type

Token

entry(children)
Return type

Dict[str, Any]

class metabolike.parser.brenda_transformer.GenericTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

Transform extracted values from bottom-up.

Formats the tree into a dictionary, as described in entry().

commentary(children)
Return type

Token

content(children)
Return type

Dict[str, Union[str, List[int]]]

description(children)
Return type

Token

entry(children)
Return type

Dict[str, Any]

class metabolike.parser.brenda_transformer.ReactionTreeTransformer(visit_tokens=True)

Bases: GenericTreeTransformer

Commentaries in (...) are on substrates, and those in |...| are on products.

commentary(children)
Return type

Token

entry(children)
Return type

Dict[str, Any]

more_commentary(children)
Return type

Token

reaction(children)

Parse the reaction in the description node.

There should be three parts:
  • The left-hand side of the reaction, which is a list of chemical names, separated by <space>+<space>.

  • The separator <space>=<space>.

  • The right-hand side of the reaction, which is a list of chemical names, separated by <space>+<space>.
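The three-part split can be sketched with plain string operations. This is an illustration of the rule above, not the transformer's code; split_reaction and the sample reaction are hypothetical.

```python
def split_reaction(description: str) -> tuple[list[str], list[str]]:
    """Split "A + B = C + D" into left-hand and right-hand chemical names."""
    lhs, sep, rhs = description.partition(" = ")
    if not sep:
        raise ValueError(f"no ' = ' separator in {description!r}")
    left = [name.strip() for name in lhs.split(" + ")]
    right = [name.strip() for name in rhs.split(" + ")]
    return left, right


left, right = split_reaction("ethanol + NAD+ = acetaldehyde + NADH + H+")
```

Splitting on the separators with their surrounding spaces keeps charged species such as NAD+ and H+ intact.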

Return type

Token

reversibility(children)

r for reversible, ir for irreversible, ? for unknown.

Return type

Token

class metabolike.parser.brenda_transformer.RefTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

citation(children)
Return type

Token

entry(children)
Return type

Dict[str, Any]

paper_stat(children)
Return type

Token

pubmed(children)
Return type

Token

ref_id(children)
Return type

Token

class metabolike.parser.brenda_transformer.SpecificInfoTreeTransformer(visit_tokens=True)

Bases: GenericTreeTransformer

substrate(children)
Return type

Token

metabolike.parser.metacyc module

class metabolike.parser.metacyc.MetacycParser(sbml, reactions=None, atom_mapping=None, pathways=None, compounds=None, publications=None, classes=None)

Bases: SBMLParser

Converting MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

See sbml.SBMLParser for more information on SBML parsing.

Parameters
  • sbml (Union[str, Path]) – The path to the SBML file.

  • reactions (Union[str, Path, None]) – The path to the reaction.dat file. If given, the file will be parsed and extra annotation on Reaction nodes will be added.

  • atom_mapping (Union[str, Path, None]) – The path to the atom-mappings-smiles.dat file. If given, the file will be parsed and chemical reactions in the SMILES format will be added to the Reaction nodes.

  • pathways (Union[str, Path, None]) – The path to the pathway.dat file. If given, the file will be parsed and pathway links will be added to the Reaction nodes.

  • compounds (Union[str, Path, None]) – The path to the compound.dat file. If given, the file will be parsed and annotations on Compound nodes will be added.

  • publications (Union[str, Path, None]) – The path to the publication.dat file. If given, the file will be parsed and annotations on Citation nodes will be added.

  • classes (Union[str, Path, None]) – The path to the class.dat file. If given, the file will be parsed and annotations on Compartment, Taxa, and Compound nodes will be added.

sbml_file

Filepath to the input SBML file.

input_files

A dictionary of the paths to the input .dat files.

missing_ids

A dictionary of sets of IDs that were not found in the input files. This is helpful for collecting IDs that appear to be in one class but are actually in another. report_missing_ids() can be used to print them out.

collect_atom_mapping_dat_nodes(rxn_ids, smiles)
Parameters
  • rxn_ids (Iterable[str]) – The reaction names from the graph database.

  • smiles (Dict[str, str]) – The reaction ID -> SMILES dictionary from the atom mapping file.

Return type

List[Dict[str, Union[str, Dict[str, str]]]]

collect_citation_dat_nodes(cit_ids, pub_dat)

Annotate a citation node with data from the publication.dat file.

If there are multiple fields in the given cit_id, then the fields are separated by colons. The first field is the citation ID, the second is the evidence type (in classes.dat), the third is not documented, and the fourth is the curator’s name.

In most cases the citation ID should match: PUB-[A-Z0-9]+$, with a few exceptions containing double dashes, e.g. PUB--8, or some dashes within author names, e.g. PUB-CHIH-CHING95.
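The colon-separated layout and the ID pattern can be sketched as follows. This is an illustration only: split_citation, the key names and the sample value are hypothetical, and the third field is labelled "undocumented" because its meaning is not known.

```python
import re

# Pattern quoted above for regular citation IDs.
PUB_ID = re.compile(r"PUB-[A-Z0-9]+$")

# Hypothetical key names for the four colon-separated fields.
KEYS = ("citation_id", "evidence_type", "undocumented", "curator")


def split_citation(cit_id: str) -> dict[str, object]:
    """Split a citation string into named fields and flag irregular IDs."""
    parsed: dict[str, object] = dict(zip(KEYS, cit_id.split(":")))
    parsed["is_regular"] = bool(PUB_ID.match(str(parsed["citation_id"])))
    return parsed


# Made-up value with all four fields present.
full = split_citation("PUB-10504560:EV-EXP-IDA:3306172800:caspi")
```

The exceptions mentioned above, PUB--8 and PUB-CHIH-CHING95, do not match the pattern, so a caller needs a fallback for them.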

Parameters
  • cit_ids (Iterable[str]) – The citation metaId properties.

  • pub_dat (Dict[str, List[List[str]]]) – The publication.dat data.

collect_classes_dat_nodes(class_dat, cco_ids, taxa_ids)
collect_compounds_dat_nodes(all_cpds, cpd_dat)
collect_pathways_dat_nodes(all_pws, pw_dat, rxn_dat)
collect_reactions_dat_nodes(rxn_ids, rxn_dat)

Parse entries from the reaction attribute-value file, and prepare nodes to add to the graph database in one transaction.

Parameters
  • rxn_ids (Iterable[str]) – Full reaction metaId values.

  • rxn_dat (Dict[str, List[List[str]]]) – Output of read_dat_file() for the reaction.dat file.

Returns

A list of dictionaries, each of which contains the information

fix_pathway_nodes(pw_nodes, all_rxns)

Some fields in the list of Pathway nodes require preprocessing before being fed into the database. Specifically:

  • predecessors contains a list of reaction IDs wrapped in parentheses. We need to extract the first ID as the target reaction, and take all the others as preceding events of the first one.

  • reactionLayout tells us the primary reactants and products of the reactions in a given pathway.

  • pathwayLinks links the pathway to other pathways through intermediate Compound nodes.

While parsing these fields, we don’t add any new Reaction nodes. These are often hypothetical reactions and not part of the SBML file.
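The handling of a predecessors value can be sketched as below. This is a simplified illustration: split_predecessors and the sample IDs are hypothetical, and the value is assumed to be a single parenthesised, space-separated list.

```python
def split_predecessors(value: str) -> tuple[str, list[str]]:
    """Return (target reaction, preceding reactions) from "(A B C)"."""
    ids = value.strip().strip("()").split()
    if not ids:
        raise ValueError(f"no reaction IDs in {value!r}")
    # The first ID is the target; the rest are its preceding events.
    return ids[0], ids[1:]


target, preceding = split_predecessors("(RXN-3 RXN-1 RXN-2)")
```

In line with the note above, a caller would then drop any ID that is not in the set of valid reaction names rather than create a new Reaction node for it.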

Parameters
  • pw_nodes (List[Dict[str, Any]]) – Output of collect_pathways_dat_nodes().

  • all_rxns (Set[str]) – All valid reaction names.

static read_dat_file(filepath)
Return type

Dict[str, List[List[str]]]

static read_smiles_dat(filepath)
Return type

Dict[str, str]

report_missing_ids()

metabolike.parser.sbml module

class metabolike.parser.sbml.SBMLParser(sbml)

Bases: object

Converting MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

Parameters

sbml (Union[str, Path]) – The path to the MetaCyc SBML file to convert.

db

An SBMLClient instance. This is connected to Neo4j and used to perform all database operations. Should be closed after use.

sbml_file

Filepath to the input SBML file.

static collect_compartments(compartments)
Return type

List[Dict[str, Union[str, Dict[str, str]]]]

collect_compounds(compounds)
Return type

List[Dict[str, Union[str, Dict[str, str]]]]

collect_gene_products(gene_prods)
Return type

List[Dict[str, Union[str, Dict[str, str]]]]

static collect_groups(groups)
Return type

List[Dict[str, Union[str, Dict[str, str], List[str]]]]

Add gene products to a reaction. This could be complicated where the child nodes could be:

  1. GeneProductRef

  2. fbc:or -> GeneProductRef

  3. fbc:and -> GeneProductRef

  4. fbc:or -> {fbc:and -> GeneProductRef, GeneProductRef}
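The four shapes can be handled by one recursive walk. The sketch below uses plain strings and (operator, children) tuples as stand-ins for the SBML objects, so it is not the parser's code. gene_sets is a hypothetical helper that returns the alternative sets of gene products able to catalyse a reaction.

```python
from typing import Union

# A node is a gene product ID, or an ("or" | "and", [children]) tuple.
Assoc = Union[str, tuple]


def gene_sets(node: Assoc) -> list[frozenset[str]]:
    """Flatten a nested association into alternative gene product sets."""
    if isinstance(node, str):
        # Case 1: a bare gene product reference.
        return [frozenset([node])]
    op, children = node
    child_sets = [gene_sets(child) for child in children]
    if op == "or":
        # Any child alone is sufficient.
        return [s for sets in child_sets for s in sets]
    if op == "and":
        # Every child is required: combine one alternative from each child.
        combined = [frozenset()]
        for sets in child_sets:
            combined = [a | b for a in combined for b in sets]
        return combined
    raise ValueError(f"unknown operator {op!r}")


# Case 4: an "or" that mixes an "and" group with a single gene product.
alternatives = gene_sets(("or", [("and", ["g1", "g2"]), "g3"]))
```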

Parameters

reactions (Iterable[Reaction]) – An iterable of SBML reactions.

collect_reactions(reactions)
Return type

List[Dict[str, Union[str, Dict[str, str]]]]

static read_sbml(sbml_file)
Return type

SBMLDocument

Module contents