metabolike.parser package

Submodules

metabolike.parser.bkms_react module

metabolike.parser.bkms_react.get_bkms_tarball(filepath, extract=True)

The file is available at: https://bkms.brenda-enzymes.org/download.php

The compressed file (Reactions_BKMS.tar.gz) includes the table in tab stop separated format (Excel, OpenOffice). The table contains actual data of BRENDA (release 2021.2, only reactions with naturally occurring substrates), MetaCyc (version 24.5), SABIO-RK (07/02/2021) and KEGG data, downloaded on the 23rd of April 2012. Downloading more recent KEGG data cannot be offered because a KEGG license agreement would be necessary.

Parameters

filepath (str) – The path to store the downloaded file.
extract (bool) – Extract the file.

Return type

None

metabolike.parser.bkms_react.read_bkms(filepath, clean=True)

Read the BKMS-react table and prepare it for further processing.

The table contains random ^M characters in some rows. These characters won’t break pandas, but they will make the parsed table unexpectedly long. The clean parameter can be used to remove these characters.

BRENDA takes EC numbers as identifiers, so we need entries with non-empty EC_Number and Reaction_ID_MetaCyc columns. There is a Reaction_ID_BRENDA column that’s sometimes non-empty when the EC number is missing, but the field is not documented, and it’s not clear how it can be mapped to an entry in the BRENDA text file.

To be extra conservative, we only keep entries with non-empty Reaction_ID_KEGG columns. Only reactions in MetaCyc/BioCyc that are associated with matching EC numbers and KEGG IDs will be annotated.

Parameters

filepath (str) – The path to the BKMS-react .tsv file.
clean (bool) – Remove the ^M characters.

Return type

A pandas dataframe
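
The cleaning and filtering steps described for read_bkms() can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name filter_bkms_rows and the sample rows are hypothetical, and it assumes the ^M characters are carriage returns (\r).

```python
import csv
import io

# Columns that must be non-empty for a row to be kept (names from the text above).
REQUIRED_COLUMNS = ("EC_Number", "Reaction_ID_MetaCyc", "Reaction_ID_KEGG")


def filter_bkms_rows(raw_text, clean=True):
    """Return rows of a tab-separated table whose required columns are all filled."""
    if clean:
        # Stray carriage returns would otherwise be read as extra line breaks.
        raw_text = raw_text.replace("\r", "")
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    return [
        row
        for row in reader
        if all((row.get(col) or "").strip() for col in REQUIRED_COLUMNS)
    ]


# Hypothetical three-row table: only the first row has all three identifiers.
sample = (
    "EC_Number\tReaction_ID_MetaCyc\tReaction_ID_KEGG\n"
    "1.1.1.1\tRXN-1\r\tR00001\n"
    "1.1.1.2\t\tR00002\n"
    "\tRXN-3\tR00003\n"
)
kept = filter_bkms_rows(sample)
```

With pandas, the same filter would typically be a dropna() on the three columns after reading the cleaned text.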

metabolike.parser.brenda module

Reading and parsing the BRENDA text file.

A lot of the code is translated from the brendaDb R package. The BRENDA text file is organised in an EC-number specific format. The information on each EC-number is given in a very short and compact way in a part of the file. The contents of each line are described by a two/three letter acronym at the beginning of the line and start after a TAB. Empty spaces at the beginning of a line indicate a continuation line.

The contents are organised in ~40 information fields as given below. Protein information is included in #...#, literature citations are in <...>, commentaries in (...) and field-special information in {...}.

It’s not officially documented, but some fields also have commentaries wrapped in |...|. These are usually for reaction-related fields, so (...) would be for substrates and |...| for products.

Protein information is given as the combination organism/Uniprot accession number where available. When this information is not given in the original paper only the organism is given.

/// indicates the end of an EC-number specific part.
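
The line layout described above can be illustrated with a small regular-expression sketch. It is only an approximation for simple entries and is not the Lark grammar used by the package; the names ENTRY_RE, join_continuations and parse_entry, as well as the sample lines, are hypothetical.

```python
import re

# One entry: #proteins# description {field-special} (commentary) <references>
ENTRY_RE = re.compile(
    r"^#(?P<proteins>[\d,]+)#\s+"
    r"(?P<description>.*?)"
    r"(?:\s+\{(?P<specific>[^}]*)\})?"
    r"(?:\s+\((?P<commentary>.*)\))?"
    r"\s+<(?P<refs>[\d,]+)>$"
)


def join_continuations(lines):
    """Merge continuation lines (leading whitespace) and stop at the '///' marker."""
    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):
            break
        if merged and line[:1].isspace():
            merged[-1] += " " + line.strip()
        elif line.strip():
            merged.append(line)
    return merged


def parse_entry(line):
    """Split 'ACRONYM<TAB>content' and pull out the delimited parts."""
    field, _, body = line.partition("\t")
    match = ENTRY_RE.match(body.strip())
    if match is None:
        # Fall back to the raw text for entries the simple pattern cannot handle.
        return {"field": field, "raw": body}
    entry = match.groupdict()
    entry["field"] = field
    entry["proteins"] = [int(i) for i in entry["proteins"].split(",")]
    entry["refs"] = [int(i) for i in entry["refs"].split(",")]
    return entry


# Hypothetical two-line entry followed by the end-of-section marker.
lines = [
    "KM\t#5# 0.0026 {NADH} (#5# pH 7.0,\n",
    "\t25 degrees <12>) <12,13>\n",
    "///\n",
    "ignored after the marker\n",
]
entries = [parse_entry(line) for line in join_continuations(lines)]
```

A real grammar has to cope with nested brackets and the |...| commentaries, which this pattern does not attempt.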

metabolike.parser.brenda.parse_brenda(filepath, cache=False, ec_nums=None)

Parse the BRENDA text file into a dict.

This implementation focuses on extracting information from the text file, and feeding the data into a Neo4j database. The parser is implemented using Lark. A series of lark.visitors.Transformer classes are used to clean the data and convert it into the format required by Neo4j.

Parameters

filepath (Union[str, Path]) – The path to the BRENDA text file.
cache (bool) – Whether to cache the parsed data to a parquet file.
ec_nums (Optional[Iterable[str]]) – A list of EC numbers to extract.

Return type

Dict[str, Any]

Returns

A dict with the description column from read_brenda() transformed into lists of dicts stored as values, and EC numbers as keys.

metabolike.parser.brenda_transformer module

class metabolike.parser.brenda_transformer.BaseTransformer(visit_tokens=True)

Bases: Transformer

NUM_ID(num)

TOKEN(tok)

protein_id(children)

Return type: Token

ref_id(children)

Return type: Token

class metabolike.parser.brenda_transformer.CommentaryOnlyTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

description(children)

Return type: Token

entry(children)

Return type: Dict[str, Any]

class metabolike.parser.brenda_transformer.GenericTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

Transform extracted values from bottom-up.

Formats the tree into a dictionary, as described in entry().

commentary(children)

Return type: Token

content(children)

Return type: Dict[str, Union[str, List[int]]]

description(children)

Return type: Token

entry(children)

Return type: Dict[str, Any]

class metabolike.parser.brenda_transformer.ReactionTreeTransformer(visit_tokens=True)

Bases: GenericTreeTransformer

Commentaries in (...) are on substrates, and those in |...| are on products.

commentary(children)

Return type: Token

entry(children)

Return type: Dict[str, Any]

more_commentary(children)

Return type: Token

reaction(children)

Parse the reaction in the description node.

There should be three parts:

The left-hand-side of the reaction, which is a list of chemical names, separated by <space>+<space>.
The separator <space>=<space>.
The right-hand-side of the reaction, which is a list of chemical names, separated by <space>+<space>.

Return type: Token

reversibility(children)

r for reversible, ir for irreversible, ? for unknown.

Return type: Token
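
The three-part reaction layout and the reversibility codes described above can be sketched as follows. This is an illustration on plain strings rather than Lark tokens; split_reaction, REVERSIBILITY and the example reaction are hypothetical names and data.

```python
# Meaning of the reversibility codes, as listed above.
REVERSIBILITY = {"r": "reversible", "ir": "irreversible", "?": "unknown"}


def split_reaction(description):
    """Split 'A + B = C + D' into left-hand and right-hand chemical names."""
    lhs, separator, rhs = description.partition(" = ")
    if not separator:
        raise ValueError(f"no ' = ' separator in reaction: {description!r}")
    return {
        # Splitting on ' + ' (with spaces) keeps charges such as 'NAD+' intact.
        "lhs": [name.strip() for name in lhs.split(" + ")],
        "rhs": [name.strip() for name in rhs.split(" + ")],
    }


parts = split_reaction("ethanol + NAD+ = acetaldehyde + NADH + H+")
```

Requiring the surrounding spaces matters because chemical names themselves often contain + and = characters.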

class metabolike.parser.brenda_transformer.RefTreeTransformer(visit_tokens=True)

Bases: BaseTransformer

citation(children)

Return type: Token

entry(children)

Return type: Dict[str, Any]

paper_stat(children)

Return type: Token

pubmed(children)

Return type: Token

ref_id(children)

Return type: Token

class metabolike.parser.brenda_transformer.SpecificInfoTreeTransformer(visit_tokens=True)

Bases: GenericTreeTransformer

substrate(children)

Return type: Token

metabolike.parser.metacyc module

class metabolike.parser.metacyc.MetacycParser(sbml, reactions=None, atom_mapping=None, pathways=None, compounds=None, publications=None, classes=None)

Bases: SBMLParser

Converting MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

MetaCyc data files download: https://metacyc.org/downloads.shtml
MetaCyc file formats: http://bioinformatics.ai.sri.com/ptools/flatfile-format.html
SBML FAQ: https://synonym.caltech.edu/documents/faq

See sbml.SBMLParser for more information on SBML parsing.

Parameters

sbml (Union[str, Path]) – The path to the SBML file.
reactions (Union[str, Path, None]) – The path to the reaction.dat file. If given, the file will be parsed and extra annotation on Reaction nodes will be added.
atom_mapping (Union[str, Path, None]) – The path to the atom-mappings-smiles.dat file. If given, the file will be parsed and chemical reactions in the SMILES format will be added to the Reaction nodes.
pathways (Union[str, Path, None]) – The path to the pathway.dat file. If given, the file will be parsed and pathway links will be added to the Reaction nodes.
compounds (Union[str, Path, None]) – The path to the compound.dat file. If given, the file will be parsed and annotations on Compound nodes will be added.
publications (Union[str, Path, None]) – The path to the publication.dat file. If given, the file will be parsed and annotations on Citation nodes will be added.
classes (Union[str, Path, None]) – The path to the class.dat file. If given, the file will be parsed and annotations on Compartment, Taxa, and Compound nodes will be added.

sbml_file: Filepath to the input SBML file.

input_files: A dictionary of the paths to the input .dat files.

missing_ids: A dictionary of sets of IDs that were not found in the input files. This is helpful for collecting IDs that appear to be in one class but are actually in another. _report_missing_ids() can be used to print them out.

collect_atom_mapping_dat_nodes(rxn_ids, smiles)

Parameters

rxn_ids (Iterable[str]) – The reaction name from the graph database.
smiles (Dict[str, str]) – The reaction ID -> SMILES dictionary from the atom mapping file.

Return type

List[Dict[str, Union[str, Dict[str, str]]]]

collect_citation_dat_nodes(cit_ids, pub_dat)

Annotate a citation node with data from the publication.dat file.

If there are multiple fields in the given cit_id, then the fields are separated by colons. The first field is the citation ID, the second is the evidence type (in classes.dat), the third is not documented, and the fourth is the curator’s name.

In most cases the citation ID should match: PUB-[A-Z0-9]+$, with a few exceptions containing double dashes, e.g. PUB--8, or some dashes within author names, e.g. PUB-CHIH-CHING95.

Parameters

cit_ids (Iterable[str]) – The citation metaId properties.
pub_dat (Dict[str, List[List[str]]]) – The publication.dat data.
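
The colon-separated layout of a citation ID can be sketched as below. This is a hedged illustration only: split_citation, the key names, and the multi-field example value are hypothetical, and the pattern is the one quoted above with an added start anchor.

```python
import re

# Regular citation IDs; the exceptions mentioned above will not match.
PUB_RE = re.compile(r"^PUB-[A-Z0-9]+$")

# Field order as described above; the third field is undocumented.
FIELD_NAMES = ("citation_id", "evidence", "undocumented", "curator")


def split_citation(cit_id):
    """Split a colon-separated citation string into named fields."""
    fields = dict(zip(FIELD_NAMES, cit_id.split(":")))
    fields["regular"] = bool(PUB_RE.match(fields["citation_id"]))
    return fields


full = split_citation("PUB-12345:EV-EXP-IDA:3567:curator")  # hypothetical value
double_dash = split_citation("PUB--8")
author_dash = split_citation("PUB-CHIH-CHING95")
```

Flagging irregular IDs instead of rejecting them keeps the exceptions available for manual inspection.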

collect_classes_dat_nodes(class_dat, cco_ids, taxa_ids)

collect_compounds_dat_nodes(all_cpds, cpd_dat)

collect_pathways_dat_nodes(all_pws, pw_dat, rxn_dat)

collect_reactions_dat_nodes(rxn_ids, rxn_dat)

Parse entries from the reaction attribute-value file, and prepare nodes to add to the graph database in one transaction.

Parameters

rxn_ids (Iterable[str]) – The full metaId values of the reactions.
rxn_dat (Dict[str, List[List[str]]]) – Output of read_dat_file() for the reaction.dat file.

Returns

A list of dictionaries, each of which contains the information

fix_pathway_nodes(pw_nodes, all_rxns)

Some fields in the list of Pathway nodes require preprocessing before being fed into the database. Specifically:

predecessors contains a list of reaction IDs wrapped in parentheses. We need to extract the first ID as the target reaction, and take all the others as preceding events of the first one.
reactionLayout tells us the primary reactants and products of the reactions in a given pathway.
pathwayLinks links the pathway to other pathways through intermediate Compound nodes.

While parsing these fields, we don’t add any new Reaction nodes. These are often hypothetical reactions and not part of the SBML file.

Parameters

pw_nodes (List[Dict[str, Any]]) – Output of collect_pathways_dat_nodes().
all_rxns (Set[str]) – All valid reaction names.
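
The handling of the predecessors field described above can be sketched as follows. This is an assumption-laden illustration, not the method's actual code: split_predecessors and the reaction IDs are hypothetical, and IDs are assumed to be separated by whitespace inside the parentheses.

```python
def split_predecessors(value, all_rxns):
    """Return the target reaction and its preceding reactions, or None.

    Reactions missing from all_rxns are dropped, so no new Reaction
    nodes would be created from this field.
    """
    ids = value.strip().strip("()").split()
    if not ids or ids[0] not in all_rxns:
        return None
    return {
        "target": ids[0],
        "preceding": [rxn for rxn in ids[1:] if rxn in all_rxns],
    }


known = {"RXN-1", "RXN-2"}  # hypothetical set of valid reaction names
link = split_predecessors("(RXN-2 RXN-1 RXN-HYPOTHETICAL)", known)
skipped = split_predecessors("(RXN-HYPOTHETICAL RXN-1)", known)
```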

static read_dat_file(filepath)

Return type: Dict[str, List[List[str]]]
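
For orientation, a reader producing the return shape above might look like the sketch below. It rests on assumptions about the attribute-value format that are not stated on this page: records end with a // line, each line reads ATTRIBUTE - VALUE, lines starting with # are comments, and lines starting with / continue the previous value. The name read_dat_records and the sample record are hypothetical, and keying the result by UNIQUE-ID is a guess at what the dictionary keys are.

```python
def read_dat_records(lines):
    """Group attribute-value lines into {unique_id: [[attribute, value], ...]}."""
    records = {}
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "//":
            # End of record: file it under its UNIQUE-ID, if it has one.
            uid = next((v for k, v in current if k == "UNIQUE-ID"), None)
            if uid is not None:
                records[uid] = [[k, v] for k, v in current if k != "UNIQUE-ID"]
            current = []
        elif line.startswith("/") and current:
            # Continuation of the previous attribute's value.
            current[-1][1] += " " + line[1:]
        elif " - " in line:
            key, _, value = line.partition(" - ")
            current.append([key, value])
    return records


sample = [
    "# hypothetical record",
    "UNIQUE-ID - RXN-1",
    "EC-NUMBER - EC-1.1.1.1",
    "COMMENT - first line",
    "/second line",
    "//",
]
parsed = read_dat_records(sample)
```

A list of pairs, rather than a nested dictionary, preserves repeated attributes and their order within a record.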

static read_smiles_dat(filepath)

Return type: Dict[str, str]

report_missing_ids()

metabolike.parser.sbml module

class metabolike.parser.sbml.SBMLParser(sbml)

Bases: object

Converting MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

MetaCyc data files download: https://metacyc.org/downloads.shtml
MetaCyc file formats: http://bioinformatics.ai.sri.com/ptools/flatfile-format.html
SBML FAQ: https://synonym.caltech.edu/documents/faq

Parameters: sbml (Union[str, Path]) – The path to the MetaCyc SBML file to convert.

db: A SBMLClient instance. This is connected to neo4j and used

to perform all database operations. Should be closed after use.

sbml_file: Filepath to the input SBML file.

static collect_compartments(compartments)

Return type: List[Dict[str, Union[str, Dict[str, str]]]]

collect_compounds(compounds)

Return type: List[Dict[str, Union[str, Dict[str, str]]]]

collect_gene_products(gene_prods)

Return type: List[Dict[str, Union[str, Dict[str, str]]]]

static collect_groups(groups)

Return type: List[Dict[str, Union[str, Dict[str, str], List[str]]]]

collect_reaction_gene_product_links(reactions)

Add gene products to a reaction. This can be complicated because the child nodes may take any of the following forms:

GeneProductRef
fbc:or -> GeneProductRef
fbc:and -> GeneProductRef
fbc:or -> {fbc:and -> GeneProductRef, GeneProductRef}

Parameters: reactions (Iterable[Reaction]) – An iterable of SBML reactions.
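
The nesting of fbc:or and fbc:and nodes listed above can be flattened recursively. The sketch below does not use libSBML: it stands in for the SBML nodes with plain strings (a GeneProductRef) and ("or" | "and", children) tuples, a hypothetical representation chosen only to show the recursion. The name gene_sets is likewise hypothetical.

```python
def gene_sets(node):
    """Expand a nested and/or association into alternative gene combinations.

    Each inner list holds gene products that are required together;
    the outer list enumerates the alternatives.
    """
    if isinstance(node, str):
        return [[node]]  # a bare GeneProductRef
    operator, children = node
    child_sets = [gene_sets(child) for child in children]
    if operator == "or":
        # Any child alternative is sufficient: concatenate the alternatives.
        return [combo for sets in child_sets for combo in sets]
    if operator == "and":
        # Every child is required: combine one alternative from each child.
        combos = [[]]
        for sets in child_sets:
            combos = [left + right for left in combos for right in sets]
        return combos
    raise ValueError(f"unknown operator: {operator!r}")


# fbc:or -> {fbc:and -> GeneProductRef, GeneProductRef}
association = ("or", [("and", ["G1", "G2"]), "G3"])
alternatives = gene_sets(association)
```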

collect_reactions(reactions)

Return type: List[Dict[str, Union[str, Dict[str, str]]]]

static read_sbml(sbml_file)

Return type: SBMLDocument
