metabolike.parser package
Submodules
metabolike.parser.bkms_react module
- metabolike.parser.bkms_react.get_bkms_tarball(filepath, extract=True)
The file is available at: https://bkms.brenda-enzymes.org/download.php
The compressed file (Reactions_BKMS.tar.gz) includes the table in tab-separated format (Excel, OpenOffice). The table contains actual data of BRENDA (release 2021.2, only reactions with naturally occurring substrates), MetaCyc (version 24.5), SABIO-RK (07/02/2021) and KEGG data, downloaded on the 23rd of April 2012. Downloading more recent KEGG data cannot be offered because a KEGG license agreement would be necessary.
- Parameters
filepath (str) – The path to store the downloaded file.
extract (bool) – Extract the file.
- Return type
None
- metabolike.parser.bkms_react.read_bkms(filepath, clean=True)
Read the BKMS-react table and prepare it for further processing.
The table contains stray ^M (carriage return) characters in some rows. These characters won’t break pandas, but they will make the parsed table unexpectedly long. The clean parameter can be used to remove these characters.
BRENDA uses EC numbers as identifiers, so we need entries with non-empty EC_Number and Reaction_ID_MetaCyc columns. There is a Reaction_ID_BRENDA column that’s sometimes non-empty when the EC number is missing, but the field is not documented, and it’s not clear how it can be mapped to an entry in the BRENDA text file.
To be extra conservative, we only keep entries with non-empty Reaction_ID_KEGG columns. Only reactions in MetaCyc/BioCyc that are associated with matching EC numbers and KEGG IDs will be annotated.
- Parameters
filepath (str) – The path to the BKMS-react .tsv file.
clean (bool) – Remove the ^M characters.
- Return type
A pandas dataframe
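The cleaning and filtering described for read_bkms() can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation (which returns a pandas dataframe): the helper name filter_bkms_rows and the sample rows are made up, and only the three column names come from the description above.

```python
import csv
import io

# Columns that must be non-empty for a row to be kept (names from the docs above).
REQUIRED = ("EC_Number", "Reaction_ID_MetaCyc", "Reaction_ID_KEGG")


def filter_bkms_rows(raw_text, clean=True):
    """Hypothetical helper: parse tab-separated text and keep fully identified rows."""
    if clean:
        # Stray carriage returns (^M) would otherwise be read as extra row breaks.
        raw_text = raw_text.replace("\r", "")
    reader = csv.DictReader(
        io.StringIO(raw_text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row
        for row in reader
        if all((row.get(col) or "").strip() for col in REQUIRED)
    ]


# Made-up sample: the first data row has a stray ^M in the middle of a field.
sample = (
    "EC_Number\tReaction_ID_MetaCyc\tReaction_ID_KEGG\tReaction\n"
    "1.1.1.1\tRXN-1\tR00001\tA + B\r = C\n"
    "1.1.1.2\t\tR00002\tD = E\n"
    "\tRXN-3\tR00003\tF = G\n"
)
rows = filter_bkms_rows(sample)
```

Removing the carriage returns before parsing keeps each record on one row; the two rows with a missing EC number or MetaCyc ID are dropped.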
metabolike.parser.brenda module
Reading and parsing the BRENDA text file.
A lot of the code is translated from the brendaDb R package. The BRENDA text file is organised in an EC-number specific format. The information on each EC-number is given in a very short and compact way in a part of the file. The contents of each line are described by a two/three letter acronym at the beginning of the line and start after a TAB. Empty spaces at the beginning of a line indicate a continuation line.
The contents are organised in ~40 information fields as given below. Protein information is included in #...#, literature citations are in <...>, commentaries in (...) and field-special information in {...}.
It’s not officially documented, but some fields also have commentaries wrapped in |...|. These are usually for reaction-related fields, so (...) would be for substrates and |...| for products.
Protein information is given as the combination organism/Uniprot accession number where available. When this information is not given in the original paper only the organism is given.
/// indicates the end of an EC-number specific part.
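To make the markers above concrete, here is a rough regular-expression sketch that splits one field value into its parts. The package itself uses a Lark grammar, so this is only an approximation: the pattern, the helper name split_entry and the sample strings are assumptions, and the pattern does not handle descriptions that themselves contain parentheses or braces.

```python
import re

ENTRY = re.compile(
    r"^#(?P<proteins>[\d,]+)#\s*"        # protein IDs in #...#
    r"(?P<description>[^{(<]*?)\s*"      # free text up to the first marker
    r"(?:\{(?P<specific>[^}]*)\}\s*)?"   # optional field-special information
    r"(?:\((?P<commentary>.*)\)\s*)?"    # optional commentary, may nest #...# and <...>
    r"<(?P<refs>[\d,]+)>\s*$"            # trailing literature citations in <...>
)


def split_entry(value):
    """Hypothetical helper: split a single BRENDA field value into named parts."""
    match = ENTRY.match(value)
    if match is None:
        raise ValueError(f"unrecognised entry: {value!r}")
    parts = match.groupdict()
    # Protein and reference IDs are comma-separated integers.
    parts["proteins"] = [int(i) for i in parts["proteins"].split(",") if i]
    parts["refs"] = [int(i) for i in parts["refs"].split(",") if i]
    return parts


# Made-up values that follow the marker conventions described above.
full = split_entry("#10# 0.25 {ethanol} (#10# pH 7.5 <12>) <12>")
short = split_entry("#1,2# alcohol dehydrogenase <3,4>")
```

A grammar-based parser is the more robust choice for the real file, since chemical names frequently contain brackets of their own.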
- metabolike.parser.brenda.parse_brenda(filepath, cache=False, ec_nums=None)
Parse the BRENDA text file into a dict.
This implementation focuses on extracting information from the text file and feeding the data into a Neo4j database. The parser is implemented using Lark. A series of lark.visitors.Transformer classes are used to clean the data and convert it into the format required by Neo4j.
- Parameters
filepath (Union[str,Path]) – The path to the BRENDA text file.
cache (bool) – Whether to cache the parsed data to a parquet file.
ec_nums (Optional[Iterable[str]]) – A list of EC numbers to extract.
- Return type
Dict[str,Any]
- Returns
A dict with the description column from read_brenda() transformed into lists of dicts stored as values, and EC numbers as keys.
metabolike.parser.brenda_transformer module
- class metabolike.parser.brenda_transformer.BaseTransformer(visit_tokens=True)
Bases: Transformer
- NUM_ID(num)
- TOKEN(tok)
- protein_id(children)
- Return type
Token
- ref_id(children)
- Return type
Token
- class metabolike.parser.brenda_transformer.CommentaryOnlyTreeTransformer(visit_tokens=True)
Bases: BaseTransformer
- description(children)
- Return type
Token
- entry(children)
- Return type
Dict[str,Any]
- class metabolike.parser.brenda_transformer.GenericTreeTransformer(visit_tokens=True)
Bases: BaseTransformer
Transform extracted values from the bottom up.
Formats the tree into a dictionary, as described in entry().
- commentary(children)
- Return type
Token
- content(children)
- Return type
Dict[str,Union[str,List[int]]]
- description(children)
- Return type
Token
- entry(children)
- Return type
Dict[str,Any]
- class metabolike.parser.brenda_transformer.ReactionTreeTransformer(visit_tokens=True)
Bases: GenericTreeTransformer
Commentaries in (...) are on substrates, and those in |...| are on products.
- commentary(children)
- Return type
Token
- entry(children)
- Return type
Dict[str,Any]
- more_commentary(children)
- Return type
Token
- reaction(children)
Parse the reaction in the description node.
There should be three parts:
- The left-hand side of the reaction, which is a list of chemical names separated by <space>+<space>.
- The separator <space>=<space>.
- The right-hand side of the reaction, which is a list of chemical names separated by <space>+<space>.
- Return type
Token
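The three-part structure described for reaction() can be illustrated with plain string operations. This is a sketch only: the actual method works on Lark tree children and returns a Token, and the function name split_reaction and the sample reaction are assumptions.

```python
def split_reaction(reaction):
    """Hypothetical helper: split 'A + B = C + D' into left- and right-hand sides."""
    # The separator is '=' surrounded by single spaces.
    lhs, sep, rhs = reaction.partition(" = ")
    if not sep:
        raise ValueError(f"no ' = ' separator in {reaction!r}")
    # Chemical names are separated by ' + '. Requiring the surrounding
    # spaces keeps charged species such as 'NAD+' or 'H+' intact.
    return {
        "lhs": [name.strip() for name in lhs.split(" + ")],
        "rhs": [name.strip() for name in rhs.split(" + ")],
    }


parts = split_reaction("ethanol + NAD+ = acetaldehyde + NADH + H+")
```

Splitting on the spaced separators rather than on bare + or = characters is what makes names ending in a plus sign safe.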
- reversibility(children)
r for reversible, ir for irreversible, ? for unknown.
- Return type
Token
- class metabolike.parser.brenda_transformer.RefTreeTransformer(visit_tokens=True)
Bases: BaseTransformer
- citation(children)
- Return type
Token
- entry(children)
- Return type
Dict[str,Any]
- paper_stat(children)
- Return type
Token
- pubmed(children)
- Return type
Token
- ref_id(children)
- Return type
Token
- class metabolike.parser.brenda_transformer.SpecificInfoTreeTransformer(visit_tokens=True)
Bases: GenericTreeTransformer
- substrate(children)
- Return type
Token
metabolike.parser.metacyc module
- class metabolike.parser.metacyc.MetacycParser(sbml, reactions=None, atom_mapping=None, pathways=None, compounds=None, publications=None, classes=None)
Bases: SBMLParser
Converts MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:
MetaCyc data files download: https://metacyc.org/downloads.shtml
MetaCyc file formats: http://bioinformatics.ai.sri.com/ptools/flatfile-format.html
See sbml.SBMLParser for more information on SBML parsing.
- Parameters
sbml (Union[str,Path]) – The path to the SBML file.
reactions (Union[str,Path,None]) – The path to the reaction.dat file. If given, the file will be parsed and extra annotation on Reaction nodes will be added.
atom_mapping (Union[str,Path,None]) – The path to the atom-mappings-smiles.dat file. If given, the file will be parsed and chemical reactions in the SMILES format will be added to the Reaction nodes.
pathways (Union[str,Path,None]) – The path to the pathway.dat file. If given, the file will be parsed and pathway links will be added to the Reaction nodes.
compounds (Union[str,Path,None]) – The path to the compound.dat file. If given, the file will be parsed and annotations on Compound nodes will be added.
publications (Union[str,Path,None]) – The path to the publication.dat file. If given, the file will be parsed and annotations on Citation nodes will be added.
classes (Union[str,Path,None]) – The path to the class.dat file. If given, the file will be parsed and annotations on Compartment, Taxa, and Compound nodes will be added.
- sbml_file
Filepath to the input SBML file.
- input_files
A dictionary of the paths to the input .dat files.
- missing_ids
A dictionary of sets of IDs that were not found in the input files. This is helpful for collecting IDs that appear to be in one class but are actually in another. report_missing_ids() can be used to print them out.
- collect_atom_mapping_dat_nodes(rxn_ids, smiles)
- Parameters
rxn_ids (Iterable[str]) – The reaction names from the graph database.
smiles (Dict[str,str]) – The reaction ID -> SMILES dictionary from the atom mapping file.
- Return type
List[Dict[str,Union[str,Dict[str,str]]]]
- collect_citation_dat_nodes(cit_ids, pub_dat)
Annotate a citation node with data from the publication.dat file.
If there are multiple fields in the given cit_id, then the fields are separated by colons. The first field is the citation ID, the second is the evidence type (in classes.dat), the third is not documented, and the fourth is the curator’s name.
In most cases the citation ID should match PUB-[A-Z0-9]+$, with a few exceptions containing double dashes, e.g. PUB--8, or some dashes within author names, e.g. PUB-CHIH-CHING95.
- Parameters
cit_ids (Iterable[str]) – The citation metaId properties.
pub_dat (Dict[str,List[List[str]]]) – The publication.dat data.
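The colon-separated layout and the ID pattern described for collect_citation_dat_nodes() might be handled along these lines. This is a minimal sketch: the helper name split_citation, the returned keys and the sample identifiers are assumptions, and only the field order and the regular expression come from the description above.

```python
import re

# Pattern quoted in the description; exceptions such as PUB--8 will not match it.
TYPICAL_ID = re.compile(r"PUB-[A-Z0-9]+$")


def split_citation(cit_id):
    """Hypothetical helper: split a citation metaId into its colon-separated fields."""
    fields = cit_id.split(":")
    # Pad so that IDs with fewer than four fields still unpack cleanly.
    fields += [""] * (4 - len(fields))
    pub_id, evidence, undocumented, curator = fields[:4]
    return {
        "pub_id": pub_id,
        "evidence": evidence,
        "undocumented": undocumented,  # third field, meaning not documented
        "curator": curator,
        "is_typical_id": bool(TYPICAL_ID.match(pub_id)),
    }


# Made-up identifiers for illustration only.
with_fields = split_citation("PUB-12345:EV-EXP:0:someone")
bare = split_citation("PUB-CHIH-CHING95")
double_dash = split_citation("PUB--8")
```

Flagging the atypical IDs instead of rejecting them keeps the documented exceptions usable while still making them easy to inspect.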
- collect_classes_dat_nodes(class_dat, cco_ids, taxa_ids)
- collect_compounds_dat_nodes(all_cpds, cpd_dat)
- collect_pathways_dat_nodes(all_pws, pw_dat, rxn_dat)
- collect_reactions_dat_nodes(rxn_ids, rxn_dat)
Parse entries from the reaction attribute-value file, and prepare nodes to add to the graph database in one transaction.
- Parameters
rxn_ids (Iterable[str]) – Full metaId values of the reactions.
rxn_dat (Dict[str,List[List[str]]]) – Output of read_dat_file() for the reaction.dat file.
- Returns
A list of dictionaries, each of which contains the information
- fix_pathway_nodes(pw_nodes, all_rxns)
Some fields in the list of Pathway nodes require preprocessing before being fed into the database. Specifically:
- predecessors contains a list of reaction IDs wrapped in parentheses. We need to extract the first ID as the target reaction, and take all the others as preceding events of the first one.
- reactionLayout tells us the primary reactants and products of the reactions in a given pathway.
- pathwayLinks links the pathway to other pathways through intermediate Compound nodes.
While parsing these fields, we don’t add any new Reaction nodes. These are often hypothetical reactions and not part of the SBML file.
- Parameters
pw_nodes (List[Dict[str,Any]]) – Output of collect_pathways_dat_nodes().
all_rxns (Set[str]) – All valid reaction names.
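The handling of the predecessors field, together with the rule that no new Reaction nodes are created, could look roughly like this. It is a sketch under assumptions: the helper name split_predecessors, the whitespace-separated layout inside the parentheses and the sample IDs are not taken from the package.

```python
def split_predecessors(value, all_rxns):
    """Hypothetical helper: return (target, preceding) for one predecessors entry.

    Returns None when the target reaction is not a known reaction, and drops
    preceding reactions that are unknown, so no new Reaction nodes are implied.
    """
    text = value.strip()
    # Remove exactly one pair of wrapping parentheses, if present.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    ids = text.split()
    if not ids or ids[0] not in all_rxns:
        return None
    target, preceding = ids[0], ids[1:]
    return target, [rxn for rxn in preceding if rxn in all_rxns]


# Made-up reaction IDs for illustration.
known = {"RXN-A", "RXN-B"}
pair = split_predecessors("(RXN-B RXN-A RXN-HYPOTHETICAL)", known)
skipped = split_predecessors("(RXN-HYPOTHETICAL RXN-A)", known)
```

Filtering against the set of valid reaction names mirrors the all_rxns parameter: reactions absent from the SBML file are ignored rather than added.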
- static read_dat_file(filepath)
- Return type
Dict[str,List[List[str]]]
- static read_smiles_dat(filepath)
- Return type
Dict[str,str]
- report_missing_ids()
metabolike.parser.sbml module
- class metabolike.parser.sbml.SBMLParser(sbml)
Bases: object
Converts MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:
MetaCyc data files download: https://metacyc.org/downloads.shtml
MetaCyc file formats: http://bioinformatics.ai.sri.com/ptools/flatfile-format.html
- Parameters
sbml (Union[str,Path]) – The path to the MetaCyc SBML file to convert.
- db
An SBMLClient instance. This is connected to Neo4j and used to perform all database operations. Should be closed after use.
- sbml_file
Filepath to the input SBML file.
- static collect_compartments(compartments)
- Return type
List[Dict[str,Union[str,Dict[str,str]]]]
- collect_compounds(compounds)
- Return type
List[Dict[str,Union[str,Dict[str,str]]]]
- collect_gene_products(gene_prods)
- Return type
List[Dict[str,Union[str,Dict[str,str]]]]
- static collect_groups(groups)
- Return type
List[Dict[str,Union[str,Dict[str,str],List[str]]]]
- collect_reaction_gene_product_links(reactions)
Add gene products to a reaction. This can be complicated because the child nodes can take any of the following forms:
GeneProductRef
fbc:or -> GeneProductRef
fbc:and -> GeneProductRef
fbc:or -> {fbc:and -> GeneProductRef, GeneProductRef}
- Parameters
reactions (
Iterable[Reaction]) – An iterable of SBML reactions.
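One way to reason about the four nesting shapes listed for collect_reaction_gene_product_links() is to flatten them into alternative gene-product sets. The sketch below does not use libSBML objects: the tuple representation ("or"/"and" plus a list of children, with plain strings standing in for GeneProductRef) and the function name flatten_association are assumptions made purely for illustration.

```python
def flatten_association(node):
    """Hypothetical helper: return a list of sets of gene product IDs.

    Each set is one complex whose members are all required (AND);
    the list holds the alternatives (OR).
    """
    if isinstance(node, str):
        # A bare GeneProductRef: a single alternative with a single member.
        return [{node}]
    op, children = node
    if op == "or":
        # Alternatives of every child are simply collected.
        return [alt for child in children for alt in flatten_association(child)]
    if op == "and":
        # Every child is required: combine one alternative from each child.
        combos = [set()]
        for child in children:
            combos = [c | alt for c in combos for alt in flatten_association(child)]
        return combos
    raise ValueError(f"unknown operator: {op!r}")


# The most nested shape from the list above: fbc:or -> {fbc:and -> refs, ref}.
nested = flatten_association(("or", [("and", ["G1", "G2"]), "G3"]))
single = flatten_association("G1")
```

The recursion covers all four listed shapes with the same two rules, which is why the nested case needs no special handling.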
- collect_reactions(reactions)
- Return type
List[Dict[str,Union[str,Dict[str,str]]]]
- static read_sbml(sbml_file)
- Return type
SBMLDocument