metacyc

MetacycParser(sbml, reactions=None, atom_mapping=None, pathways=None, compounds=None, publications=None, classes=None)

Bases: SBMLParser

Converts MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

  • MetaCyc data files download: https://metacyc.org/downloads.shtml
  • MetaCyc file formats: https://bioinformatics.ai.sri.com/ptools/flatfile-format.html
  • SBML FAQ: https://synonym.caltech.edu/documents/faq

See sbml.SBMLParser for more information on SBML parsing.

Parameters:

  • sbml (Union[str, Path], required): The path to the SBML file.
  • reactions (Optional[Union[str, Path]], default None): The path to the reaction.dat file. If given, the file will be parsed and extra annotation on Reaction nodes will be added.
  • atom_mapping (Optional[Union[str, Path]], default None): The path to the atom-mappings-smiles.dat file. If given, the file will be parsed and chemical reactions in the SMILES format will be added to the Reaction nodes.
  • pathways (Optional[Union[str, Path]], default None): The path to the pathway.dat file. If given, the file will be parsed and pathway links will be added to the Reaction nodes.
  • compounds (Optional[Union[str, Path]], default None): The path to the compound.dat file. If given, the file will be parsed and annotations on Compound nodes will be added.
  • publications (Optional[Union[str, Path]], default None): The path to the publication.dat file. If given, the file will be parsed and annotations on Citation nodes will be added.
  • classes (Optional[Union[str, Path]], default None): The path to the class.dat file. If given, the file will be parsed and annotations on Compartment, Taxa, and Compound nodes will be added.

Attributes:

  • sbml_file: Filepath to the input SBML file.
  • input_files: A dictionary of the paths to the input .dat files.
  • missing_ids (dict[str, set[str]]): A dictionary of sets of IDs that were not found in the input files. This is helpful for collecting IDs that appear to be in one class but are actually in another. _report_missing_ids can be used to print them out.

Source code in parser/metacyc.py
def __init__(
    self,
    sbml: Union[str, Path],
    reactions: Optional[Union[str, Path]] = None,
    atom_mapping: Optional[Union[str, Path]] = None,
    pathways: Optional[Union[str, Path]] = None,
    compounds: Optional[Union[str, Path]] = None,
    publications: Optional[Union[str, Path]] = None,
    classes: Optional[Union[str, Path]] = None,
):
    # Neo4j driver and SBML file path
    super().__init__(sbml)

    # File paths
    self.input_files = {
        "reactions": validate_path(reactions),
        "atom_mapping": validate_path(atom_mapping),
        "pathways": validate_path(pathways),
        "compounds": validate_path(compounds),
        "publications": validate_path(publications),
        "classes": validate_path(classes),
    }
    logger.info(f"Input files: {self.input_files}")

    # Placeholder for missing IDs in the dat files
    self.missing_ids: dict[str, set[str]] = {
        "reactions": set(),
        "atom_mappings": set(),
        "pathways": set(),
        "compounds": set(),
        "publications": set(),
        "compartments": set(),
        "taxon": set(),
    }
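
The missing_ids placeholder is a plain dictionary of sets, so recording an ID that could not be found is a single set.add call. The sketch below illustrates that bookkeeping in isolation. The function report_missing_ids and the compound IDs are hypothetical stand-ins for illustration; they are not the library's _report_missing_ids, whose implementation is not shown here.

```python
def report_missing_ids(missing_ids: dict[str, set[str]]) -> list[str]:
    """Hypothetical reporter: one summary line per category that has missing IDs."""
    lines = []
    for category, ids in missing_ids.items():
        if ids:
            lines.append(f"{category}: {len(ids)} missing ({', '.join(sorted(ids))})")
    return lines


# Same categories as the placeholder created in __init__.
missing_ids: dict[str, set[str]] = {
    "reactions": set(),
    "atom_mappings": set(),
    "pathways": set(),
    "compounds": set(),
    "publications": set(),
    "compartments": set(),
    "taxon": set(),
}

# Made-up IDs: anything not found in the lookup table is recorded once.
known_compounds = {"WATER", "ATP"}
for cpd_id in ["WATER", "CPD-0000", "ATP", "CPD-0000"]:
    if cpd_id not in known_compounds:
        missing_ids["compounds"].add(cpd_id)

for line in report_missing_ids(missing_ids):
    print(line)
```

Using sets means an ID that is looked up many times is still reported only once.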

collect_atom_mapping_dat_nodes(rxn_ids, smiles)

Parameters:

  • rxn_ids (Iterable[str], required): The reaction names from the graph database.
  • smiles (dict[str, str], required): The reaction ID -> SMILES dictionary from the atom mapping file.

Source code in parser/metacyc.py
def collect_atom_mapping_dat_nodes(
    self, rxn_ids: Iterable[str], smiles: dict[str, str]
) -> list[dict[str, Union[str, dict[str, str]]]]:
    """Collect the SMILES atom mappings for the given reactions.

    Args:
        rxn_ids: The reaction names from the graph database.
        smiles: The reaction ID -> SMILES dictionary from the atom
            mapping file.
    """
    nodes = []
    for rxn_id in tqdm(rxn_ids, desc="atom_mapping.dat file"):
        canonical_id = self._find_rxn_canonical_id(rxn_id, smiles.keys())
        if canonical_id not in smiles:
            self.missing_ids["atom_mappings"].add(canonical_id)
            continue
        node = {
            "name": rxn_id,
            "props": {"smilesAtomMapping": smiles[canonical_id]},
        }
        nodes.append(node)

    return nodes
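
The lookup-and-collect pattern used above can be tried without a database. The following is a minimal sketch, assuming that canonical ID resolution can be passed in as a callable (identity by default). The function name collect_smiles_nodes, the canonicalize parameter, the reaction IDs, and the SMILES string are all hypothetical; the real method relies on _find_rxn_canonical_id, which is not shown here.

```python
from typing import Callable, Iterable


def collect_smiles_nodes(
    rxn_ids: Iterable[str],
    smiles: dict[str, str],
    canonicalize: Callable[[str], str] = lambda rxn_id: rxn_id,
) -> tuple[list[dict], set[str]]:
    """Build one node per reaction with a SMILES entry; collect the rest as missing."""
    nodes: list[dict] = []
    missing: set[str] = set()
    for rxn_id in rxn_ids:
        canonical_id = canonicalize(rxn_id)  # stand-in for the library's ID lookup
        if canonical_id not in smiles:
            missing.add(canonical_id)
            continue
        nodes.append(
            {"name": rxn_id, "props": {"smilesAtomMapping": smiles[canonical_id]}}
        )
    return nodes, missing


# Illustrative data only: a made-up reaction ID and atom-mapped reaction SMILES.
smiles = {"RXN-1": "[CH3:1][OH:2]>>[CH2:1]=[O:2]"}
nodes, missing = collect_smiles_nodes(["RXN-1", "RXN-2"], smiles)
print(nodes)
print(missing)
```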

collect_citation_dat_nodes(cit_ids, pub_dat)

Annotate a citation node with data from the publication.dat file.

If there are multiple fields in the given cit_id, then the fields are separated by colons. The first field is the citation ID, the second is the evidence type (in classes.dat), the third is not documented, and the fourth is the curator's name.

In most cases the citation ID should match: PUB-[A-Z0-9]+$, with a few exceptions containing double dashes, e.g. PUB--8, or some dashes within author names, e.g. PUB-CHIH-CHING95.
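
As a small illustration of the two conventions described above, the sketch below splits a colon-separated citation string into its four fields and checks IDs against the PUB-[A-Z0-9]+$ pattern. The names CitationFields and split_citation are hypothetical helpers introduced only for this example; the sample string is the evidence frame quoted in the method's source comment.

```python
import re
from typing import NamedTuple, Optional

# The pattern most citation IDs are expected to match.
CITATION_ID_PATTERN = re.compile(r"PUB-[A-Z0-9]+$")


class CitationFields(NamedTuple):
    citation_id: str
    evidence_type: Optional[str] = None
    undocumented: Optional[str] = None  # the third field has no documented meaning
    curator: Optional[str] = None


def split_citation(cit_id: str) -> CitationFields:
    """Split a colon-separated citation string into at most four fields."""
    return CitationFields(*cit_id.split(":")[:4])


fields = split_citation("10066805:EV-EXP-IDA:3354997111:hartmut")
print(fields)

# The common case matches; the documented exceptions do not.
for candidate in ["PUB-10066805", "PUB--8", "PUB-CHIH-CHING95"]:
    print(candidate, bool(CITATION_ID_PATTERN.match(candidate)))
```

A citation without extra fields simply leaves the optional fields as None.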

Parameters:

  • cit_ids (Iterable[str], required): The citation metaId properties.
  • pub_dat (dict[str, list[list[str]]], required): The publication.dat data.

Source code in parser/metacyc.py
def collect_citation_dat_nodes(
    self, cit_ids: Iterable[str], pub_dat: dict[str, list[list[str]]]
) -> list[dict[str, Any]]:
    """Annotate a citation node with data from the publication.dat file.

    If there are multiple fields in the given ``cit_id``, then the fields are
    separated by colons. The first field is the citation ID, the second is the
    evidence type (in classes.dat), the third is not documented, and the
    fourth is the curator's name.

    In most cases the citation ID should match: ``PUB-[A-Z0-9]+$``,
    with a few exceptions containing double dashes, e.g. ``PUB--8``,
    or some dashes within author names, e.g. ``PUB-CHIH-CHING95``.

    Args:
        cit_ids: The citation ``metaId`` properties.
        pub_dat: The publication.dat data.
    """
    # TODO: deal with evidence frames. Evidence frames are in the form of
    # 10066805:EV-EXP-IDA:3354997111:hartmut
    nodes = []
    for cit_id in tqdm(cit_ids, desc="pubs.dat file"):
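        # Keep only the citation ID (the first colon-separated field), uppercase
        # it, and strip brackets, whitespace, commas, and single quotes.
        pub_dat_id = re.sub(r"[\[\]\s,']", "", cit_id.split(":")[0].upper())
        # Drop any dash that directly precedes digits, e.g. "X-95" -> "X95".
        pub_dat_id = re.sub(r"-(\d+)", r"\1", pub_dat_id)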
        if pub_dat_id == "BOREJSZA-WYSOCKI94":
            pub_dat_id = "BOREJSZAWYSOCKI94"  # only exception with dash removed
        if not pub_dat_id:
            continue
        pub_dat_id = "PUB-" + pub_dat_id
        if pub_dat_id not in pub_dat:
            self.missing_ids["publications"].add(pub_dat_id)
            continue

        lines = pub_dat[pub_dat_id]
        node = {"metaId": cit_id, "props": {"citationId": pub_dat_id}}
        node = self._dat_entry_to_node(
            node,
            lines,
            props_str_keys={
                "DOI-ID",
                "PUBMED-ID",
                "MEDLINE-ID",
                "TITLE",
                "SOURCE",
                "YEAR",
                "URL",
                "REFERENT-FRAME",
            },
            props_list_keys={"AUTHORS"},
        )
        nodes.append(node)

    return nodes
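
To see what the ID normalization in the method above does to concrete inputs, the same steps can be restated as a standalone function. This is a sketch for experimentation: the function name normalize_citation_id is hypothetical, and the bracketed input string is a made-up spelling of an ID mentioned in the docstring.

```python
import re
from typing import Optional


def normalize_citation_id(cit_id: str) -> Optional[str]:
    """Turn a raw citation string into a publication ID with the PUB- prefix."""
    # First colon-separated field, uppercased, without brackets/whitespace/commas/quotes.
    pub_id = re.sub(r"[\[\]\s,']", "", cit_id.split(":")[0].upper())
    # Remove a dash that directly precedes digits.
    pub_id = re.sub(r"-(\d+)", r"\1", pub_id)
    if pub_id == "BOREJSZA-WYSOCKI94":
        pub_id = "BOREJSZAWYSOCKI94"  # the one special case handled in the source
    if not pub_id:
        return None  # nothing left to look up
    return "PUB-" + pub_id


for raw in [
    "10066805:EV-EXP-IDA:3354997111:hartmut",
    "[Chih-Ching95]",
    "Borejsza-Wysocki94",
    "",
]:
    print(repr(raw), "->", normalize_citation_id(raw))
```

Note that a dash between letters is preserved, which is why the Borejsza-Wysocki case needs its own branch.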

collect_reactions_dat_nodes(rxn_ids, rxn_dat)

Parse entries from the reaction attribute-value file, and prepare nodes to add to the graph database in one transaction.

Parameters:

  • rxn_ids (Iterable[str], required): Reaction full metaIds.
  • rxn_dat (dict[str, list[list[str]]], required): Output of self.read_dat_file for the reaction.dat file.

Returns:

  • list[dict[str, Any]]: A list of dictionaries, each of which contains the information for one Reaction node.

Source code in parser/metacyc.py
def collect_reactions_dat_nodes(
    self, rxn_ids: Iterable[str], rxn_dat: dict[str, list[list[str]]]
) -> list[dict[str, Any]]:
    """Parse entries from the reaction attribute-value file, and prepare nodes to add to the
    graph database in one transaction.

    Args:
        rxn_ids: Reaction *full* `metaId`s.
        rxn_dat: Output of `self.read_dat_file` for the `reaction.dat` file.

    Returns:
        A list of dictionaries, each of which contains the information for one
        `Reaction` node.
    """
    nodes: list[dict[str, Any]] = []
    for rxn_id in rxn_ids:
        canonical_id = self._find_rxn_canonical_id(rxn_id, rxn_dat.keys())
        if canonical_id not in rxn_dat:
            self.missing_ids["reactions"].add(canonical_id)
            continue

        node = {"name": rxn_id, "props": {"canonicalId": canonical_id}}
        lines = rxn_dat[canonical_id]
        node = self._dat_entry_to_node(
            node,
            lines,
            props_str_keys={
                "GIBBS-0",
                "STD-REDUCTION-POTENTIAL",
                "REACTION-DIRECTION",
                "REACTION-BALANCE-STATUS",
                "SYSTEMATIC-NAME",
                "COMMENT",  # TODO: link to other nodes
            },
            props_list_keys={"SYNONYMS", "TYPES"},
            node_list_keys={"IN-PATHWAY", "CITATIONS", "RXN-LOCATIONS"},
            prop_num_keys={"GIBBS-0", "STD-REDUCTION-POTENTIAL"},
            prop_enum_keys={"REACTION-BALANCE-STATUS", "REACTION-DIRECTION"},
        )

        nodes.append(node)

    return nodes
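
The rxn_dat argument has the type dict[str, list[list[str]]]. As a rough illustration of how such a structure could be produced from an attribute-value file, here is a minimal reader sketch. It assumes the usual flat-file conventions ("ATTRIBUTE - value" lines, "//" between records, "#" comments, "/" continuation lines) and that each inner list is an [attribute, value] pair with the UNIQUE-ID used as the dictionary key. The function read_dat_text and the sample records are hypothetical and are not the library's read_dat_file.

```python
def read_dat_text(text: str) -> dict[str, list[list[str]]]:
    """Parse attribute-value records into {unique_id: [[attribute, value], ...]}."""
    entries: dict[str, list[list[str]]] = {}
    current: list[list[str]] = []
    unique_id = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "//":
            # End of record: store it under its UNIQUE-ID and start over.
            if unique_id is not None:
                entries[unique_id] = current
            current, unique_id = [], None
            continue
        if line.startswith("/") and current:
            # Continuation of the previous attribute's value.
            current[-1][1] += "\n" + line[1:]
            continue
        key, sep, value = line.partition(" - ")
        if not sep:
            continue  # not an attribute line
        if key == "UNIQUE-ID":
            unique_id = value
            continue
        current.append([key, value])
    return entries


# Illustrative records only, not real MetaCyc data.
SAMPLE = """\
# comment line
UNIQUE-ID - RXN-EXAMPLE-1
TYPES - Chemical-Reactions
GIBBS-0 - -4.2
REACTION-DIRECTION - LEFT-TO-RIGHT
//
UNIQUE-ID - RXN-EXAMPLE-2
COMMENT - First line of a comment
/continued on a second line
//
"""

entries = read_dat_text(SAMPLE)
# Numeric attributes such as GIBBS-0 arrive as strings and need converting.
gibbs = float(dict(entries["RXN-EXAMPLE-1"])["GIBBS-0"])
print(entries)
print(gibbs)
```

Splitting on the first " - " only keeps negative values such as "-4.2" intact.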

fix_pathway_nodes(pw_nodes, all_rxns)

Some fields in the list of Pathway nodes require preprocessing before being fed into the database. Specifically:

  • predecessors contains a list of reaction IDs wrapped in parentheses. We need to extract the first ID as the target reaction, and take all the others as preceding events of the first one.
  • reactionLayout tells us the primary reactants and products of the reactions in a given pathway.
  • pathwayLinks links the pathway to other pathways through intermediate Compounds.

While parsing these fields, we don't add any new Reaction nodes. These are often hypothetical reactions and not part of the SBML file.
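
The predecessors rule above (first ID is the target, the rest are its preceding events, unknown reactions are not added) can be sketched as follows. This is a minimal sketch under assumptions: the function name parse_predecessors and the output keys "reaction" and "predecessors" are hypothetical, and the actual return shape of _parse_pathway_predecessors is not shown in this page.

```python
from typing import Optional


def parse_predecessors(value: str, all_rxns: set[str]) -> Optional[dict]:
    """Split '(TARGET PRED1 PRED2 ...)' into a target reaction and its predecessors."""
    # Remove the wrapping parentheses, then any quotes around individual IDs.
    ids = [tok.strip('"') for tok in value.strip().strip("()").split()]
    if not ids or ids[0] not in all_rxns:
        return None  # unknown target: do not introduce a new Reaction node
    return {
        "reaction": ids[0],
        "predecessors": [rxn for rxn in ids[1:] if rxn in all_rxns],
    }


all_rxns = {"RXN-A", "RXN-B"}  # made-up reaction names
print(parse_predecessors("(RXN-B RXN-A)", all_rxns))
print(parse_predecessors('("RXN-B" "RXN-X")', all_rxns))
print(parse_predecessors("(RXN-X RXN-A)", all_rxns))
```

Filtering against all_rxns is what keeps hypothetical reactions out of the graph.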

Parameters:

  • pw_nodes (list[dict[str, Any]], required): Output of _collect_pathways_dat_nodes.
  • all_rxns (set[str], required): All valid reaction names.

Source code in parser/metacyc.py
def fix_pathway_nodes(
    self, pw_nodes: list[dict[str, Any]], all_rxns: set[str]
) -> list[dict[str, Any]]:
    """Some fields in the list of ``Pathway`` nodes require preprocessing before being fed into
    the database. Specifically:

    * ``predecessors`` contains a list of reaction IDs wrapped in
     parentheses. We need to extract the first ID as the target reaction,
     and take all the others as preceding events of the first one.
    * ``reactionLayout`` tells us the primary reactants and products of
     the reactions in a given pathway.
    * ``pathwayLinks`` links the pathway to other pathways through
     intermediate ``Compound``s.

    While parsing these fields, we don't add any new ``Reaction`` nodes.
    These are often hypothetical reactions and not part of the SBML file.

    Args:
        pw_nodes: Output of :meth:`_collect_pathways_dat_nodes`.
        all_rxns: All valid reaction names.
    """
    for i, n in enumerate(pw_nodes):
        if predecessors := n.get("predecessors"):
            predecessors: list[str]
            new_pred = []
            for v in predecessors:
                if pred := self._parse_pathway_predecessors(v, all_rxns):
                    new_pred.append(pred)

            pw_nodes[i]["predecessors"] = new_pred

        if rxn_layout := n.get("reactionLayout"):
            rxn_layout: list[str]
            rxn_prim_cpds = []
            for v in rxn_layout:
                rxn_id, d = self._parse_reaction_layout(v)
                if rxn_id:
                    rxn_prim_cpds.append({"reaction": rxn_id, **d})

            pw_nodes[i]["reactionLayout"] = rxn_prim_cpds

        if pw_links := n.get("pathwayLinks"):
            pw_links: list[str]
            conn_pathways = []
            for v in pw_links:
                cpd_id, pw_ids, direction = self._parse_pathway_links(v)
                if pw_ids:
                    conn_pathways.append(
                        {"cpd": cpd_id, "pathways": pw_ids, "direction": direction}
                    )

            pw_nodes[i]["pathwayLinks"] = conn_pathways

    return pw_nodes
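
For the reactionLayout field, the method expects a helper that returns a reaction ID plus a dictionary describing its primary compounds. The sketch below shows one way such a value could be parsed, assuming the field looks like "(RXN (:LEFT-PRIMARIES ...) (:DIRECTION :L2R) (:RIGHT-PRIMARIES ...))". That layout, the function name parse_reaction_layout, the output keys, and the sample IDs are assumptions for illustration, not the library's _parse_reaction_layout.

```python
import re

# Assumed shape: (RXN (:LEFT-PRIMARIES a b) (:DIRECTION :L2R) (:RIGHT-PRIMARIES c))
_LAYOUT_RE = re.compile(
    r"\(\s*(?P<rxn>\S+)\s+"
    r"\(:LEFT-PRIMARIES(?P<left>[^)]*)\)\s*"
    r"\(:DIRECTION\s+:(?P<direction>\w+)\)\s*"
    r"\(:RIGHT-PRIMARIES(?P<right>[^)]*)\)\s*\)"
)


def parse_reaction_layout(value: str) -> tuple:
    """Return (reaction_id, details), or (None, {}) if the value does not match."""
    match = _LAYOUT_RE.match(value.strip())
    if match is None:
        return None, {}
    details = {
        "leftPrimaries": match["left"].split(),
        "direction": match["direction"],
        "rightPrimaries": match["right"].split(),
    }
    return match["rxn"], details


rxn_id, d = parse_reaction_layout(
    "(RXN-1 (:LEFT-PRIMARIES CPD-A CPD-B) (:DIRECTION :L2R) (:RIGHT-PRIMARIES CPD-C))"
)
# Mirrors how fix_pathway_nodes merges the two return values into one entry.
entry = {"reaction": rxn_id, **d}
print(entry)
print(parse_reaction_layout("not a layout"))
```

Returning (None, {}) for unparseable input fits the "if rxn_id:" check in the method, which silently skips such values.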