metacyc

`MetacycParser(sbml, reactions=None, atom_mapping=None, pathways=None, compounds=None, publications=None, classes=None)`

Bases: SBMLParser

Converts MetaCyc files to a Neo4j database. Documentation on the MetaCyc files and format FAQs can be found at:

MetaCyc data files download: https://metacyc.org/downloads.shtml
MetaCyc file formats: https://bioinformatics.ai.sri.com/ptools/flatfile-format.html
SBML FAQ: https://synonym.caltech.edu/documents/faq

See `sbml.SBMLParser` for more information on SBML parsing.

Parameters:

Name	Type	Description	Default
`sbml`	`Union[str, Path]`	The path to the SBML file.	required
`reactions`	`Optional[Union[str, Path]]`	The path to the `reaction.dat` file. If given, the file will be parsed and extra annotation on `Reaction` nodes will be added.	`None`
`atom_mapping`	`Optional[Union[str, Path]]`	The path to the `atom-mappings-smiles.dat` file. If given, the file will be parsed and chemical reactions in the `SMILES` format will be added to the `Reaction` nodes.	`None`
`pathways`	`Optional[Union[str, Path]]`	The path to the `pathway.dat` file. If given, the file will be parsed and pathway links will be added to the `Reaction` nodes.	`None`
`compounds`	`Optional[Union[str, Path]]`	The path to the `compound.dat` file. If given, the file will be parsed and annotations on `Compound` nodes will be added.	`None`
`publications`	`Optional[Union[str, Path]]`	The path to the `publication.dat` file. If given, the file will be parsed and annotations on `Citation` nodes will be added.	`None`
`classes`	`Optional[Union[str, Path]]`	The path to the `class.dat` file. If given, the file will be parsed and annotations on `Compartment`, `Taxa`, and `Compound` nodes will be added.	`None`

Attributes:

Name	Type	Description
`sbml_file`		Filepath to the input SBML file.
`input_files`		A dictionary of the paths to the input `.dat` files.
`missing_ids`	`dict[str, set[str]]`	A dictionary of sets of IDs that were not found in the input files. This is helpful for collecting IDs that appear to be in one class but are actually in another. `_report_missing_ids` can be used to print them out.
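
As a rough illustration of how the `missing_ids` dictionary might be summarized, the sketch below formats one line per non-empty category. The helper name `format_missing_ids` is hypothetical; it is not the package's `_report_missing_ids`, whose implementation is not shown on this page.

```python
def format_missing_ids(missing_ids: dict[str, set[str]]) -> list[str]:
    """Build one summary line per category that has missing IDs.

    Hypothetical helper for illustration only; not part of the parser.
    """
    lines = []
    for category, ids in missing_ids.items():
        if not ids:
            # Skip categories where every ID was found.
            continue
        # Sort for a stable, readable listing.
        lines.append(f"{category}: {len(ids)} missing ({', '.join(sorted(ids))})")
    return lines


if __name__ == "__main__":
    example = {"reactions": {"RXN-2", "RXN-1"}, "pathways": set()}
    for line in format_missing_ids(example):
        print(line)
```

Keeping one set per category means the same ID is recorded once even if several nodes refer to it.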

Source code in parser/metacyc.py

def __init__(
    self,
    sbml: Union[str, Path],
    reactions: Optional[Union[str, Path]] = None,
    atom_mapping: Optional[Union[str, Path]] = None,
    pathways: Optional[Union[str, Path]] = None,
    compounds: Optional[Union[str, Path]] = None,
    publications: Optional[Union[str, Path]] = None,
    classes: Optional[Union[str, Path]] = None,
):
    # Neo4j driver and SBML file path
    super().__init__(sbml)

    # File paths
    self.input_files = {
        "reactions": validate_path(reactions),
        "atom_mapping": validate_path(atom_mapping),
        "pathways": validate_path(pathways),
        "compounds": validate_path(compounds),
        "publications": validate_path(publications),
        "classes": validate_path(classes),
    }
    logger.info(f"Input files: {self.input_files}")

    # Placeholder for missing IDs in the dat files
    self.missing_ids: dict[str, set[str]] = {
        "reactions": set(),
        "atom_mappings": set(),
        "pathways": set(),
        "compounds": set(),
        "publications": set(),
        "compartments": set(),
        "taxon": set(),
    }

`collect_atom_mapping_dat_nodes(rxn_ids, smiles)`

Prepare `Reaction` node annotations holding the atom-mapped SMILES from the atom mapping file.

Parameters:

Name	Type	Description	Default
`rxn_ids`	`Iterable[str]`	The reaction names from the graph database.	required
`smiles`	`dict[str, str]`	The reaction ID -> SMILES dictionary from the atom mapping file.	required

Source code in parser/metacyc.py

def collect_atom_mapping_dat_nodes(
    self, rxn_ids: Iterable[str], smiles: dict[str, str]
) -> list[dict[str, Union[str, dict[str, str]]]]:
    """
    Args:
        rxn_ids: The reaction names from the graph database.
        smiles: The reaction ID -> SMILES dictionary from the atom
            mapping file.
    """
    nodes = []
    for rxn_id in tqdm(rxn_ids, desc="atom_mapping.dat file"):
        canonical_id = self._find_rxn_canonical_id(rxn_id, smiles.keys())
        if canonical_id not in smiles:
            self.missing_ids["atom_mappings"].add(canonical_id)
            continue
        node = {
            "name": rxn_id,
            "props": {"smilesAtomMapping": smiles[canonical_id]},
        }
        nodes.append(node)

    return nodes

`collect_citation_dat_nodes(cit_ids, pub_dat)`

Annotate a citation node with data from the publication.dat file.

If there are multiple fields in the given cit_id, then the fields are separated by colons. The first field is the citation ID, the second is the evidence type (in classes.dat), the third is not documented, and the fourth is the curator's name.

In most cases the citation ID should match `PUB-[A-Z0-9]+$`, with a few exceptions containing double dashes, e.g. `PUB--8`, or some dashes within author names, e.g. `PUB-CHIH-CHING95`.
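
The normalization applied to each `cit_id` before the lookup can be sketched as a standalone function. It mirrors the regular expressions in this method's source; the function name `normalize_citation_id` is an assumption made for illustration and is not part of the parser's API.

```python
import re


def normalize_citation_id(cit_id: str) -> str:
    """Return the publication key for a citation ID, or "" if nothing is left.

    Illustrative stand-in for the inline logic of collect_citation_dat_nodes.
    """
    # Keep only the first colon-separated field (the citation ID itself).
    first_field = cit_id.split(":")[0].upper()
    # Drop brackets, whitespace, commas, and single quotes.
    pub_id = re.sub(r"[\[\]\s,']", "", first_field)
    # Remove a dash that directly precedes digits, e.g. "SMITH-95" -> "SMITH95".
    pub_id = re.sub(r"-(\d+)", r"\1", pub_id)
    # The one known ID whose inner dash is also removed.
    if pub_id == "BOREJSZA-WYSOCKI94":
        pub_id = "BOREJSZAWYSOCKI94"
    return f"PUB-{pub_id}" if pub_id else ""


if __name__ == "__main__":
    print(normalize_citation_id("10066805:EV-EXP-IDA:3354997111:hartmut"))
    print(normalize_citation_id("[Chih-Ching95]"))
```

Only the first field is used here; the evidence type and curator fields are discarded, which matches the TODO about evidence frames in the source.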

Parameters:

Name	Type	Description	Default
`cit_ids`	`Iterable[str]`	The citation `metaId` properties.	required
`pub_dat`	`dict[str, list[list[str]]]`	The publication.dat data.	required

Source code in parser/metacyc.py

def collect_citation_dat_nodes(
    self, cit_ids: Iterable[str], pub_dat: dict[str, list[list[str]]]
):
    """Annotate a citation node with data from the publication.dat file.

    If there are multiple fields in the given ``cit_id``, then the fields are
    separated by colons. The first field is the citation ID, the second is the
    evidence type (in classes.dat), the third is not documented, and the
    fourth is the curator's name.

    In most cases the citation ID should match: ``PUB-[A-Z0-9]+$``,
    with a few exceptions containing double dashes, e.g. ``PUB--8``,
    or some dashes within author names, e.g. ``PUB-CHIH-CHING95``.

    Args:
        cit_ids: The citation ``metaId`` properties.
        pub_dat: The publication.dat data.
    """
    # TODO: deal with evidence frames. Evidence frames are in the form of
    # 10066805:EV-EXP-IDA:3354997111:hartmut
    nodes = []
    for cit_id in tqdm(cit_ids, desc="pubs.dat file"):
        pub_dat_id = re.sub(r"[\[\]\s,']", "", cit_id.split(":")[0].upper())
        pub_dat_id = re.sub(r"-(\d+)", r"\1", pub_dat_id)
        if pub_dat_id == "BOREJSZA-WYSOCKI94":
            pub_dat_id = "BOREJSZAWYSOCKI94"  # only exception with dash removed
        if not pub_dat_id:
            continue
        pub_dat_id = "PUB-" + pub_dat_id
        if pub_dat_id not in pub_dat:
            self.missing_ids["publications"].add(pub_dat_id)
            continue

        lines = pub_dat[pub_dat_id]
        node = {"metaId": cit_id, "props": {"citationId": pub_dat_id}}
        node = self._dat_entry_to_node(
            node,
            lines,
            props_str_keys={
                "DOI-ID",
                "PUBMED-ID",
                "MEDLINE-ID",
                "TITLE",
                "SOURCE",
                "YEAR",
                "URL",
                "REFERENT-FRAME",
            },
            props_list_keys={"AUTHORS"},
        )
        nodes.append(node)

    return nodes

`collect_reactions_dat_nodes(rxn_ids, rxn_dat)`

Parse entries from the reaction attribute-value file, and prepare nodes to add to the graph database in one transaction.

Parameters:

Name	Type	Description	Default
`rxn_ids`	`Iterable[str]`	Reaction full `metaId`s.	required
`rxn_dat`	`dict[str, list[list[str]]]`	Output of self.read_dat_file for the `reaction.dat` file.	required

Returns:

Type	Description
`list[dict[str, Any]]`	A list of dictionaries, each of which contains the information for annotating one `Reaction` node.

Source code in parser/metacyc.py

def collect_reactions_dat_nodes(
    self, rxn_ids: Iterable[str], rxn_dat: dict[str, list[list[str]]]
) -> list[dict[str, Any]]:
    """Parse entries from the reaction attribute-value file, and prepare nodes to add to the
    graph database in one transaction.

    Args:
        rxn_ids: Reaction *full* `metaId`s.
        rxn_dat: Output of [self.read_dat_file]() for the `reaction.dat` file.

    Returns:
        A list of dictionaries, each of which contains the information for
        annotating one ``Reaction`` node.
    """
    nodes: list[dict[str, Any]] = []
    for rxn_id in rxn_ids:
        canonical_id = self._find_rxn_canonical_id(rxn_id, rxn_dat.keys())
        if canonical_id not in rxn_dat:
            self.missing_ids["reactions"].add(canonical_id)
            continue

        node = {"name": rxn_id, "props": {"canonicalId": canonical_id}}
        lines = rxn_dat[canonical_id]
        node = self._dat_entry_to_node(
            node,
            lines,
            props_str_keys={
                "GIBBS-0",
                "STD-REDUCTION-POTENTIAL",
                "REACTION-DIRECTION",
                "REACTION-BALANCE-STATUS",
                "SYSTEMATIC-NAME",
                "COMMENT",  # TODO: link to other nodes
            },
            props_list_keys={"SYNONYMS", "TYPES"},
            node_list_keys={"IN-PATHWAY", "CITATIONS", "RXN-LOCATIONS"},
            prop_num_keys={"GIBBS-0", "STD-REDUCTION-POTENTIAL"},
            prop_enum_keys={"REACTION-BALANCE-STATUS", "REACTION-DIRECTION"},
        )

        nodes.append(node)

    return nodes

`fix_pathway_nodes(pw_nodes, all_rxns)`

Some fields in the list of Pathway nodes require preprocessing before being fed into the database. Specifically:

* `predecessors` contains a list of reaction IDs wrapped in parentheses. We need to extract the first ID as the target reaction, and take all the others as preceding events of the first one.
* `reactionLayout` tells us the primary reactants and products of the reactions in a given pathway.
* `pathwayLinks` links the pathway to other pathways through intermediate `Compound`s.

While parsing these fields, we don't add any new Reaction nodes. These are often hypothetical reactions and not part of the SBML file.
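
A minimal sketch of the `predecessors` step, assuming each raw entry looks like `(RXN-2 RXN-1)`. The function name `parse_predecessors` and the returned keys `reaction` and `precededBy` are hypothetical; the real `_parse_pathway_predecessors` helper is not shown on this page and may behave differently. Reactions missing from `all_rxns` are dropped rather than created, in line with the note above.

```python
from typing import Any, Optional


def parse_predecessors(entry: str, all_rxns: set[str]) -> Optional[dict[str, Any]]:
    """Split one parenthesized entry into a target reaction and its predecessors.

    Hypothetical illustration; the output keys are assumptions.
    """
    # Strip the wrapping parentheses and any double quotes, then split on whitespace.
    ids = entry.strip().strip("()").replace('"', "").split()
    if not ids or ids[0] not in all_rxns:
        # Unknown target reaction: skip instead of adding a new Reaction node.
        return None
    target, *preceding = ids
    # Keep only predecessors that are known reactions.
    return {"reaction": target, "precededBy": [r for r in preceding if r in all_rxns]}


if __name__ == "__main__":
    known = {"RXN-1", "RXN-2"}
    print(parse_predecessors("(RXN-2 RXN-1)", known))
```

Returning `None` for an unknown target lets the caller filter entries with a simple truthiness check, similar to the walrus pattern used in the source of this method.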

Parameters:

Name	Type	Description	Default
`pw_nodes`	`list[dict[str, Any]]`	Output of `_collect_pathways_dat_nodes`.	required
`all_rxns`	`set[str]`	All valid reaction names.	required

Source code in parser/metacyc.py

def fix_pathway_nodes(self, pw_nodes: list[dict[str, Any]], all_rxns: set[str]):
    """Some fields in the list of ``Pathway`` nodes require preprocessing before being fed into
    the database. Specifically:

    * ``predecessors`` contains a list of reaction IDs wrapped in
     parentheses. We need to extract the first ID as the target reaction,
     and take all the others as preceding events of the first one.
    * ``reactionLayout`` tells us the primary reactants and products of
     the reactions in a given pathway.
    * ``pathwayLinks`` links the pathway to other pathways through
     intermediate ``Compound``s.

    While parsing these fields, we don't add any new ``Reaction`` nodes.
    These are often hypothetical reactions and not part of the SBML file.

    Args:
        pw_nodes: Output of :meth:`_collect_pathways_dat_nodes`.
        all_rxns: All valid reaction names.
    """
    for i, n in enumerate(pw_nodes):
        if predecessors := n.get("predecessors"):
            predecessors: list[str]
            new_pred = []
            for v in predecessors:
                if pred := self._parse_pathway_predecessors(v, all_rxns):
                    new_pred.append(pred)

            pw_nodes[i]["predecessors"] = new_pred

        if rxn_layout := n.get("reactionLayout"):
            rxn_layout: list[str]
            rxn_prim_cpds = []
            for v in rxn_layout:
                rxn_id, d = self._parse_reaction_layout(v)
                if rxn_id:
                    rxn_prim_cpds.append({"reaction": rxn_id, **d})

            pw_nodes[i]["reactionLayout"] = rxn_prim_cpds

        if pw_links := n.get("pathwayLinks"):
            pw_links: list[str]
            conn_pathways = []
            for v in pw_links:
                cpd_id, pw_ids, direction = self._parse_pathway_links(v)
                if pw_ids:
                    conn_pathways.append(
                        {"cpd": cpd_id, "pathways": pw_ids, "direction": direction}
                    )

            pw_nodes[i]["pathwayLinks"] = conn_pathways

    return pw_nodes