brenda

Reading and parsing the BRENDA text file.

Much of the code is translated from the brendaDb R package. The BRENDA text file is organised by EC number: the information on each EC number is given compactly in its own part of the file. Each line begins with a two- or three-letter acronym that describes its contents, and the contents start after a TAB. Whitespace at the beginning of a line indicates a continuation of the previous line.
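The line layout described above can be sketched in a few lines of plain Python. This is only an illustration, not the Lark-based parser documented below: the name `iter_fields` and the sample lines are hypothetical, and the sketch assumes that any line starting with a space or TAB is a continuation line.

```python
from typing import Iterator


def iter_fields(lines: list[str]) -> Iterator[tuple[str, str]]:
    """Yield (acronym, text) pairs, folding continuation lines into the text."""
    acronym = None
    parts: list[str] = []
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no content
        if line[0] in " \t":
            # Leading whitespace: continuation of the previous entry
            if acronym is not None:
                parts.append(line.strip())
            continue
        head, sep, rest = line.partition("\t")
        if not sep:
            continue  # no TAB on the line; ignored in this sketch
        if acronym is not None:
            yield acronym, " ".join(parts)
        acronym, parts = head, [rest.strip()]
    if acronym is not None:
        yield acronym, " ".join(parts)


# Made-up lines in the shape described above (not real BRENDA data)
sample = [
    "ID\t1.1.1.1",
    "RN\talcohol dehydrogenase",
    "SN\talcohol:NAD+",
    "\toxidoreductase",
]
records = list(iter_fields(sample))
```

Here the last two sample lines are folded into a single `SN` entry, because the final line starts with whitespace.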

brendaDb R package: https://bioconductor.org/packages/release/bioc/html/brendaDb.html
BRENDA text file: https://www.brenda-enzymes.org/download_brenda_without_registration.php

The contents are organised in ~40 information fields as given below. Protein information is included in #...#, literature citations are in <...>, commentaries in (...) and field-special information in {...}.

It's not officially documented, but some fields also have commentaries wrapped in |...|. These are usually for reaction-related fields, so (...) would be for substrates and |...| for products.

Protein information is given as the combination organism/UniProt accession number where available. When this information is not given in the original paper, only the organism is given.
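As a rough illustration of these delimiters, the sketch below pulls the leading #...# protein IDs, the trailing <...> citations, and any {...} field-special information out of a single entry with regular expressions. The names `annotate` and `split_ids` and the example entry are hypothetical. Commentaries in (...) and |...| are deliberately left out: they can contain their own #...# and <...> markers, and nested parentheses, which simple regexes handle poorly and which a grammar-based parser is better suited for.

```python
import re

PROTEIN_RE = re.compile(r"^#([^#]*)#")        # protein IDs at the start of the entry
CITATION_RE = re.compile(r"<([^<>]*)>\s*$")   # citations at the end of the entry
SPECIAL_RE = re.compile(r"\{([^{}]*)\}")      # field-special information


def split_ids(raw: str) -> list[str]:
    """Split a comma- or whitespace-separated ID list into tokens."""
    return [tok for tok in re.split(r"[,\s]+", raw.strip()) if tok]


def annotate(entry: str) -> dict[str, list[str]]:
    """Extract protein IDs, citations and special info from one entry."""
    entry = entry.strip()
    proteins = PROTEIN_RE.search(entry)
    citations = CITATION_RE.search(entry)
    return {
        "proteins": split_ids(proteins.group(1)) if proteins else [],
        "citations": split_ids(citations.group(1)) if citations else [],
        "special": SPECIAL_RE.findall(entry),
    }


# Made-up entry in the documented shape (not real BRENDA data)
entry = "#1,2# ethanol + NAD+ = acetaldehyde + NADH {r} <3,4>"
result = annotate(entry)
```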

/// indicates the end of an EC-number specific part.
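A minimal sketch of using this terminator to cut the file into one chunk per EC number follows. The function name `split_ec_blocks` and the sample text are hypothetical, and the sketch assumes `///` sits alone on its line.

```python
def split_ec_blocks(text: str) -> list[str]:
    """Split raw text into one block per EC number, using '///' as terminator."""
    blocks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Keep the block only if it contains something besides blank lines
        if any(line.strip() for line in current):
            blocks.append("\n".join(current))
        current.clear()

    for line in text.splitlines():
        if line.strip() == "///":
            flush()
        else:
            current.append(line)
    flush()  # keep trailing content that lacks a terminator
    return blocks


# Made-up text in the documented shape (not real BRENDA data)
text = "ID\t1.1.1.1\nRN\tfirst\n///\nID\t1.1.1.2\nRN\tsecond\n///\n"
blocks = split_ec_blocks(text)
```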

`parse_brenda(filepath, cache=False, ec_nums=None)`

Parse the BRENDA text file into a dict.

This implementation focuses on extracting information from the text file, and feeding the data into a Neo4j database. The parser is implemented using Lark. A series of `lark.visitors.Transformer` classes are used to clean the data and convert it into the format required by Neo4j.

Parameters:

Name	Type	Description	Default
`filepath`	`Union[str, Path]`	The path to the BRENDA text file.	required
`cache`	`bool`	Whether to cache the parsed data to a JSON file next to `filepath`.	`False`
`ec_nums`	`Optional[Iterable[str]]`	The EC numbers to extract. If `None`, all EC numbers are parsed.	`None`

Returns:

Type	Description
`dict[str, Any]`	A dict with the `description` column from read_brenda transformed into lists of dicts stored as values, and EC numbers as keys.

Source code in parser/brenda.py

def parse_brenda(
    filepath: Union[str, Path],
    cache: bool = False,
    ec_nums: Optional[Iterable[str]] = None,
) -> dict[str, Any]:
    """Parse the BRENDA text file into a dict.

    This implementation focuses on extracting information from the text file,
    and feeding the data into a Neo4j database. The parser is implemented
    using Lark. A series of :class:`lark.visitors.Transformer` classes are used
    to clean the data and convert it into the format required by Neo4j.

    Args:
        filepath: The path to the BRENDA text file.
        cache: Whether to cache the parsed data to a JSON file.
        ec_nums: A list of EC numbers to extract.

    Returns:
        A dict with the `description` column from [read_brenda]() transformed into lists of dicts stored as values,
            and EC numbers as keys.
    """
    filepath = Path(filepath).expanduser().resolve()

    # Use the JSON cache if available
    cache_file = filepath.with_suffix(".json")
    if cache_file.exists() and cache:
        logger.debug(f"Loading cached BRENDA data from {cache_file}")
        with open(cache_file, "rb") as f:
            j: dict[str, Any] = orjson.loads(f.read())

        if ec_nums:
            logger.warning("Using cached data, some ec_nums may be missing")
            # A set gives O(1) lookups and is safe for one-shot iterables
            wanted = set(ec_nums)
            j = {k: v for k, v in j.items() if k in wanted}
        return j

    # Read text file into pandas DataFrame, where the last column contains
    # the text that is to be parsed into trees.
    logger.debug(f"Reading BRENDA text data from {filepath}")
    df = _read_brenda(filepath, cache=cache)
    if ec_nums:
        df = df[df.ID.isin(ec_nums)]

    # Get parsers for each unique field
    parsers: dict[str, Optional[Lark]] = {"TRANSFERRED_DELETED": None}
    for field in FIELDS.keys():
        parsers[field] = _get_parser_from_field(field)

    logger.info("Parsing BRENDA text data")
    df["description"] = df.apply(
        lambda row: _text_to_tree(row.description, parsers[row.field]),
        axis=1,
    )

    # Transform data frame into a dict
    logger.info("Transforming data into dictionary")
    j = {
        k: v.groupby("field").description.apply(lambda x: list(x)[0]).to_dict()
        for k, v in df.groupby("ID")
    }

    if cache:
        logger.debug(f"Saving parsed BRENDA data to cache {cache_file}")
        with open(cache_file, "wb") as f:
            f.write(orjson.dumps(j))

    return j
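
The cache-then-filter pattern used above can be tried in isolation with the sketch below. It is not the function's actual code: `load_or_parse` and `fake_parse` are hypothetical names, and the standard-library `json` module stands in for `orjson`. One deliberate difference is that the sketch always caches the full parse result and filters by `ec_nums` afterwards, whereas `parse_brenda` as listed writes whatever it parsed, which may be a subset.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterable, Optional


def load_or_parse(
    source: Path,
    parse: Callable[[Path], dict[str, Any]],
    cache: bool = False,
    ec_nums: Optional[Iterable[str]] = None,
) -> dict[str, Any]:
    """Return parsed data, reusing a JSON cache stored next to `source`."""
    cache_file = source.with_suffix(".json")
    wanted = set(ec_nums) if ec_nums is not None else None

    if cache and cache_file.exists():
        data = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        data = parse(source)
        if cache:
            # Cache the unfiltered result so later calls can ask for other IDs
            cache_file.write_text(json.dumps(data), encoding="utf-8")

    if wanted is not None:
        data = {k: v for k, v in data.items() if k in wanted}
    return data


calls: list[Path] = []


def fake_parse(path: Path) -> dict[str, Any]:
    """Stand-in parser returning made-up data and recording each call."""
    calls.append(path)
    return {"1.1.1.1": {"RN": "first"}, "1.1.1.2": {"RN": "second"}}


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "brenda.txt"
    source.write_text("placeholder", encoding="utf-8")
    first = load_or_parse(source, fake_parse, cache=True)
    second = load_or_parse(source, fake_parse, cache=True, ec_nums=["1.1.1.2"])
```

The second call is served from the cache file, so the stand-in parser runs only once.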