Ingest for Micronutrient Information Center using OntoGPT
The Micronutrient Information Center (MIC) ingest differs from our other standard modular ingests in that the data source is not a set of simple flat files downloaded from an authority. Instead, the information for the MIC ingest comes from the [Micronutrient Information Center](https://lpi.oregonstate.edu/mic) section of the [Linus Pauling Institute](https://lpi.oregonstate.edu/) website. We intend to use OntoGPT to scrape the content of the site and assemble it into rows of data. We then hope to use our existing Koza system and GitHub release infrastructure to import this resource as nodes and edges into the Monarch KG.
As mentioned above, source files for this ingest will be generated by scraping with OntoGPT rather than downloaded directly from the source. OntoGPT will use the site map to crawl through all of the MIC pages and generate output with rows of data.
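To illustrate the site-map step, here is a minimal sketch that lists the MIC page URLs found in a sitemap. It assumes the sitemap follows the standard sitemaps.org XML format; the function name `mic_urls_from_sitemap` and the sample document are hypothetical, and this is not a description of how OntoGPT itself performs the crawl.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed to apply to this site).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# URL prefix of the Micronutrient Information Center section.
MIC_PREFIX = "https://lpi.oregonstate.edu/mic"


def mic_urls_from_sitemap(sitemap_xml, prefix=MIC_PREFIX):
    """Return the sitemap URLs that fall under the MIC section."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls


# Hypothetical sample for illustration only; not the real sitemap contents.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://lpi.oregonstate.edu/mic/example-page</loc></url>
  <url><loc>https://lpi.oregonstate.edu/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in mic_urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```

Filtering by URL prefix keeps the crawl restricted to MIC pages even if the sitemap covers the whole institute website.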
As of its latest version (v1.0.10), OntoGPT includes a template for extracting the following relation types from MIC pages:
- Nutrient to disease
- Nutrient to phenotype
- Nutrient to biological process
- Nutrient to health status of a body part or system (like "calcium supports healthy bones")
- Nutrient to food source
- Nutrient to nutrient
A couple of caveats:
- I haven't set the relations to ground to RO or Biolink types. This will require some discussion to identify appropriate mappings.
- References for each claim are extracted, though only as a list of their numerical identifiers in the page's reference list. Other approaches introduced too much hallucination, and/or the LLM simply refused to parse more than a fraction of the reference list. This could be solved with some minimal scraping.
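As a rough sketch of how one extracted claim could be represented before it is written out as a row, the following uses a dataclass that carries the relation type, an ungrounded predicate, and the numerical reference identifiers. The class, field, and function names are hypothetical and do not reflect the actual OntoGPT template schema; the reference texts are placeholders standing in for a scraped reference list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """The six relation types listed above (names are illustrative)."""
    NUTRIENT_TO_DISEASE = "nutrient_to_disease"
    NUTRIENT_TO_PHENOTYPE = "nutrient_to_phenotype"
    NUTRIENT_TO_BIOLOGICAL_PROCESS = "nutrient_to_biological_process"
    NUTRIENT_TO_HEALTH_STATUS = "nutrient_to_health_status"
    NUTRIENT_TO_FOOD_SOURCE = "nutrient_to_food_source"
    NUTRIENT_TO_NUTRIENT = "nutrient_to_nutrient"


@dataclass
class ExtractedRelation:
    nutrient: str
    relation_type: RelationType
    target: str
    # Positions in the page's reference list, as extracted by the LLM.
    reference_ids: list[int] = field(default_factory=list)
    # Left as None until a mapping to RO or Biolink is agreed on.
    predicate: Optional[str] = None


def resolve_references(relation, reference_list):
    """Map numeric reference ids to reference text scraped from the page.

    Ids that are missing from the scraped list are skipped, not guessed.
    """
    return [reference_list[i] for i in relation.reference_ids if i in reference_list]


if __name__ == "__main__":
    claim = ExtractedRelation(
        nutrient="calcium",
        relation_type=RelationType.NUTRIENT_TO_HEALTH_STATUS,
        target="healthy bones",
        reference_ids=[2, 5],
    )
    scraped = {2: "placeholder reference text A", 3: "placeholder reference text B"}
    print(resolve_references(claim, scraped))
```

Keeping the reference identifiers as integers and resolving them against a separately scraped list avoids asking the LLM to reproduce reference text.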
Use this section to describe the nodes and edges generated from the ingest, for instance:
- Gene Nodes - Description of which nodes are created and what data may be excluded from the ingest.
- Gene → Disease - Similar description of the edges and which edges are created or how the data may be filtered.
Metadata for the ingest is in the `metadata.yaml` file and may require some adjustment depending on your configuration. Data files and locations are listed in the `download.yaml` file, which is used to download all of the data sources before the transform. The `transform.yaml` file and the Python file `transform.py` contain the configuration and transformation code, respectively.
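The core of a transform is a mapping from one input row to nodes and an edge. The sketch below shows that mapping in plain Python, independent of Koza. The column names, the `EXAMPLE:` identifiers, and the function name are assumptions made for illustration, and `biolink:related_to` is only a placeholder predicate because the relations have not yet been grounded. In the actual `transform.py`, a function like this would be called from the Koza row-processing code.

```python
import uuid

# Placeholder predicate: the relations are not yet grounded to RO or Biolink,
# so this value is an assumption for illustration only.
PLACEHOLDER_PREDICATE = "biolink:related_to"


def row_to_entities(row):
    """Turn one scraped row into two node dicts and one edge dict."""
    subject = {
        "id": row["nutrient_id"],
        "name": row["nutrient_label"],
        "category": ["biolink:ChemicalEntity"],
    }
    target = {
        "id": row["object_id"],
        "name": row["object_label"],
        "category": [row["object_category"]],
    }
    edge = {
        "id": f"uuid:{uuid.uuid4()}",
        "subject": subject["id"],
        "predicate": PLACEHOLDER_PREDICATE,
        "object": target["id"],
    }
    return [subject, target], edge


# Hypothetical row; the real column names depend on the OntoGPT output.
EXAMPLE_ROW = {
    "nutrient_id": "EXAMPLE:calcium",
    "nutrient_label": "calcium",
    "object_id": "EXAMPLE:healthy_bones",
    "object_label": "healthy bones",
    "object_category": "biolink:NamedThing",
}

if __name__ == "__main__":
    nodes, edge = row_to_entities(EXAMPLE_ROW)
    print(nodes)
    print(edge)
```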
For more information, see the Koza documentation and kghub-downloader.
Dependencies are listed in the `pyproject.toml` file. This project uses pytest for development testing; the tests are located in the `tests` directory and check the functionality of your transform.
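As an example of the kind of test that could live in the `tests` directory, here is a minimal pytest-style sketch. The helper `parse_reference_ids` is hypothetical (it is not part of this project) and is defined inline so the example is self-contained. Pytest discovers functions whose names start with `test_` and treats a failed `assert` as a test failure.

```python
def parse_reference_ids(raw):
    """Parse a string such as "1, 4, 12" into a sorted list of unique ints.

    Hypothetical helper: accepts commas or semicolons as separators.
    """
    tokens = raw.replace(";", ",").split(",")
    ids = {int(token) for token in tokens if token.strip()}
    return sorted(ids)


def test_parse_reference_ids_handles_spacing_and_duplicates():
    assert parse_reference_ids("4, 1,12 , 4") == [1, 4, 12]


def test_parse_reference_ids_empty_string():
    assert parse_reference_ids("") == []
```

If saved as a file such as `tests/test_reference_ids.py` (a hypothetical name), these tests would be picked up by `make test`.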
The documentation for this ingest is in this `README.md` file, and additional documentation is in the `docs` directory.
Note: After the GitHub Actions workflow for deploying documentation runs, the documentation is automatically deployed to GitHub Pages.
This project is set up with several GitHub Actions workflows.
You should not need to modify these workflows unless you want to change the behavior.
The workflows are located in the `.github/workflows` directory:
- `test.yaml`: Run the pytest suite.
- `create-release.yaml`: Create a new release once a week, or manually.
- `deploy-docs.yaml`: Deploy the documentation to GitHub Pages (on pushes to main).
- `update-docs.yaml`: After a release, update the documentation with node/edge reports.
cd mic-ingest
make install
# or
poetry install
Note that the `make install` command is just a convenience wrapper around `poetry install`.
Once installed, you can check that everything is working as expected:
# Run the pytest suite
make test
# Download the data and run the Koza transform
make download
make run
This project is set up with a Makefile for common tasks.
To see available options:
make help
Download the data for the mic_ingest transform:
poetry run mic_ingest download
To run the Koza transform for mic-ingest:
poetry run mic_ingest transform
To see available options:
poetry run mic_ingest download --help
# or
poetry run mic_ingest transform --help
To run the test suite:
make test
This project was generated using monarch-initiative/cookiecutter-monarch-ingest.
Keep this project up to date using cruft by occasionally running `cruft update` in the project directory. For more information, see the cruft documentation.