Development #121

Merged
merged 24 commits into main on Jan 17, 2025

Conversation

@clnsmth (Contributor) commented Jan 17, 2025

No description provided.

Add missing parameters to the `annotate_workbooks` function to ensure
correct argument propagation to its subfunctions.
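
A minimal sketch of the propagation pattern this fix describes; the parameter names (`local_model`, `temperature`) and the per-workbook subfunction are illustrative assumptions, not the package's actual API:

```python
from pathlib import Path

def annotate_workbook(path, output_dir, local_model=None, temperature=None):
    """Stand-in for the per-workbook subfunction (hypothetical signature)."""
    print(f"annotating {path} -> {output_dir} (model={local_model}, temp={temperature})")

def annotate_workbooks(workbook_dir, output_dir, local_model=None, temperature=None):
    """Annotate every workbook, forwarding all arguments downstream."""
    for path in sorted(Path(workbook_dir).glob("*.tsv")):
        # The fix: pass local_model and temperature through rather than letting
        # the subfunction silently fall back to its defaults.
        annotate_workbook(path, output_dir, local_model=local_model,
                          temperature=temperature)
```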
Implement a cache-clearing mechanism before each OntoGPT call to
mitigate issues where cached results, particularly those without
grounded concepts, could lead to processing errors. This ensures that
each call to OntoGPT is fresh and produces reliable results.
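
A sketch of the idea, assuming a subprocess-based call to the `ontogpt` CLI; the cache location is an assumption and depends on how OntoGPT/litellm caching is configured in a given environment:

```python
import shutil
import subprocess
from pathlib import Path

# Assumed cache location; the real path depends on the local OntoGPT/litellm setup.
CACHE_DIR = Path.home() / ".ontogpt_cache"

def extract_fresh(input_file, template):
    """Clear stale cached responses, then run a fresh OntoGPT extraction."""
    if CACHE_DIR.exists():
        shutil.rmtree(CACHE_DIR)  # drop cached results, including ungrounded ones
    return subprocess.run(
        ["ontogpt", "extract", "-t", template, "-i", input_file],
        capture_output=True, text=True, check=True,
    )
```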
Implement a strategy to combine multiple OntoGPT runs for each input to
improve the consistency and completeness of concept grounding. This
approach addresses the variability inherent in the OntoGPT process,
resulting in more reliable and accurate annotations.
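
A sketch of one way to combine runs, assuming the union is keyed on grounded CURIEs; `extract_fresh` is the helper sketched above, and `parse_grounded_concepts` is a hypothetical stand-in for whatever parsing the package applies to OntoGPT output:

```python
def combine_runs(input_file, template, n_runs=3):
    """Union grounded concepts from several OntoGPT runs over one input."""
    combined = {}
    for _ in range(n_runs):
        result = extract_fresh(input_file, template)
        for curie, label in parse_grounded_concepts(result.stdout):
            combined.setdefault(curie, label)  # first grounding per CURIE wins
    return combined
```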
Create a new module to facilitate benchmark testing, allowing for
performance evaluation and optimization.
Add logging capabilities to enhance debugging and runtime monitoring.
Add logging for performance metrics to enable in-depth analysis and
optimization.

- Create a context manager to log metrics of interest (runtime and
memory usage).
- Estimate tokens per LLM call using word count.
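
A minimal sketch of such a context manager using only the standard library; the 4/3 words-to-tokens ratio is a common rule of thumb, not necessarily the conversion used here:

```python
import logging
import time
import tracemalloc
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def monitor(name):
    """Log wall-clock runtime and peak memory for the enclosed block."""
    tracemalloc.start()
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        logger.info("%s: %.2fs, peak memory %.1f MiB", name, elapsed, peak / 2**20)

def estimate_tokens(text):
    """Rough token count from words; the 4/3 ratio is a rule of thumb."""
    return round(len(text.split()) * 4 / 3)
```

Usage is a single `with monitor("ontogpt extract"): ...` around each call of interest.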
Replace print statements with logging statements to enable more
structured and persistent output. This change provides flexibility for
capturing and analyzing runtime information.
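
For example, a `print` call becomes a leveled, timestamped log record:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

path = "workbook.tsv"  # illustrative
# Before: print(f"Annotating {path}")
logger.info("Annotating %s", path)  # filterable by level, redirectable to file
```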
Add functionality to collect and analyze benchmark data, including a 
dedicated test suite to evaluate this routine.

We have opted for a baseline comparison method to evaluate the 
performance of our algorithm across different parameterizations. This 
approach offers several advantages, including efficiency and 
interpretability. By directly comparing each parameterization to a 
fixed baseline, we can quickly assess its relative performance and 
identify the optimal configuration. While this method may not uncover 
subtle differences between parameterizations that are both better (or
both worse) than the baseline, it provides a practical and timely
solution for our specific goals.
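
Illustrated with made-up scores, the comparison reduces to a delta against the fixed baseline:

```python
def score_against_baseline(scores, baseline="baseline"):
    """Return each parameterization's gain over a fixed baseline score."""
    base = scores[baseline]
    return {name: s - base for name, s in scores.items() if name != baseline}

# Illustrative numbers only: positive deltas beat the baseline.
deltas = score_against_baseline(
    {"baseline": 0.62, "config_a": 0.71, "config_b": 0.58}
)
best = max(deltas, key=deltas.get)  # "config_a"
```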
Remove an extra space from the OntoGPT `extract` command construction to
prevent potential errors and ensure the command executes as expected.
Optimize OntoGPT calls by specifying the `ollama_chat` model within the
`extract` command, leveraging performance improvements recommended by
the `litellm` package.
Add a `temperature` parameter to OntoGPT calls, allowing users to
control the model's behavior and adjust the level of creativity or
randomness in the generated output.
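
A sketch of the resulting command construction; the model name and the exact flag spellings are assumptions to confirm against `ontogpt extract --help` for the installed version:

```python
import subprocess

def run_extract(input_file, template, model="ollama_chat/llama3", temperature=0.1):
    """Build and run the OntoGPT `extract` command with an explicit model
    and temperature."""
    cmd = [
        "ontogpt", "extract",
        "-t", template,
        "-i", input_file,
        "-m", model,  # the ollama_chat/* prefix routes through litellm's chat API
        "--temperature", str(temperature),
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```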
Update templates to improve ontology grounding, specifically:

1. Improve template prompts to produce more accurate and precise
results.

2. Relax vocabulary branch constraints to enable broader capture of
concepts outside the target branch, since relevant concepts appear in
multiple branches of the vocabulary. Do this for all templates except
`contains_process` and `env_medium`, where concepts are sufficiently
constrained to a single branch.

By doing this we increase our reliance on effective prompts to guide the
LLM to extract relevant concepts without also extracting irrelevant
ones. Any remaining irrelevant concepts may be addressed downstream in
an additional post-processing step that trims them out.

Note that vocabulary constraints don't appear to work for vocabularies
accessed via the BioPortal API.

3. Replace semantically descriptive labels (e.g., `measurement_type`) in
templates with less semantically related labels (e.g., `output`). This
change mitigates the risk of the LLM misinterpreting labels as
placeholders for extracted values, leading to parsing errors and
incorrect results.
Update the `expand_curie` function to use a significantly larger prefix
map, enabling the expansion of a wider range of CURIEs.
Correct the `expand_curie` function to handle CURIEs containing more
than one colon, preventing the `ValueError: too many values to unpack`
error.
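
The fix reduces to splitting on the first colon only. A minimal sketch, with a tiny illustrative prefix map standing in for the much larger one now in use:

```python
def expand_curie(curie, prefix_map):
    """Expand a CURIE (prefix:local_id) to a full IRI.

    maxsplit=1 is the fix: local identifiers that contain extra colons no
    longer raise `ValueError: too many values to unpack`.
    """
    prefix, local_id = curie.split(":", 1)
    return prefix_map[prefix] + local_id

prefix_map = {"ENVO": "http://purl.obolibrary.org/obo/ENVO_"}
iri = expand_curie("ENVO:01000177", prefix_map)
# -> http://purl.obolibrary.org/obo/ENVO_01000177
```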
Implement a visualization to assess the grounding success rates of
different OntoGPT configurations. This visualization utilizes a 100%
stacked bar chart to compare and contrast the performance of various
configurations.
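
A sketch of the chart with made-up rates; since the grounded and ungrounded proportions sum to 1, the stacked bars normalize to 100%:

```python
import matplotlib.pyplot as plt

# Illustrative rates; real values come from the collected benchmark data.
configs = ["baseline", "config_a", "config_b"]
grounded = [0.55, 0.70, 0.48]
ungrounded = [1 - g for g in grounded]

fig, ax = plt.subplots()
ax.bar(configs, grounded, label="grounded")
ax.bar(configs, ungrounded, bottom=grounded, label="ungrounded")
ax.set_ylabel("Proportion of extracted concepts")
ax.set_title("Grounding success rate by OntoGPT configuration")
ax.legend()
plt.show()
```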
Add logging capabilities to the `benchmark_against_standard` function to
provide insights into the ongoing execution process, especially helpful
for this time-consuming operation.
Create a set of test data containing term-set similarity scores for
various configurations, enabling unit testing of downstream functions
that analyze and interpret these scores.
Implement a visualization to assess the accuracy of different OntoGPT
configurations relative to a baseline standard for each predicate 
represented by OntoGPT templates. Use a simple box plot to 
effectively display and compare similarity metrics across predicate
values.
Implement a visualization to assess the accuracy of different OntoGPT
configurations relative to a baseline. Use a simple box plot to display
and compare configurations.
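
A sketch of the box plot with illustrative term-set similarity scores:

```python
import matplotlib.pyplot as plt

# Illustrative scores keyed by configuration; in practice these come from
# the term-set similarity test data described above.
scores = {
    "baseline": [0.62, 0.58, 0.71, 0.66],
    "config_a": [0.70, 0.74, 0.69, 0.73],
    "config_b": [0.51, 0.60, 0.55, 0.49],
}

fig, ax = plt.subplots()
ax.boxplot(list(scores.values()), labels=list(scores))
ax.set_ylabel("Term-set similarity")
ax.set_title("Accuracy by OntoGPT configuration")
plt.show()
```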
Make writing plots to file optional in the `plot_grounding_rates`
function by introducing a new parameter to control this behavior. This
allows for flexible usage, including previewing plots without generating
files.
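
A sketch of the pattern; `write_to_file` and `path` are assumed names for the new parameter:

```python
import matplotlib.pyplot as plt

def plot_grounding_rates(rates, write_to_file=False, path="grounding_rates.png"):
    """Plot grounding rates per configuration; only write to disk when asked."""
    fig, ax = plt.subplots()
    ax.bar(list(rates), list(rates.values()))
    ax.set_ylabel("Grounding rate")
    if write_to_file:
        fig.savefig(path)
    else:
        plt.show()  # preview without generating files
```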
Remove the outdated `add_dataset_annotations_to_workbook` function, as
it lacks the necessary granularity for predicate-level categorization of
semantic annotations, a crucial aspect of our current annotation model.

While alternative approaches exist (e.g., annotating with terms from
multiple vocabularies and then categorizing based on branch), the
ongoing development and active community support for OntoGPT suggest a
more promising long-term solution.
Consolidate multiple OntoGPT workbook annotator functions into a single,
unified function to improve code maintainability, reduce redundancy, and
enhance overall code clarity.
Resolve an issue in the `add_predicate_annotations_to_workbook` function
that prevents it from returning the expected results.
Update `.readthedocs.yaml` to explicitly specify the path to
`config.py`. This ensures proper documentation builds and avoids
potential issues with an upcoming deprecation of inferred configuration.
@clnsmth merged commit ee47493 into main on Jan 17, 2025
5 checks passed