Compare few-shot GPT4 features to embedding features for EFO term precision classification #8
Comments
FYI @yonromai, this was the embedding model I had in mind when we last spoke: https://huggingface.co/michiyasunaga/BioLinkBERT-base. That's from a top-tier group in the NLP space and it's the model submitted by the first author (michiyasunaga) on LinkBERT: Pretraining Language Models with Document Links (Mar. 2022). The reported improvements on a recent SOTA model (PubMedBERT) are substantial, so it might be worth kicking the tires on it. And to be clear, I have no allegiances to this over a LLaMA-derived model, OpenAI or some KG-based approach. Any performance baseline using embeddings would be helpful.
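For reference, getting embeddings out of that model takes only a few lines with the `transformers` library. A minimal sketch (not project code); mean pooling over the last hidden state is one reasonable choice, not necessarily the one we'd settle on:

```python
# Minimal sketch: embedding EFO term text with BioLinkBERT-base.
# Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-base")
model = AutoModel.from_pretrained("michiyasunaga/BioLinkBERT-base")
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Return one mean-pooled 768-d embedding per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

vectors = embed(["inflammatory bowel disease", "type 2 diabetes mellitus"])
print(vectors.shape)  # torch.Size([2, 768])
```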
cc: @eric-czech @dhimmel

TL;DR:
Details:
- Idea behind the new features
- Outcome
- Comments
Experiment results:
- With embedding features
- Without embedding features
Nice @yonromai!
I think this is clear above, but that does not include GPT-4 assignments of the labels as features (i.e. from #6), correct?
I think it would be ok to include the embeddings or a reduction on them (e.g. PCA) as features directly. I like the tree/clustering approach, but my hunch is that it will be hard to show an improvement over that simpler method.
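Something along these lines, purely as an illustration (the names and data below are placeholders, not project code):

```python
# Illustration only: reduce node-text embeddings with PCA and append them to the
# existing tabular features. `embeddings`, `other_features`, and the column names
# are placeholders, not identifiers from this repo.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))                   # one vector per EFO term
other_features = pd.DataFrame({"depth": rng.integers(1, 10, size=500)})

pca = PCA(n_components=64, random_state=0)                 # in practice, fit on training folds only
pca_cols = pd.DataFrame(
    pca.fit_transform(embeddings),
    columns=[f"pca_{i}" for i in range(64)],
)
features = pd.concat([other_features, pca_cols], axis=1)   # feed this to the classifier
print(features.shape)  # (500, 65)
```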
Do you have a sense of how much the macro F1 averages vary across resamplings (e.g. with different folds)?
That's right, I think we should do that next.
Sure, I'll give it a try!
That's a great question! I just started using the ROC AUC & MAE metrics from #9 to look into the performance of the model. I'll spend a little bit of time in notebook land looking at how features & model parameters influence the metrics.
Awesome, sounds good! Just so we're clear, though, I'm proposing that we compare distributions of F1, ROC, MAE, etc. scores between models, where the distributions come from multiple evaluations of those metrics on different folds. Given that this dataset is small, I think we'll need that to help understand which changes are significant. Would you agree?
Yes, totally agree. The idea is to "repeat the experiment" of training the same model on different (stratified) folds of the training set to get an idea of the spread of the metrics. Then we can use this spread to gauge the significance of the metrics calculated once we change the model/features. Is that what you meant?
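Concretely, something like the following, with a placeholder dataset and classifier standing in for ours:

```python
# Sketch of the "repeat the experiment across folds" idea: collect a distribution
# of macro-F1 scores and compare spreads between models. The dataset and
# classifier below are placeholders, not the project's model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, scoring="f1_macro", cv=cv
)  # 50 scores = 5 folds x 10 repeats
print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing these score distributions between the with- and without-embedding models, rather than single point estimates, is what would tell us whether a difference is real.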
Indeed 👍
Okay, so after some time in notebook land, here is the gist of what I found out:

TL;DR: @eric-czech, both of your hunches were 💯:
@dhimmel Implementing the MAE (with the class biases suggested by @eric-czech) has proven very useful! Some results:
More details about the best-performing model (food for thought):
- The model seems to max out on …
- The model seems to slightly overfit the objective function.

For more details about the experiments & findings, take a look at the notebook. All the (non-production-ready) code is in my branch.
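One plausible reading of "the MAE with the class biases" (a guess for illustration, not necessarily the metric defined in #9) is an absolute error over ordered class indices, weighted per true class:

```python
# A guess at what an "MAE with class biases" could look like for ordered precision
# classes. The label names and weights below are made up for illustration.
import numpy as np

CLASS_ORDER = ["low", "medium", "high"]                 # hypothetical ordered labels
TO_INT = {c: i for i, c in enumerate(CLASS_ORDER)}

def weighted_mae(y_true, y_pred, class_weights):
    """Mean absolute error over ordinal class indices, weighted by the true class."""
    t = np.array([TO_INT[c] for c in y_true])
    p = np.array([TO_INT[c] for c in y_pred])
    w = np.array([class_weights[c] for c in y_true])
    return float(np.average(np.abs(t - p), weights=w))

print(weighted_mae(
    ["low", "high", "medium"],
    ["low", "medium", "medium"],
    {"low": 1.0, "medium": 1.0, "high": 2.0},
))  # 0.5 -- the miss on a "high" example counts double
```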
@eric-czech I think I now have enough understanding of the performance of the pre-GPT model that it'd be worth running the training data through the GPT4 prompt you provided and seeing if that does better than the model out of the box! I can estimate the cost of the procedure if that would be useful.
Nice finding that PCA is working better on the node text embeddings than KNN and that 64 dimensions captures much of the performance benefit.
I'm excited to see how the GPT4 features perform!
Very nice @yonromai! Great experimental setup and it's excellent to see some clear separation between those models.
For posterity, I think it would be helpful to say more about what the LDA features consist of. Noting the current details in the notebook:
Awesome -- I'd love to see how it performs on its own and when included as a feature with the other features.
OOC, what is that UI you're looking at there? I don't see any obvious hints in https://github.com/related-sciences/nxontology-ml/tree/romain/embeddings/experimentation.
@eric-czech Noted, I'll add more details in the notebook. (The LDA code directly applies Sklearn's LDA, similar to the PCA - see this code.) I'd like to clean up the code I have in my branch and merge it into the main branch. I'm probably going to end up deleting a lot of the code (e.g. the KNN part) in the near future, but that way it'll be saved in git history (along with the experimental setup). @dhimmel: Would that be fine with you? (It's gonna be quite a big PR :( )
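For context, "directly applies Sklearn's LDA, similar to the PCA" can look roughly like this (illustrative shapes and labels, not the branch's code):

```python
# Rough illustration of a supervised LDA reduction of the node-text embeddings.
# Unlike PCA, LDA uses the labels and yields at most (n_classes - 1) components.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))               # stand-in for node-text embeddings
labels = rng.integers(0, 3, size=300)                 # stand-in for 3 precision classes

lda = LinearDiscriminantAnalysis(n_components=2)      # <= n_classes - 1
lda_features = lda.fit_transform(embeddings, labels)  # fit on training folds only
print(lda_features.shape)  # (300, 2)
```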
The code which displays the model metrics is in the "CatBoost's …" section.
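If the UI in question is CatBoost's built-in training widget, a minimal way to reproduce something similar (with synthetic data, not the project's) is to pass `plot=True` to `fit()` inside a Jupyter notebook:

```python
# Sketch: CatBoost's interactive training plot, rendered in Jupyter via plot=True.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(X_tr, y_tr, eval_set=(X_va, y_va), plot=True)  # interactive metric plot in notebooks
print(model.get_best_score())
```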
Yes sounds good.
Sometimes this will render in nbviewer, but not in this case.
Are there short-term plans to work on this, or is it appropriate to close this issue?
Given that GPT assignments were inferior to text embedding features and didn't add much when combined, I don't think we need to use GPT features at all. Saves on cost and complexity.
I definitely agree. Noting #34 (comment) as the most recent experiment, at time of writing, that still had these features.
@yonromai suggested trying to embed the text descriptions, labels, aliases, etc. associated with EFO terms and using those embeddings as a part of #2.
It would be very interesting to see how a model like the one in #7 improves with embedding features by comparison to a model with only the few-shot labels in #6.
The LLM-derived features will definitely be harder to maintain/generate. On the other hand, I know the labels we provided in #5 are not perfect, and I expect that the few-shot features will be more helpful in figuring out which ones are most likely to be mislabeled and why (since they can be directly compared). Nevertheless, contrasting the predictive value of the two could be an important determining factor for how this project, or at least #2, evolves.