Finetuning for domain-specific data due to recommended input size limitations #166

Open
fif911 opened this issue Feb 2, 2025 · 3 comments

Comments

fif911 commented Feb 2, 2025

Hey Prior Labs team

First, I want to thank you for open-sourcing this. It's a great foundation model. I have not found any scripts for the finetuning process, so my question is:

Does it make sense to fine-tune TabPFN on domain data? I have a dataset with ~70,000 rows, and because the model was trained on synthetic data, I am curious whether fine-tuning it on domain data would help. Also, the compute required for such a process does not seem too large, as you used only two weeks on 8x 2080 GPUs for training.

If so, how would you recommend approaching the task?

Additionally, I am curious whether TabPFN can leverage textual data. I saw somewhere that AutoGluon tries to fuse textual and tabular data, but I'm not sure how effective that would be. If so, what is the optimal text size based on your experience?

Best regards, Oleksandr

Collaborator

LeoGrin commented Feb 4, 2025

Hey @fif911 !
Thanks for the kind words! For finetuning on one dataset, there is this script: https://github.com/LennartPurucker/finetune_tabpfn_v2
For finetuning on a set of related datasets from the same domain, we don't have anything public for now, but we want to release this quite soon.
Concerning text data, TabPFN v2 leverages text features in the API (https://github.com/PriorLabs/tabpfn-client)!

Author

fif911 commented Feb 4, 2025

Thanks @LeoGrin ! That helps a lot.

Just so I fully understand: the model served via the API does support text features, but the open-source one does not, right?

Cheers, Oleksandr

Collaborator

LeoGrin commented Feb 4, 2025

The local package supports text features in the sense that it doesn't break if the input contains some, but it treats them as categorical for now, which means performance should be considerably lower than on the API if you have rich text features.
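
As a rough illustration of why categorical treatment loses information on rich text, here is a minimal sketch in plain Python. It is not TabPFN's actual preprocessing; the helper `encode_as_categories` and the sample reviews are hypothetical and only mimic what "categorical" generally means: every distinct string becomes an opaque integer code.

```python
# Illustrative only: this is NOT TabPFN's preprocessing code.
# It mimics the general idea of treating a text column as categorical.

def encode_as_categories(values):
    """Map each distinct string to an integer code (hypothetical helper)."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # Assign the next unused code, in order of first appearance.
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


reviews = [
    "battery died after two days",
    "battery died after 2 days",    # nearly the same meaning as the first
    "great screen, fast shipping",
    "battery died after two days",  # exact repeat of the first
]

encoded, codes = encode_as_categories(reviews)
print(encoded)     # [0, 1, 2, 0]
print(len(codes))  # 3
```

Only the exact repeat shares a code; the two near-identical reviews get unrelated codes (0 and 1), so a model that sees only the codes cannot tell that they mean the same thing. A text-aware pipeline could in principle exploit that similarity.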
