Exact training corpus #4
Hi @bminixhofer,
thanks for sharing your work. Could you provide more details on the training corpus?
In the paper, you write that the models are trained on subsets of OSCAR. What exact subset do you use? The unshuffled deduplicated versions (e.g., unshuffled_deduplicated_de)? Any random n samples with a specific seed? Or just the first/last n rows?
Best,
Malte
Hi Malte, sure! First off, I want to mention: we initially restricted the dataset size to 4 GiB to make sure we don't need huge amounts of data in the target language, and for practical reasons w.r.t. disk space. This limit stayed in place mainly for historic reasons, since in the meantime we have dedicated evaluations on low-resource languages, which makes the restriction somewhat obsolete. So if you plan to train a new model, I'd recommend training on the full corpus. To answer your question: we used the first 4 GiB of the respective OSCAR subset (e.g. unshuffled_deduplicated_de). Let me know if you have any more questions!
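A minimal sketch of what such a "first 4 GiB" cut could look like with the Hugging Face datasets library (this is not the authors' exact script; the OSCAR config name and the streaming approach are assumptions based on the thread):

```python
# Sketch only: stream an OSCAR split and keep roughly the first 4 GiB of text.
# Dataset/config names are assumptions taken from the discussion above.
from datasets import load_dataset

LIMIT_BYTES = 4 * 1024**3  # 4 GiB

stream = load_dataset(
    "oscar", "unshuffled_deduplicated_de", split="train", streaming=True
)

written = 0
with open("train.txt", "w", encoding="utf-8") as f:
    for example in stream:
        text = example["text"] + "\n"
        size = len(text.encode("utf-8"))
        if written + size > LIMIT_BYTES:
            break
        f.write(text)
        written += size
```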
Thanks! This is what I was looking for.
I've followed your instructions and created the 4 GB train file, which corresponds to 1,700,699 examples (with a custom GPT-2 tokenizer trained on the same data). Given the 512 batch size, how do you then end up with 250k training steps? Or are you training for multiple epochs?
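A minimal sketch of training such a custom GPT-2-style tokenizer on the same corpus with transformers' train_new_from_iterator (the file name and vocabulary size are assumptions for illustration, not details from this thread):

```python
# Sketch only: re-train the GPT-2 byte-level BPE tokenizer on a new corpus.
# "train.txt" and vocab_size=50257 are assumed for illustration.
from transformers import AutoTokenizer

def text_batches(path, batch_size=1000):
    # Yield the corpus in batches of lines to avoid loading it all into memory.
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

base = AutoTokenizer.from_pretrained("gpt2")
tokenizer = base.train_new_from_iterator(text_batches("train.txt"), vocab_size=50257)
tokenizer.save_pretrained("gpt2-tokenizer-de")
```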
Yes, we're training for many epochs! I just checked: for German it was ~75. Also, I've just made the wandb project public, which should make it easier if you're aiming to reproduce some results. For example, this is the German WECHSEL GPT2 run from the paper: https://wandb.ai/llms-transfer-learning/main/runs/14txxjm8 (I used this project to internally track everything, so it is not very cleaned up). And as I mentioned previously, I would recommend using the 4 GB limit only if you're aiming to reproduce results from the paper; otherwise I'd go with no limit.
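The numbers in this thread are consistent with ~75 epochs; a quick back-of-the-envelope check (assuming each training step consumes one batch of 512 of the 1,700,699 examples):

```python
# Back-of-the-envelope check of the epoch count from the figures quoted above.
examples = 1_700_699   # sequences in the 4 GB German train file
batch_size = 512
total_steps = 250_000

steps_per_epoch = examples / batch_size   # ~3321.7 steps per pass over the data
epochs = total_steps / steps_per_epoch    # ~75.3 epochs
print(round(epochs, 1))
```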
What is the reason for that many epochs? Did you do any ablation with fewer epochs on a larger dataset?
As I mentioned above, the 4 GiB limit was there mainly for practical reasons.
To go into a bit more detail: back when we started the experiments, Google's Cloud TPU VMs were restricted to 96 GB of disk space and it was not possible to attach extra disks (thankfully, in the meantime we can do this!). Including model checkpoints, there wouldn't have been space for much more data. That's of course not ideal, and in hindsight we maybe should have spent more engineering effort to bypass this restriction. However, we don't see any signs of overfitting even when training for that many epochs, and the results from WECHSEL hold when training on more data (I'm verifying this by training models for Ukrainian at the moment). I don't know what exactly you're trying to use WECHSEL for. If you're trying to reproduce our results, you can train with the restricted corpus for a large number of epochs, as we did in the paper. If you're training a new model for something else, I'd recommend just using as much data as possible.
Thanks for the clarification. And yes, I'm reproducing your work first and then using it for a new model.
Hey, I just came across https://github.com/malteos/german-gpt. Awesome work! I guess you were able to reproduce our results. I think https://huggingface.co/malteos/gpt2-xl-wechsel-german is now the largest public German LM. Do you have a Twitter handle / do you mind if I promote it on Twitter?
Probably yes... or at least I'm not aware of any other model ;) Feel free to promote it: https://twitter.com/XYOU