This is the embedding training code for the scientific embedding project [DocReLM: Mastering Document Retrieval with Language Model (arXiv:2405.11461)](https://arxiv.org/abs/2405.11461).
The training data is the synthetic data built via veya2ztn/Synthetic-Science (github.com), scripts that efficiently create synthetic science question-answer (QA) pairs with reasoning, based on veya2ztn/uparxive (github.com), an LLM-friendly dataset of the whole arXiv .tex source.
This repo integrates the following embedder training methods and techniques:
- ART
- SGPT
- Fine-tuning
- Pipeline training
- Tensor parallelism: 1D, 2D, and so on
- Gradient Cache (a minimal sketch follows this list)
- QLoRA (a minimal sketch follows this list)
- Uniem
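
To make the Gradient Cache entry concrete, here is a minimal sketch of the technique (large effective contrastive batches under a fixed memory budget). This is not the repo's actual implementation: `model`, `queries`, `docs`, `chunk_size`, and `temperature` are hypothetical stand-ins, and `model` is assumed to map a batch of input tensors directly to pooled embeddings.

```python
import torch
import torch.nn.functional as F

def grad_cache_step(model, queries, docs, optimizer, chunk_size=8, temperature=0.05):
    """One large-batch contrastive step with Gradient Cache: embed
    without a graph, cache the loss gradient w.r.t. each embedding,
    then re-encode chunk by chunk and backprop the cached gradients."""
    q_chunks = queries.split(chunk_size)
    d_chunks = docs.split(chunk_size)

    # 1) Forward all chunks without autograd to collect embeddings cheaply.
    with torch.no_grad():
        q_reps = torch.cat([model(c) for c in q_chunks])
        d_reps = torch.cat([model(c) for c in d_chunks])

    # 2) In-batch InfoNCE loss over the detached embeddings; positives
    #    sit on the diagonal (one relevant doc per query). backward()
    #    here only fills q_reps.grad / d_reps.grad, not model params.
    q_reps.requires_grad_(True)
    d_reps.requires_grad_(True)
    scores = F.normalize(q_reps, dim=-1) @ F.normalize(d_reps, dim=-1).T
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores / temperature, labels)
    loss.backward()
    q_grads = q_reps.grad.split(chunk_size)
    d_grads = d_reps.grad.split(chunk_size)

    # 3) Re-encode each chunk with the graph enabled and push the cached
    #    embedding gradients back through the model parameters.
    optimizer.zero_grad()
    for chunk, grad in zip(q_chunks, q_grads):
        model(chunk).backward(gradient=grad)
    for chunk, grad in zip(d_chunks, d_grads):
        model(chunk).backward(gradient=grad)
    optimizer.step()
    return loss.item()
```

Only one chunk's activations live in memory at a time during steps 1 and 3, so the contrastive loss in step 2 can span a batch far larger than what fits in a single forward-backward pass.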
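Likewise, QLoRA fine-tuning typically means loading the frozen backbone in 4-bit NF4 and training low-rank adapters on top. Below is a minimal sketch using Hugging Face `transformers` and `peft`; the model name is a placeholder and `target_modules` depends on the backbone architecture (e.g. `q_proj`/`v_proj` for LLaMA-style models).

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, BitsAndBytesConfig

# Quantize the frozen backbone to 4-bit NF4 with double quantization;
# compute still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModel.from_pretrained(
    "intfloat/e5-base-v2",  # placeholder encoder backbone
    quantization_config=bnb_config,
)

# Attach trainable low-rank adapters to the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],  # architecture-dependent
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```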