Split dataset definitions into individual files #145

Merged 13 commits on Nov 6, 2024

zqhuang211
Contributor

  • Separated dataset definitions from ultravox/data/registry.py into individual files for each dataset.
  • Updated all datasets to use transcription as the default assistant response.
  • Added the CoVoST2 dataset.
  • Minor bug fix: ensured that split_type is set correctly.
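
For readers unfamiliar with this kind of refactor, here is a minimal sketch of how per-dataset definition files can feed a shared registry. It is not the code from this PR: `DatasetConfig`, `register_datasets`, `DATASET_REGISTRY`, `TRANSCRIPTION_TEMPLATE`, and the dataset names and paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical placeholder: the actual transcription template is not shown in this PR.
TRANSCRIPTION_TEMPLATE = "{{text}}"


@dataclass(frozen=True)
class DatasetConfig:
    """Illustrative dataset definition (field names are assumptions)."""

    name: str
    path: str
    # Transcription is the default assistant response unless a dataset overrides it.
    assistant_template: str = TRANSCRIPTION_TEMPLATE


# Central registry, keyed by dataset name.
DATASET_REGISTRY: Dict[str, DatasetConfig] = {}


def register_datasets(configs: List[DatasetConfig]) -> None:
    """Add configs to the registry, rejecting duplicate names."""
    for config in configs:
        if config.name in DATASET_REGISTRY:
            raise ValueError(f"Dataset already registered: {config.name}")
        DATASET_REGISTRY[config.name] = config


# In a split layout, each list below would live in its own module
# (for example one file per dataset) and be imported by the registry.
COVOST2_CONFIGS = [
    DatasetConfig(name="covost2-example", path="example/covost2"),
]
COMMONVOICE_CONFIGS = [
    DatasetConfig(name="commonvoice-example", path="example/commonvoice"),
]

register_datasets(COVOST2_CONFIGS + COMMONVOICE_CONFIGS)
```

Keeping each dataset's configs in a separate list means adding a dataset only touches its own file plus one registration line, and the duplicate-name check catches collisions between files early.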

Zhongqiang Huang added 3 commits November 1, 2024 23:46
@zqhuang211 zqhuang211 requested a review from liPatrick November 2, 2024 04:01
Contributor

@liPatrick liPatrick left a comment

LGTM

Zhongqiang Huang added 3 commits November 5, 2024 23:58
@zqhuang211 zqhuang211 requested a review from liPatrick November 6, 2024 08:07
Zhongqiang Huang added 3 commits November 6, 2024 00:15
@zqhuang211 zqhuang211 merged commit 29a11dc into main Nov 6, 2024
1 check passed
akshat0311 pushed a commit to jiviai/audio-llm that referenced this pull request Jan 30, 2025
- Separated dataset definitions from `ultravox/data/registry.py` into individual files for each dataset.
- Ensured that `split_type` is set correctly.  
- Updated all datasets to use transcription as the default assistant response.
- Added support for the CoVoST2 dataset.

---------

Co-authored-by: Zhongqiang Huang
zqhuang211 pushed a commit that referenced this pull request Feb 12, 2025
* Fix typo in README.md (#128)
* [bugfix] Missing enable_fsdp in 70b config (#132)
* Update load warnings (#126)
* Generic datasets with inheritance (#135)
* Switch InterleaveDataset to use weights (e.g., 2.0, 0.5, etc) (#140)
* Break up datasets.py (#141)
* Update registry with more languages commonvoice (#143)
* Split dataset definitions into individual files  (#145)
* Add whisper masking (#146)
* Defining block size in UltravoxConfig, and solving assertions (#157)
2 participants