All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Some of the banned words were not being filtered correctly; these are now removed as intended.
- Added postprocessing of corpora, including removal of duplicates, bot comments, and comments from inappropriate subreddits.
- Added `--hub-repo-id` to the CLI, which can be used to upload the resulting dataset to the Hugging Face Hub.
- Initial release, which includes the CLI command `build`, which builds the Scandinavian Reddit corpus. Run `build --help` to see more information.