[Feature request] Pipeline remove download file after process and extract single language #56
Comments
Hello and thanks for the question, that's a great suggestion!
IIRC there is some code in Ungoliant that could enable this type of processing. However, I don't think the actual pipeline code is ready for this, and I don't know how much work it would take. There are two approaches here:
I'm currently busy with other things, so it might take some time to see this implemented. There's also a third option that could be re-implemented for the latest pipeline, and that is not documented (because it has not been extensively tested): we can provide "rebuild" files (one per language) that basically tell Ungoliant which shards to use and how to rebuild the corpus. These files contain metadata for each document.
While this could be done, you'd still need to run the whole process (identify the whole corpus, process all shards, etc.). I guess that we could provide the option, but it would only help with disk size.
Thanks for the reply and the workaround!
Can you pinpoint the lines of code/files that need to be adjusted to make this work?
Yes, I checked this subcommand, but I don't understand how to use it, especially these two args. Right now my "dirty" workaround to tackle the size problem is to split wet.path into multiple files, each with 2000 lines/links. For each file I do this step:
Thanks again for the reply.
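The splitting part of the workaround above could be sketched roughly as follows. This is only an illustration, not part of Ungoliant: the function name `split_paths_file`, the `.partNNNN` naming scheme and the output directory are assumptions made up for the example. Only the chunk size of 2000 links per file comes from the comment.

```python
from pathlib import Path


def split_paths_file(src: Path, out_dir: Path, lines_per_chunk: int = 2000) -> list[Path]:
    """Split a text file of links into numbered chunk files.

    Hypothetical helper: the name and the ".partNNNN" suffix are
    illustrative choices, not something Ungoliant provides.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only non-empty lines so that blank lines do not count as links.
    lines = [ln for ln in src.read_text().splitlines() if ln.strip()]
    chunks: list[Path] = []
    for start in range(0, len(lines), lines_per_chunk):
        index = start // lines_per_chunk
        chunk_path = out_dir / f"{src.name}.part{index:04d}"
        chunk_path.write_text("\n".join(lines[start:start + lines_per_chunk]) + "\n")
        chunks.append(chunk_path)
    return chunks
```

Each resulting chunk file can then be fed to the download and processing steps one at a time, so only one chunk's worth of shards has to sit on disk at once.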
Adding something like Feel free to try to implement it! However, to be merged, I think that it would be better to integrate this in the CLI (like For rebuild files, we have not distributed them (yet), and I haven't ported the source code for the latest pipeline. I might do it if you're interested, but it's not top prio right now :(
Hi, thank you for releasing this tool!
Since CC dumps are very size/disk demanding,
can we have an optional pipeline step like this:
Also,
can we add a pipeline option that lets us choose which language to process instead of processing all of them?
Maybe something like
ungoliant pipeline download/ src/ -lang id
Thank you in advance.
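To make the first half of the request (removing downloaded files once they have been processed) more concrete, here is a minimal sketch of the idea. It is not Ungoliant code and does not use its API: `process_then_delete` and the `process` callback are hypothetical names, and the callback stands in for whatever per-file processing the pipeline would do.

```python
from pathlib import Path
from typing import Callable, Iterable


def process_then_delete(files: Iterable[Path], process: Callable[[Path], None]) -> int:
    """Run `process` on each file, then delete the file to free disk space.

    Hypothetical sketch: `process` is a placeholder for the real per-file
    processing step. If it raises, the current file is not deleted, so it
    can be retried later.
    """
    removed = 0
    for path in files:
        process(path)   # an exception here leaves the file in place
        path.unlink()   # delete only after successful processing
        removed += 1
    return removed
```

With this shape, peak disk usage is bounded by the files that have been downloaded but not yet processed, rather than by the whole dump.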