[Feature request] Pipeline remove download file after process and extract single language #56

Open
acul3 opened this issue Jul 4, 2022 · 3 comments
Labels: enhancement (New feature or request)

Comments


acul3 commented Jul 4, 2022

Hi, thank you for releasing this tool!

Since Common Crawl (CC) dumps are very demanding in terms of disk space,

could we have optional pipeline behaviour like this:

  1. Process each file (run the pipeline step) as soon as it is downloaded, instead of waiting for all downloads to complete.
  2. Remove each file immediately after it has been processed (this would save disk usage).

Also,
could there be an option to choose which language to process, instead of processing all of them?

Maybe something like:
ungoliant pipeline download/ src/ -lang id

Thank you in advance.

acul3 added the enhancement (New feature or request) label Jul 4, 2022
Member

Uinelj commented Jul 5, 2022

Hello and thanks for the question, that's a great suggestion!

> Process each file (run the pipeline step) as soon as it is downloaded, instead of waiting for all downloads to complete.

IIRC there is some code in Ungoliant that could enable this type of processing. However, I don't think the actual pipeline code is ready for this, and I don't know how much work it would take. There are two approaches here:

  1. Download shards, and trigger their deletion once the shard has been processed (quite easy I think)
  2. Treat shards as streams of data that never touch the disk, and do the processing online.
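
As a rough illustration of the second approach, the sketch below consumes a shard from any `Read` source, so nothing has to be written to disk first. It is not Ungoliant code: `count_records` is a hypothetical name, the stream is assumed to be already decompressed (real WET shards are gzip-compressed, and decoding them would need an external crate), and counting lines that start with `WARC/1.0` is only a stand-in for real processing.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper: counts WARC record headers in a decompressed stream.
/// `source` can be a file, a network body, or an in-memory buffer, since
/// the function only relies on the `Read` trait.
fn count_records<R: Read>(source: R) -> io::Result<usize> {
    let reader = BufReader::new(source);
    let mut records = 0;
    for line in reader.lines() {
        // `lines()` returns an error on invalid UTF-8; a real pipeline
        // would probably want to handle that case explicitly.
        if line?.starts_with("WARC/1.0") {
            // Simplification: a body line starting with the same text
            // would also be counted here.
            records += 1;
        }
    }
    Ok(records)
}
```

Because the function is generic over `Read`, the same code path could be fed from a download stream or from a local file, which is what would make deleting (or never storing) the shard possible.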

I'm currently busy with other things so it might take some time to see this implemented.

There's also a third option that could be re-implemented for the latest pipeline, and that is not documented (because it has not been extensively tested): we can provide "rebuild" files (one per language) that basically tell Ungoliant which shards to use and how to rebuild the corpus. These files contain metadata for each document.

> Could there be an option to choose which language to process, instead of processing all of them?

While this could be done, you'd still need to run the whole process (identify the whole corpus, process all shards, etc.). I guess that we could provide the option, but it would only help with disk usage.

Author

acul3 commented Jul 5, 2022

Thanks for the reply and the workaround!

> Download shards, and trigger their deletion once the shard has been processed (quite easy I think)

Can you point me to the lines of code/files that need adjusting to make this work?
Maybe I can test/try it in my fork (I'm not an expert in Rust, though).

> There's also a third option that could be re-implemented for the latest pipeline, and that is not documented (because it has not been extensively tested): we can provide "rebuild" files (one per language) that basically tell Ungoliant which shards to use and how to rebuild the corpus. These files contain metadata for each document.

Yes, I checked this subcommand, but I don't understand how to use it, especially these two arguments:
<src-rebuild> <src-shards>
Is a shard the downloaded *.txt.gz file, or the processed *.txt.gz?

Right now, my "dirty" workaround for the size problem is to split wet.path into multiple files, each containing 2000 lines/links.

For each file, I do the following:

  1. Run the download command.
  2. Run the pipeline.
  3. Get the JSON file for my language (in my case id_meta.jsonl) and move it somewhere else.
  4. Remove all files inside the download and pipeline folders, then go back to step 1.
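
The batching part of this workaround can be sketched in Rust as follows. `split_into_batches` is a hypothetical helper, not part of Ungoliant; it assumes the path list has one link per line and only groups the lines, leaving the download, pipeline and cleanup steps to be run once per batch.

```rust
/// Hypothetical helper: groups the non-empty lines of a path list into
/// batches of at most `batch_size` links each (2000 in the workaround above).
fn split_into_batches(paths: &str, batch_size: usize) -> Vec<Vec<&str>> {
    assert!(batch_size > 0, "batch_size must be positive");
    // Trim whitespace and drop blank lines so they do not count as links.
    let lines: Vec<&str> = paths
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect();
    // `chunks` yields slices of `batch_size` items; the last one may be shorter.
    lines.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}
```

Each returned batch could then be written to its own file and passed to the download command, so that peak disk usage is bounded by one batch rather than the whole dump.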

Thanks again for the reply.

Member

Uinelj commented Jul 5, 2022

Adding something like remove_file at the end of the process_shard function should do it https://github.com/oscar-corpus/ungoliant/blob/master/src/pipelines/oscardoc/pipeline.rs#L221.

Feel free to try to implement it! However, to be merged, I think it would be better to integrate this into the CLI (like ungoliant pipeline src/ dst/ --del_src or something like that).
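
A minimal sketch of that idea, assuming a boolean option is threaded through from the CLI: the shard is deleted only after processing succeeds, and only when the option is set. `process_then_maybe_delete` and its `del_src` parameter are hypothetical names, and the closure stands in for whatever Ungoliant actually does with a shard; only `std::fs::remove_file` is a real standard library call.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical wrapper around shard processing.
/// Runs `process` on the shard, then removes the source file if `del_src`
/// is true. If processing fails, the error is returned and the file is kept
/// so that the shard can be retried.
fn process_then_maybe_delete<F>(shard: &Path, del_src: bool, process: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    process(shard)?;
    if del_src {
        fs::remove_file(shard)?;
    }
    Ok(())
}
```

Deleting only after a successful run is a deliberate choice here: a failed shard stays on disk and does not need to be downloaded again.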

For rebuild files, we have not distributed them (yet), and I haven't ported the source code for the latest pipeline. I might do it if you're interested, but it's not top prio right now :(
