[Feature request] Pipeline remove download file after process and extract single language #56

acul3 · 2022-07-04T23:11:42Z

Hi, thank you for releasing this tool!

Since the CC dump is very demanding in size/disk space,

could we have an optional pipeline step like this:

Process each file (pipeline step) immediately during the download step, instead of waiting for all files to finish downloading
Remove each file immediately after it has been processed (this will save disk usage)

Also,
could we have a pipeline option to choose which language to process, instead of processing all of them?

maybe something like
ungoliant pipeline download/ src/ -lang id

Thank you in advance.

Uinelj · 2022-07-05T08:19:28Z

Hello and thanks for the question, that's a great suggestion!

Process each file (pipeline step) immediately during the download step, instead of waiting for all files to finish downloading

IIRC there is some code in Ungoliant that could enable this type of processing. However, I don't think the actual pipeline code is ready for this, and I don't know how much work it would take. There are two approaches here:

Download shards, and trigger the deletion of each shard once it has been processed (quite easy I think)
Treat shards as streams of data that will never touch the disk, and do the processing on-line.
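
The second approach can be illustrated with a minimal sketch, assuming the shard arrives as an already-decompressed stream (for instance an HTTP response body wrapped in a gzip decoder, neither of which is shown here). `count_records` is a hypothetical helper, not Ungoliant code. It only counts records, to show that the processing can run over any `BufRead` without the shard ever being written to disk.

```rust
use std::io::{self, BufRead};

/// Counts the WARC records found in a decompressed WET stream.
/// Hypothetical helper for illustration; not part of Ungoliant.
pub fn count_records<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut records = 0;
    for line in reader.lines() {
        // Each record in a WET file opens with a "WARC/1.0" version line.
        // Simplification: a body line starting with the same text would
        // also be counted.
        if line?.starts_with("WARC/1.0") {
            records += 1;
        }
    }
    Ok(records)
}
```

Because the function is generic over the reader, the same code works on a file, an in-memory buffer, or a network stream. A real implementation would hand each record to the pipeline instead of counting it.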

I'm currently busy with other things so it might take some time to see this implemented.

There's also a third option that could be re-implemented for the latest pipeline, and that is not documented (because it has not been extensively tested): we can provide "rebuild" files (one per language) that basically tell Ungoliant which shards to use and how to rebuild the corpus. These files contain metadata for each document.

could we have a pipeline option to choose which language to process, instead of processing all of them?

While this could be done, you'd still need to run the whole process (identify the whole corpus, process all shards, etc.). I guess that we could provide the option, but it would only help with disk usage.
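
A minimal sketch of what such an option could amount to, assuming language identification has already run on every document. `Doc` and `keep_languages` are hypothetical names, not Ungoliant types. The filter only restricts what gets written out, which is why it would save disk space but not processing time.

```rust
use std::collections::HashSet;

/// A document paired with the language assigned to it by identification.
/// Hypothetical type for illustration; not part of Ungoliant.
#[derive(Debug, Clone, PartialEq)]
pub struct Doc {
    pub lang: String,
    pub text: String,
}

/// Keeps only the documents whose language is in `wanted`.
/// Every document must already have been identified before this runs.
pub fn keep_languages(docs: Vec<Doc>, wanted: &HashSet<&str>) -> Vec<Doc> {
    docs.into_iter()
        .filter(|d| wanted.contains(d.lang.as_str()))
        .collect()
}
```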

acul3 · 2022-07-05T09:27:27Z

Thanks for the reply and the workaround!

Download shards, and trigger their deletion once the shard has been processed (quite easy I think)

Can you point me to the lines of code/files that need to be adjusted to make this work?
Maybe I can try it in my fork (I'm not an expert in Rust, though).

There's also a third option that could be re-implemented for the latest pipeline, and that is not documented (because it has not been extensively tested): we can provide "rebuild" files (one per language) that basically tell Ungoliant which shards to use and how to rebuild the corpus. These files contain metadata for each document.

Yes, I checked this subcommand, but I don't understand how to use it, especially these two arguments:
<src-rebuild> <src-shards>
Is a shard a *.txt.gz file, or a processed *.txt.gz?

Right now my "dirty" workaround for the size problem is splitting wet.path into multiple files, each with 2000 lines/links.

For each file I do these steps:

1. Run the download command
2. Run the pipeline
3. Get the JSONL file for my language (in my case id_meta.jsonl) and move it somewhere
4. Remove all files inside the download and pipeline folders and go back to step 1
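
The splitting part of this workaround could be sketched as follows. This is only an illustration under stated assumptions: `split_paths` and the `batch_NNNN.paths` naming are hypothetical, and the download and pipeline commands still have to be run by hand (or by a script) on each batch file.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Splits the path list in `paths_file` into files of at most `batch_size`
/// lines each, written to `out_dir` as batch_0000.paths, batch_0001.paths, ...
/// Hypothetical helper for illustration; not part of Ungoliant.
pub fn split_paths(
    paths_file: &Path,
    out_dir: &Path,
    batch_size: usize,
) -> io::Result<Vec<PathBuf>> {
    if batch_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "batch_size must be greater than 0",
        ));
    }
    let content = fs::read_to_string(paths_file)?;
    // Skip blank lines so that no batch contains an empty link.
    let lines: Vec<&str> = content.lines().filter(|l| !l.trim().is_empty()).collect();
    fs::create_dir_all(out_dir)?;

    let mut written = Vec::new();
    for (i, chunk) in lines.chunks(batch_size).enumerate() {
        let path = out_dir.join(format!("batch_{:04}.paths", i));
        let mut body = chunk.join("\n");
        body.push('\n');
        fs::write(&path, body)?;
        written.push(path);
    }
    Ok(written)
}
```

With `batch_size` set to 2000 this reproduces the batches described above. Each returned file can then go through the download, pipeline, move, and cleanup steps in turn, so only one batch is on disk at a time.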

Thanks again for the reply.

Uinelj · 2022-07-05T09:58:37Z

Adding something like remove_file at the end of the process_shard function should do it: https://github.com/oscar-corpus/ungoliant/blob/master/src/pipelines/oscardoc/pipeline.rs#L221

Feel free to try to implement it! However, for it to be merged, I think it would be better to integrate this in the CLI (like ungoliant pipeline src/ dst/ --del_src or something like that)
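
A minimal sketch of the idea, assuming the deletion is gated by a boolean that the CLI would set. `process_then_cleanup` and its `process` closure are hypothetical stand-ins. The real `process_shard` signature is not reproduced here, and `del_src` only mirrors the flag name suggested above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Runs `process` on the shard at `shard`, then deletes the source file
/// if `del_src` is set. Hypothetical wrapper for illustration; not
/// Ungoliant code.
pub fn process_then_cleanup<F>(shard: &Path, del_src: bool, process: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    // The `?` returns early on failure, so a shard that could not be
    // processed is never deleted and can be retried.
    process(shard)?;
    if del_src {
        fs::remove_file(shard)?;
    }
    Ok(())
}
```

Deleting only after a successful run is a deliberate choice in this sketch: a failed shard stays on disk instead of having to be downloaded again.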

For rebuild files, we have not distributed them (yet), and I haven't ported the source code for the latest pipeline. I might do it if you're interested, but it's not top prio right now :(

acul3 added the enhancement (New feature or request) label Jul 4, 2022

acul3 assigned pjox Jul 4, 2022
