
Tar stream closes early on large stream (44+Gb) #111

Open

MorningLightMountain713 opened this issue Sep 16, 2024 · 11 comments

@MorningLightMountain713
Contributor

Hi there,

I'm having an issue with tar-fs and I don't even know where to start debugging it. Any pointers on how I can debug this would be great.

It could even have something to do with the underlying system running out of memory and swapping (there is plenty of swap space, though). It quite often ends up swapping, but that shouldn't cause the stream to end.

It's pretty consistent: I get around 30-40 GB transferred, then the stream just ends, with no errors raised.

Here is how I am using it:

const workflow = [];

workflow.push(tar.pack(base, {
  entries: folders,
}));

workflow.push(res); // res is an Express writable stream

const pipeline = promisify(stream.pipeline);
const error = await pipeline(...workflow).catch((err) => err);

I added a stream.PassThrough() in there to see how many bytes are being written, and when the stream ends, it hasn't written all the bytes.

@mafintosh
Owner

does the pipeline end with no error?

@MorningLightMountain713
Contributor Author

MorningLightMountain713 commented Sep 16, 2024

does the pipeline end with no error?

Yes, no error when the pipeline ends. Also, the tar is not corrupt; it's just missing files. Is there a way to get errors out per file? If there is an issue, does the file get skipped?

Edit: also worth noting, quite often it seems like it is a subdirectory that gets skipped, which in this case is 11 GB.

Thanks
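As far as I can tell tar-fs doesn't emit per-file errors, but its `ignore` option is called for each path it walks, so a logging wrapper (a debugging sketch; `makeLoggingIgnore` is a hypothetical helper, not tar-fs API) can at least show which entries were visited:

```javascript
// Returns an ignore() callback for tar.pack() that records every path
// tar-fs visits into `log` without actually skipping anything.
function makeLoggingIgnore(log) {
  return function ignore(name) {
    log.push(name);
    return false; // false = keep the entry
  };
}

// Sketch of wiring it into the pack call from above:
//   const visited = [];
//   const pack = tar.pack(base, {
//     entries: folders,
//     ignore: makeLoggingIgnore(visited),
//   });
//   pack.on('end', () => console.log('visited', visited.length, 'paths'));
```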

@mafintosh
Owner

If you only tar that one folder, does it work? I.e. anything you can do to reduce the problem down helps us fix it. Try not using any HTTP stream also; just run tar-fs on that folder and see if that's broken too.

@MorningLightMountain713
Contributor Author

MorningLightMountain713 commented Sep 16, 2024

If you only tar that one folder, does it work? I.e. anything you can do to reduce the problem down helps us fix it. Try not using any HTTP stream also; just run tar-fs on that folder and see if that's broken too.

Yes - that folder does tar successfully.

Okay so I've narrowed it down somewhat.

I have a feeling it is some sort of timing issue, something to do with the subdirectory getting walked and stat being run for each file, which I'm sure takes time, given that it is 5.5k files and 11 GB.

If I add the subdirectory as an entry before the main directory, the entire tar gets sent successfully. (I've only tested it once so far, but will continue testing.)

Some examples:

Here is how I'm adding the directories:

    const folders = [
      'blocks',
      'chainstate',
      'determ_zelnodes',
    ];

    workflow.push(tar.pack(base, {
      entries: folders,
    }));

If the entries are added in this order (they are all directories containing files), then quite often, but not always, the index dir ends up empty. Note there is an index directory inside the blocks directory:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks
chainstate
determ_zelnodes

If the entries are added in this order, it works fine and files / dirs are only added once:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks/index
blocks
chainstate
determ_zelnodes

If the entries are added in this order, the files can sometimes be doubled up, depending on whether the blocks entry loads the files from the index subdir:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks
blocks/index
chainstate
determ_zelnodes

I'll try eliminating the HTTP stream and see what happens.

@MorningLightMountain713
Contributor Author

Some info on what is being tarred:

root@banana:/home/davew/.flux# du -h blocks chainstate determ_zelnodes
11G	blocks/index
34G	blocks
415M	chainstate
7.6G	determ_zelnodes
root@banana:/home/davew/.flux# find . -type f | wc -l
10191
root@banana:/home/davew/.flux#

@mafintosh
Owner

Nice, thanks. So you are saying it's an issue when the folder with tons of small files is at the end?

@MorningLightMountain713
Contributor Author

MorningLightMountain713 commented Sep 16, 2024

Yes, it seems the ordering is important. I get a good result if I put folders with lots of small files before folders with big files, as well as explicitly including child folders with lots of files as entries before their parent folder.

Tested this several times now - getting good results by changing the order.

Edit: This is the ordering I'm using now:

    const folders = [
      'determ_zelnodes',
      'blocks/index',
      'chainstate',
      'blocks',
    ];

@MorningLightMountain713
Contributor Author

I was wrong with the above statement. If I add both the blocks/index and blocks folders, it sometimes adds blocks/index twice.

So as far as I can tell, I'm back to the initial timing issue: sometimes it adds the index folder, sometimes it doesn't.

Here is more detail about the blocks folder.

175 blk*.dat files, each 128 MB
an index folder with ~5500 files of ~2.1 MB each
175 rev*.dat files, ranging from 4 MB to 12 MB

<snip>
-rw------- 1 davew davew 128M Jul 19 02:56 blk00167.dat
-rw------- 1 davew davew 128M Jul 27 01:53 blk00168.dat
-rw------- 1 davew davew 128M Aug  4 06:36 blk00169.dat
-rw------- 1 davew davew 128M Aug 12 13:01 blk00170.dat
-rw------- 1 davew davew 128M Aug 20 16:40 blk00171.dat
-rw------- 1 davew davew 128M Aug 28 20:28 blk00172.dat
-rw------- 1 davew davew 128M Sep  5 21:38 blk00173.dat
-rw------- 1 davew davew 128M Sep 13 23:51 blk00174.dat
-rw------- 1 davew davew  48M Sep 16 21:46 blk00175.dat
drwx------ 2 davew davew 188K Sep 16 21:51 index
-rw------- 1 davew davew 6.3M May 24  2022 rev00000.dat
-rw------- 1 davew davew 5.6M May 24  2022 rev00001.dat
-rw------- 1 davew davew 7.2M May 24  2022 rev00002.dat
-rw------- 1 davew davew  11M May 24  2022 rev00003.dat
-rw------- 1 davew davew 7.5M May 24  2022 rev00004.dat
-rw------- 1 davew davew 7.9M May 24  2022 rev00005.dat
-rw------- 1 davew davew 9.0M May 24  2022 rev00006.dat
-rw------- 1 davew davew 8.8M May 24  2022 rev00007.dat
<snip>

@mafintosh
Owner

If you run this (using the base/entries options from above):

const stream = tar.pack(base, { entries: folders })

let n = 0
stream.on('data', function (data) {
  n += data.byteLength
})
stream.on('end', function () {
  console.log(n)
})

Does it give a different result every time?

@MorningLightMountain713
Contributor Author

I get the same result each time.

davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472

However, something important that I forgot to mention is that there is a service running while the tar is in progress, which can change the size of a few of the files (they have the latest timestamps in the leveldb).

This is what it looks like while the service is running; it's still picking up the index folder:

davew@banana:~/zelflux$ node test_stream.js
44018606592
davew@banana:~/zelflux$ node test_stream.js
44018614784
davew@banana:~/zelflux$ node test_stream.js
44027934720
davew@banana:~/zelflux$ node test_stream.js
44027935744

@mafintosh
Owner

Ohhh, that's probably the issue. You are probably corrupting the tar as you are producing it, because the header is written first in tar, so if the file shrinks after the stat, that creates issues.
