Infinite loop using stream_unzip() #82
Hi @DwayneDriskill, Thanks for the report... I think we would like to sort it, but suspect it'll be tricky without a reproducible case. Is there any chance you can create a file with data that can be shared that shows the issue? (Also, suspect the issue is most likely to be in https://github.com/michalc/stream-inflate, since it handles the Deflate64 aspect. Maybe it could be changed to detect if it gets stuck in an infinite loop? Not sure...) Thanks, Michal |
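As a hypothetical sketch of the kind of detection mentioned above (this wrapper is not part of stream-inflate or stream-unzip), one option is to wrap the decompressor's output generator and raise if it yields a long run of empty chunks, i.e. stops making visible progress:

    def guard_progress(chunks, max_empty_run=100_000):
        # Yield chunks unchanged, but raise if max_empty_run empty chunks
        # arrive in a row - a crude "no visible progress" signal
        empty_run = 0
        for chunk in chunks:
            empty_run = 0 if chunk else empty_run + 1
            if empty_run >= max_empty_run:
                raise RuntimeError('Decompressor appears to have stopped making progress')
            yield chunk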
Two other things...
Does the input iterable definitely not progress during this? (Wondering if it's extremely poor performance, rather than an infinite loop.)
And this code I realise is unexpected:

    with open(file_path, 'rb') as f:
        unzipper = stream_unzip(f)

This works (er, other than the issue you're encountering), because iterating over a file object opened in binary mode yields b'\n'-terminated lines, so each input chunk's size depends on where newline bytes happen to fall in the binary data. Instead, can you try this and see if you get the same behaviour?

    with open(file_path, 'rb') as f:
        unzipper = stream_unzip(iter(lambda: f.read(65536), b''))

It'll make each input chunk 64k, ignoring where any newline bytes happen to be. |
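For reference, the suggested pattern in a fuller, self-contained sketch (the archive name is a placeholder): stream_unzip yields a name, size and chunk generator per member, and each member's chunks must be consumed in full:

    from stream_unzip import stream_unzip

    with open('archive.zip', 'rb') as f:
        zipped_chunks = iter(lambda: f.read(65536), b'')  # fixed 64 KiB chunks
        for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
            for chunk in unzipped_chunks:  # consume each member fully
                pass  # ...write chunk to disk, hash it, etc.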
I am starting to suspect extremely poor performance when unzipping deflate64 - taking approximately 1000 times the amount of time in some cases... This is from #83, which (just for investigative purposes) uses https://github.com/michalc/stream-inflate for everything, whereas in the published stream-unzip it is only used for deflate64 files. The test_infozip_zip64_with_descriptors test specifically takes much, much longer |
Hi Michal, You were right that it seems to be very poor performance. I made the change you suggested and let it run overnight: it finished in about 8.5 hours. When I unzip it with the Windows 11 built-in tool, it takes about a minute. Dwayne |
Ah good to know... in a way at least! Now to solve the performance problem... |
Hi @DwayneDriskill / @ddriskillcssi, To give an update on this, I have made some performance improvements to https://github.com/michalc/stream-inflate. From some rough tests, v0.0.18 increases the speed of unzipping deflate64 files by about 50% compared to v0.0.15. I know this is still pretty slow, especially for larger files. Might need some sort of more drastic re-architecture... or maybe throwing Cython at it... Michal |
And some more speed improvements in v0.0.19. Actually, for "easy" deflate64 files it increases speed by a factor of 50 or so compared to v0.0.15. (Easy files would be roughly those with a lot of repetition close together in the file) |
A few more speed improvements done in stream-inflate (which is now up to 0.0.23) |
Some more speed improvements, now mostly powered by compiling stream-inflate with Cython. stream-inflate now up to version 0.0.29 |
So with the latest version of stream-inflate, v0.0.34, the 6GB UK Tariff JSON file compressed using Deflate64 takes about 15 minutes to uncompress on my machine using stream-unzip, hovering at a rate of around 5MB/s and sometimes going to 15MB/s or a bit higher. So probably fast enough to not think it's stuck in an infinite loop, but still way slower than regular Deflate, which goes beyond 1GB/s on my machine |
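As a rough sketch (not code from the thread) of how throughput figures like these can be measured while streaming:

    import time

    from stream_unzip import stream_unzip

    def print_unzip_rates(zipped_chunks):
        # Print each member's name and its average decompression rate
        for name, size, chunks in stream_unzip(zipped_chunks):
            done = 0
            start = time.monotonic()
            for chunk in chunks:
                done += len(chunk)
            elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
            print(name, f'{done / elapsed / 1_000_000:.1f} MB/s')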
Hi Michal,
Sorry for the late reply. I'm guessing and hoping you're doing this work because you need it for yourself. I coded a workaround by unzipping the file using 7-Zip and rezipping it using Deflate instead of Deflate64. This solved my problem and performs well, so I'm no longer using stream-inflate.
Thanks,
Dwayne
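For anyone wanting the same workaround, a minimal sketch driving 7-Zip's command line from Python (archive and directory names are placeholders; assumes 7z is on the PATH):

    import subprocess

    # Extract the Deflate64 archive, then re-create it using standard
    # Deflate, which stream-unzip decompresses at full speed
    subprocess.run(['7z', 'x', 'archive.zip', '-oextracted'], check=True)
    subprocess.run(['7z', 'a', '-tzip', '-mm=Deflate', 'rezipped.zip', 'extracted'], check=True)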
|
@DwayneDriskill Ah good to know! I'm going to keep this open because I suspect there are still lots of speed improvements to be made, but not expecting replies or anything really - will close when I judge done enough / not much more can be done. I don't exactly need this myself at the moment, but it's more to make stream-unzip be able to open "any" (within reason...) ZIP file. And maybe a bit of an exercise in optimising Python code. |
Hi,
I am using stream-unzip 0.0.91 to unzip a file that was zipped using the Deflate64 algorithm. The file was zipped using Windows 11's built-in zipper/unzipper, and I am able to unzip it with the same tool. The files look fine after being unzipped, so I don't think they are corrupt. Unfortunately, I cannot share the file because it contains data that I am not allowed to share.
I can tell you that the zipped file contains 7 .json files. 6 of the files unzip without any issues, but stream_unzip() gets stuck in an infinite loop when trying to unzip the 7th file, which is over 2GB. I confirmed this with logging, and by letting the program run for several hours to confirm that the size of the unzipped file was not increasing and that stream_unzip() was iterating over the same code.
I stepped into the code for stream_unzip(), but I don't have time to understand it well enough to pinpoint the issue; it uses yield in several places, and I have never used yield myself. My code is below. Perhaps I'm making a mistake in how I'm using stream_unzip().
    import io
    import os
    from stream_unzip import stream_unzip  # imports at module level

    def _unzip_file_using_stream_unzip(self, file_path, temp_dir):
        with open(file_path, 'rb') as f:
            # To search for duplicate file names. This is specific to my use case:
            # theoretically, a zip file can have 2 files with the same name in
            # different folders, and my code doesn't handle that scenario.
            seen_files = set()
            unzipper = stream_unzip(f)
            for file_name, file_size, file_content_gen in unzipper:
                file_name = file_name.decode('utf-8')
                # A trailing '/' identifies a folder. I borrowed this from a function
                # I wrote to unzip files using the Python standard library (I'm on
                # Python 3.11). I'm not sure if stream_unzip() works the same way,
                # but the code is not getting stuck here.
                if not file_name.endswith('/'):
                    if os.path.basename(file_name.lower()) in seen_files:
                        raise RuntimeError("Error: There are multiple files named '" +
                                           f"{os.path.basename(file_name)}' in {file_path}.")
                    seen_files.add(os.path.basename(file_name.lower()))
                    extracted_file_path = os.path.join(temp_dir, os.path.basename(file_name))
                    with open(extracted_file_path, 'wb') as out_file:
                        buffered_writer = io.BufferedWriter(out_file, buffer_size=65536)
                        # I proved with logging that when PROD_04182024_EXPORTS.zip is
                        # passed in, the following loop never finishes for the 7th file
                        # (the other 6 files are successfully unzipped). The value of
                        # chunk changes during each iteration; I did not confirm, but
                        # assume, that at some point chunk gets set back to the value
                        # it had on first entering the loop.
                        for chunk in file_content_gen:
                            buffered_writer.write(chunk)
                        # Ensure all data is written to disk. This might not be necessary;
                        # I added it after identifying the infinite loop, hoping it would fix it.
                        buffered_writer.flush()
                    self._logger.info("Finished unzipping %s", file_name)  # Executes for the first 6 files, but not the 7th.
The size of the file that contains the unzipped data got stuck at 57,222 KB, so I'm assuming the loop is infinite. The file unzips using Windows 11 in less than 5 minutes; I let the program run for at least 2 hours before I stopped it. I also tried running the code in a Docker container running Linux, and in an AWS Batch process (also running Linux), with the same results.
I paused the program in PyCharm and copied the call stack:
_paginate, stream_inflate.py:315
_run, stream_inflate.py:322
_decompress, stream_unzip.py:146
decrypt_none_decompress, stream_unzip.py:281 (279)
_iter, stream_unzip.py:295
checked_from_local_header, stream_unzip.py:304
_unzip_file_using_stream_unzip, aws_utils.py:474
unzip_file, aws_utils.py:432
load_new_files, aws_utils.py:279
I noticed that _run() was also calling other methods at different points such as _get_bit(), inflate(get_bits, get_bytes, yield_bytes), and get_bits(num_bits).
I noticed that some exceptions were being thrown and caught:
    def up_to_page_size(num):
        nonlocal chunk, offset
Another exception:
    def get_byte_readers(iterable):
        # Return functions to return/"replace" bytes from/to the iterable
        # - _yield_all: yields chunks as they come up (often for a "body")
        # - _get_num: returns a single bytes of a given length
        # - _return_num_unused: puts a number of "unused" bytes "back", to be retrieved by a yield/get call
        # - _return_bytes_unused: return a bytes instance "back" into the stream, to be retrieved later
        # - _get_offset_from_start: get the zero-indexed offset from the start of the stream
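For readers unfamiliar with this pattern, here is a simplified sketch (not stream-unzip's actual implementation) of two of the readers described above: pull bytes from the iterable on demand, and allow "unused" bytes to be pushed back:

    def get_simple_byte_readers(iterable):
        # Simplified: the real get_byte_readers also provides _yield_all
        # and _return_num_unused
        it = iter(iterable)
        buffered = b''  # bytes pushed back, or left over from the last chunk
        offset = 0      # zero-indexed offset from the start of the stream

        def _get_num(num):
            # Return exactly num bytes (raises StopIteration at end of stream)
            nonlocal buffered, offset
            while len(buffered) < num:
                buffered += next(it)
            result, buffered = buffered[:num], buffered[num:]
            offset += num
            return result

        def _return_bytes_unused(data):
            # Put bytes back so the next _get_num sees them first
            nonlocal buffered, offset
            buffered = data + buffered
            offset -= len(data)

        def _get_offset_from_start():
            return offset

        return _get_num, _return_bytes_unused, _get_offset_from_start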
This is not an urgent issue for me. But I thought you would want to know. I'm happy to help debug the issue if it's important enough for you to spend time on it.
Thanks,
Dwayne