-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug with untar where sometimes pax headers are not parsed properly #37
Comments
I analysed this further and it was as I suspected above. The bug occurs when the incoming flow of ByteStrings through the conduit are short, and we get lots of small Here is the "untar, pax interchange format" test that I played around with: Lines 313 to 318 in 96b18aa
I can make this test fail by "chopping" the conduit input bytes such that they flow in a single byte at a time: it "untar, pax interchange format" $ do
res <- runConduitRes $
paxExample
+ .| chop
.| untar process
.| sinkList
pure res `shouldReturn` [("filepath", "payload")] Here is the "chop" function: chop :: (MonadIO m) => ConduitT ByteString ByteString m ()
chop = do
next <- await
case next of
Nothing -> pure ()
Just val -> do
forM_ (splitEvery 1 (S.unpack val)) $ \chunk -> do
yield (S.pack chunk)
chop
splitEvery :: Int -> [a] -> [[a]]
splitEvery n = P.takeWhile (not . P.null) . P.map (P.take n) . P.iterate (P.drop n) With this modification, the test fails with:
I have a fix, I will post it here shortly. |
Here is the fix I came up with. -parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
-parsePax = await >>= \case
- Just (ChunkPayload _ b) -> pure $ paxParser b
- _ -> pure mempty
+parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
+parsePax = do
+ payloads <- readAllChunkPayloads []
+ let b = S.concat (reverse payloads)
+ pure $ paxParser b
+
+readAllChunkPayloads :: Monad m => [ByteString] -> ConduitT TarChunk o m [ByteString]
+readAllChunkPayloads accum = do
+ next <- await
+ case next of
+ Nothing -> pure accum
+ Just (ChunkPayload _ b) -> readAllChunkPayloads (b:accum)
+ Just other -> do
+ leftover other
+ pure accum Someone who is better at conduit than me can probably write this in a more elegant way that is more efficient. This fixes the original issue I had. It also fixes the test crash with the "chop" modification I described above in #37 (comment) (note that the test will still fail, but this time for a different reason -- a flaw in the test assertion can that be easily fixed). |
@bitc, I agree fully with your analysis and I have a pull request that reflects it and your solution: #39. My original mistake was not realising that pax extended header data could be provided in more than one sequential chunk. The only thing I changed was that it seemed to me that the accumulator could be of type |
Thank you! The fix in #39 looks perfect and I have tested it in my code and the problem is gone 👍 Thank you again for this feature, pax support in tar-conduit is very useful and valuable 💯 |
Fix #37 Handle multi-chunk pax extended header data
Hello, I am trying out the new pax support added in #34
Thank you, it is a great feature and very much appreciated.
I am using the
untar
function to extract a tar file that contains long pathnames (path=my-long-path
pax records).It works 99% well but occasionally fails to parse the pax header for some files and ends up ignoring the long pathname.
I haven't fully diagnosed this, but the problem seems to be in the
parsePax
function.tar-conduit/src/Data/Conduit/Tar.hs
Lines 496 to 499 in 96b18aa
This function reads chunks containing a ByteString payload, and then calls
paxParser
with this ByteString value.The problem is that sometimes this ByteString value is truncated, for example:
The "70" indicates the length and you can see it got truncated and there is also no trailing "\n" char.
Immediately afterwards my traces show that there is a
ChunkPayload
containing the rest of the header ("ncated.txt\n"
).When
paxParser
is called with the truncated header, the parse fails and the header is ignored, causing errors with the file.So this seems to be an error related to the chunking that is done in conduit, and happens on the rare occasion when a pax header happens to cross a ByteString chunk boundary. When I change the conduit input to be a network stream instead of a file, then I think the input chunking is a bit different, and I get errors in a different pax header entry.
The text was updated successfully, but these errors were encountered: