Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'auto' encoded longs + compression serializer #6045

Merged
merged 3 commits into from
Jul 31, 2018

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Jul 26, 2018

Fixes #6044 by changing BlockLayoutColumnarLongsSerializer block flushing to call LongEncodingWriter.setBuffer which will create new serializers.

To help catch further issues like this, the only thing I could think of that would make me feel reasonably confident about stuff was to add a rather exhaustive test that tries a variety of 'bits per value', number of rows, and compression strategies, but it unfortunately adds a minute or two to overall test time.

Running io.druid.segment.data.CompressedLongsAutoEncodingSerdeTest
Tests run: 104, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 58.116 sec - in io.druid.segment.data.CompressedLongsAutoEncodingSerdeTest

It isn't even completely exhaustive, as the behavior of IntermediateColumnarLongsSerializer isn't directly controllable, but both table and delta encodings are exercised. I'm considering marking it @Ignore.

Fixes apache#6044

changes:
* Fixes `VSizeLongSerde` serializers to treat 'close' as 'flush' when used with `BlockLayoutColumnarLongsSerializer`, allowing unwritten values to be flushed to the buffer when the block is compressed
* Add exhaustive unit test that flexes a variety of value sizes, row counts, and compression strategies to catch issues such as these
:
@gianm
Copy link
Contributor

gianm commented Jul 26, 2018

I don't think the test you wrote should be @Ignore -- it is valuable to have some exhaustive tests that we run before releases. Does JUnit have a way to tag tests as "slow" and then only run them either for releases, or when you specifically ask for them?

@jihoonson
Copy link
Contributor

I think it's fine to enable by default if it just takes one or two minutes since this kind of bugs is difficult to catch without various testing.

@jihoonson
Copy link
Contributor

@clintropolis what do you think about adding flush() to LongSerializer?

@clintropolis
Copy link
Member Author

@jihoonson I considered that but it seemed more invasive of a change and I'm trying to deprecate this code path anyway..

That said, it is strange that close is called multiple times, so I'll just go ahead and refactor.

@clintropolis
Copy link
Member Author

clintropolis commented Jul 27, 2018

@jihoonson What if I just rename LongSerializer.close to LongSerializer.flush and if it is tied to an output stream it can only be flushed once, leaving the code structurally as is?

I think alternative involves adding a close method to LongEncodingWriter and a flush method to LongSerializer and changing the hooks into LongEncodingWriter from EntireLayoutColumnarLongs to call close instead of flush

@jihoonson
Copy link
Contributor

@clintropolis I'm fine with it.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest change looks good to me.

@jihoonson jihoonson merged commit 20ae8aa into apache:master Jul 31, 2018
@clintropolis clintropolis deleted the block-longs-auto-fix branch August 6, 2018 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants