
ARROW-5917: [Java] Redesign the dictionary encoder #4994

Closed
wants to merge 1 commit

Conversation

liyafan82
Contributor

The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice:

  • There are repeated conversions between Java objects and bytes (e.g. vector.getObject).
  • Unnecessary memory copies (the vector data must be copied into the hash table).
  • The hash table cannot be reused for encoding multiple vectors (other data structures & results cannot be reused either).
  • The output vector should not be created/managed by the encoder (just like in the out-of-place sorter).
  • The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module and gradually deprecate the current one.

@codecov-io

codecov-io commented Aug 2, 2019

Codecov Report

Merging #4994 into master will increase coverage by 1.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #4994      +/-   ##
==========================================
+ Coverage   88.72%   89.75%   +1.02%     
==========================================
  Files         945      696     -249     
  Lines      124147   105233   -18914     
  Branches     1437        0    -1437     
==========================================
- Hits       110146    94447   -15699     
+ Misses      13639    10786    -2853     
+ Partials      362        0     -362
Impacted Files
r/src/recordbatch.cpp
r/R/Table.R
go/arrow/math/uint64_amd64.go
r/src/symbols.cpp
r/R/Column.R
go/arrow/ipc/file_reader.go
js/src/builder/index.ts
r/src/arrow_types.h
js/src/enum.ts
r/src/array_from_vector.cpp
... and 241 more


@emkornfield
Contributor

@liyafan82 I think this could be useful in practice, but most of the points you made I don't think are true any longer (using getObject; the hash table only stores indices and hash codes) or are an easy fix to the existing encoder by migrating from static methods to an object wrapper (and we could also pass in a custom hasher).

Would you mind writing up your intended migration plans and sending them to the ML? I think this is worth a discussion because it potentially affects the code organization strategy going forward.

Also, I believe the additional encoder has a few more unit tests; it would be nice to make use of those for this class.

@liyafan82
Contributor Author

liyafan82 commented Aug 9, 2019

@liyafan82 I think this could be useful in practice, but most of the points you made I don't think are true any longer (using getObject; the hash table only stores indices and hash codes) or are an easy fix to the existing encoder by migrating from static methods to an object wrapper (and we could also pass in a custom hasher).

Would you mind writing up your intended migration plans and sending them to the ML? I think this is worth a discussion because it potentially affects the code organization strategy going forward.

Also, I believe the additional encoder has a few more unit tests; it would be nice to make use of those for this class.

@emkornfield
Thanks for the comments. I will use the tests for this class.

IMO, the new design of the dictionary encoder includes at least two types of encoders:

  1. a search based encoder, which encodes each element in O(log n) time but requires no extra memory (see the sketch below);
  2. a hash table based encoder, which encodes each element in O(1) time but requires extra memory.

I think each of the above encoders has its own merit.
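For illustration, here is a minimal sketch of how the search based encoder's inner loop could work. The class name and method signature below are my own assumptions for this sketch (only the VectorSearcher.binarySearch call mirrors the code in this PR), not the final API:

import org.apache.arrow.algorithm.search.VectorSearcher;
import org.apache.arrow.algorithm.sort.VectorValueComparator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.ValueVector;

/** Sketch only: encode each input element by binary searching a sorted dictionary. */
class SearchBasedEncodeSketch {

  /** Writes, for each input element, its index in the sorted dictionary to the output vector. */
  static <D extends ValueVector> void encode(
      D input, IntVector output, D dictionary, VectorValueComparator<D> comparator) {
    for (int i = 0; i < input.getValueCount(); i++) {
      // O(log m) per element, where m is the dictionary length; no extra data structure needed.
      int index = VectorSearcher.binarySearch(dictionary, comparator, input, i);
      if (index == -1) {
        throw new IllegalArgumentException("Element not found in the dictionary: " + i);
      }
      output.setSafe(i, index);
    }
    output.setValueCount(input.getValueCount());
  }
}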

In this PR, however, I only implement the search based encoder, because:

  1. small changes are easier to review
  2. the hash table based encoder was blocked by ARROW-6030.

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

What do you think?

@tianchen92
Contributor

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@liyafan82
Contributor Author

liyafan82 commented Aug 9, 2019

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@tianchen92 Maybe it is interesting to examine the above-mentioned problems now.

It seems problems 3) and 4) still exist.

To see problem 3), note that since the encoder is based on a static method, for each vector to encode, we must create a new DictionaryHashTable and populate it with the dictionary vector. This is redundant work.

In addition, there is another problem with this encoder: the hash algorithm is fixed and cannot be adjusted to the needs of specific scenarios. This may lead to a substantial performance penalty, since the dictionary data distribution has a great influence on the performance of the hash table.

@tianchen92
Contributor

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@tianchen92 Maybe it is interesting to examine the above-mentioned problems now.

It seems problems 3) and 4) still exist.

To see problem 3), note that since the encoder is based on a static method, for each vector to encode, we must create a new DictionaryHashTable and populate it with the dictionary vector. This is redundant work.

In addition, there is another problem with this encoder: the hash algorithm is fixed and cannot be adjusted to the needs of specific scenarios. This may lead to a substantial performance penalty, since the dictionary data distribution has a great influence on the performance of the hash table.

Thanks for the clarification; you are right on the two points above.

@emkornfield
Contributor

My point for issues 3, 4 and 5 is that I think they are easily addressable by migrating the static method to an object based approach (and changing the APIs slightly).

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

I think a separate PR is a good idea. The state I'm trying to avoid is having duplicated/very similar functionality in two packages without clear guidance on which one to use. If we do decide to migrate away from the existing encoder embedded in the vector module, there should at least be some notice/discussion on the ML.
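For context, here is a minimal sketch of what such an object based wrapper could look like, so that the dictionary lookup structure is built once and reused across many input vectors. This is a hypothetical illustration (it still goes through getObject and ignores nulls and custom hashing), not the existing Arrow API:

import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.util.Text;

/** Hypothetical sketch: build the dictionary lookup table once and reuse it for every encode call. */
class ReusableVarCharDictionaryEncoder {

  private final Map<Text, Integer> valueToIndex = new HashMap<>();

  ReusableVarCharDictionaryEncoder(VarCharVector dictionary) {
    // Populate the lookup table once, instead of rebuilding it on every static encode() call.
    for (int i = 0; i < dictionary.getValueCount(); i++) {
      if (!dictionary.isNull(i)) {
        valueToIndex.put(dictionary.getObject(i), i);
      }
    }
  }

  void encode(VarCharVector input, IntVector output) {
    for (int i = 0; i < input.getValueCount(); i++) {
      if (!input.isNull(i)) {
        Integer index = valueToIndex.get(input.getObject(i));
        if (index == null) {
          throw new IllegalArgumentException("Value at position " + i + " is not in the dictionary");
        }
        output.setSafe(i, index);
      }
    }
    output.setValueCount(input.getValueCount());
  }
}

The actual migration along these lines was done separately in ARROW-6194 (see the commit referenced later in this conversation).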

/**
* A flag indicating if null should be encoded.
*/
protected final boolean encodeNull;
Contributor

private?

Contributor Author

resolved. thanks.

* Constructs a dictionary encoder.
* @param dictionary the dictionary. It must be in sorted order.
* @param comparator the criteria for sorting.
* @param encodeNull if null should be encoded.
Contributor

This needs more clarification: what happens if it is true, and what happens if it is false?

Contributor Author

Good suggestion. Revised. Thanks.


/**
* Encodes an input vector by binary search.
* So the algorithm takes O(nlogm) time, where n is the length of the input vector,
Contributor

Suggested change
* So the algorithm takes O(nlogm) time, where n is the length of the input vector,
* So the algorithm takes O(n*log(m)) time, where n is the length of the input vector,

Contributor Author

Revised. Thanks.

* @param input the input vector.
* @param output the output vector.
*/
public void decode(E input, D output) {
Contributor

Is there a reason to not keep this as a static method on the existing DictionaryEncoder class? If we want to have a separately allocated vector, that seems like an easy change.

Contributor Author
liyafan82 commented Aug 11, 2019

It seems the existing DictionaryEncoder class does not support the encodeNull flag. In addition, for that method, the output vector is used as the return value, not a parameter. Do you want me to add a new static method there?

Contributor Author

In response to this comment, I have extracted a static utility method for the decode functionality.

Contributor

I don't think we need the decode method at all here; it seems like we can rely on the one in vector. Per the above comment, I don't understand what the purpose of the encodeNull flag on decode is.

Contributor Author

The purpose of the encodeNull is discussed above.
You mean we can rely on the one in the static method?

public void decode(E input, D output) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
BitVectorHelper.setValidityBit(output.getValidityBuffer(), i, 0);
Contributor

It seems like this relies on the array being preallocated to the correct size; is that intended behavior?

Contributor Author

Nice catch. Thanks a lot. This is not intended.
I have revised the code by repeatedly calling the realloc method. Maybe this solution is not good enough; we may need to improve it in the future by changing the APIs.

*/
public void decode(E input, D output) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
Contributor

Should this be here? It seems like we always want to set null if the input happens to be null.

I wonder if it is quicker to copy the null buffer if there are any null values instead of doing this one by one.

Contributor Author
liyafan82 commented Aug 11, 2019

IMO, we do not always set null to the output if the input is null. We can treat null as a special element value, which takes 1 bit in the validity buffer and 0 bits in the data buffer. Since it takes so little space, we can use it to encode an extremely frequently used element to improve performance.

I agree that it is much faster to copy the validity buffer directly, if null encoding is disabled.

Thanks for the good suggestion.

@liyafan82
Contributor Author

My point for issues 3, 4 and 5 is that I think they are easily addressable by migrating the static method to an object based approach (and changing the APIs slightly).

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

I think a separate PR is a good idea. The state I'm trying to avoid is having duplicated/very similar functionality in two packages without clear guidance on which one to use. If we do decide to migrate away from the existing encoder embedded in the vector module, there should at least be some notice/discussion on the ML.

@emkornfield I am wondering how you can change the static method to an object based approach without significantly changing the API.

I have started a discussion in the ML. Please take a look.

emkornfield pushed a commit that referenced this pull request Aug 14, 2019
ARROW-6194: Make DictionaryEncoder non-static making it easy to extend and reuse

Related to [ARROW-6194](https://issues.apache.org/jira/browse/ARROW-6194).
As discussed in #4994.
The current static DictionaryEncoder has some limitations for extension and reuse.
Slightly change the APIs and migrate the static methods to an object based approach.

Closes #5055 from tianchen92/ARROW-6194 and squashes the following commits:

2354c48 <tianchen> move build table logic to constructor
7b76cba <tianchen> add test
513f417 <tianchen> add static method back
fb2d293 <tianchen> ARROW-6194:  Make DictionaryEncoder non-static making it easy to extend and reuse

Authored-by: tianchen <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
output.copyFromSafe(index, i, dictionary);
}

// make sure the output vector is large enough
Contributor

why is everything after the loop needed?

Contributor Author

This is in response to your previous comment: if the encodeNull flag is disabled, the validity buffers of the input/output vectors are identical, so it is faster to copy the validity buffer entirely.
In addition, we must make sure the validity buffer is large enough, so we have the reallocation calls.

Contributor

Isn't output.copyFromSafe enough to ensure the validity buffer is big enough? Won't skipping over the indices above implicitly encode them as null?

Contributor Author

For example, in the extreme case where the input vector is entirely null, copyFromSafe may never be called, so the output vector's capacity may not be large enough to hold the validity bits.

This code should be placed inside the if statement.

public static <E extends BaseIntVector, D extends ValueVector> void decode(
E input, D output, D dictionary, boolean encodeNull) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
Contributor

I still don't think I understand why encodeNull has any part in this logic. If the input vector has a null value, we should honor that.

Contributor Author
liyafan82 commented Aug 15, 2019

Let's have some analysis of the encodeNull flag:

  1. When encodeNull is disabled:

There is no memory saving, since we are using an int vector, for which a null value also takes space.
The output element is set to null directly, so there is no overhead of searching for the key in the dictionary.

  2. When encodeNull is enabled:

We need to search for the null element in the dictionary.
For each iteration of the encode/decode loop, there is no branch statement and no need to check the validity bit. This leads to a performance gain.

What do you think?

For cases where null values are rare (or there are no nulls at all), the performance gain from enabling encodeNull can be significant, because otherwise we need to do the check in each encode/decode iteration.

That being said, it seems weird to ever have to encode a null. So if you insist, we can remove it.

However, for other types of encoding, like Huffman encoding/arithmetic encoding, encoding null can be really beneficial.
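To make the trade-off concrete, here is a minimal sketch of the two loop shapes being discussed. The code is hypothetical (the Lookup interface stands in for the binary search or hash table lookup), not the implementation in this PR:

import org.apache.arrow.vector.BitVectorHelper;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.ValueVector;

class EncodeNullLoopSketch {

  /** Hypothetical lookup abstraction (binary search or hash table underneath). */
  interface Lookup {
    int indexOf(ValueVector vector, int index);
  }

  /** encodeNull disabled: check validity in every iteration and emit a null directly. */
  static void encodeSkippingNulls(ValueVector input, IntVector output, Lookup lookup) {
    for (int i = 0; i < input.getValueCount(); i++) {
      if (input.isNull(i)) {
        // The null goes straight to the output; no dictionary lookup is performed.
        // Assumes the output vector has been allocated large enough (see the discussion above).
        BitVectorHelper.setValidityBit(output.getValidityBuffer(), i, 0);
      } else {
        output.setSafe(i, lookup.indexOf(input, i));
      }
    }
    output.setValueCount(input.getValueCount());
  }

  /** encodeNull enabled: no validity check, nulls are looked up in the dictionary like any other value. */
  static void encodeIncludingNulls(ValueVector input, IntVector output, Lookup lookup) {
    for (int i = 0; i < input.getValueCount(); i++) {
      output.setSafe(i, lookup.indexOf(input, i));
    }
    output.setValueCount(input.getValueCount());
  }
}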

Contributor

My issue is that what encodeNull provides should be intrinsic to the vector (whether any nulls are present). So instead there should be a call to getNullCount, which right now isn't O(1), so it wouldn't be great performance-wise, but it's the best we have. I think it is hard to guarantee that the indices will all be non-null even though you are passing in the flag.

Contributor Author

I see your point. Thanks for the great suggestion!

Contributor Author

Sorry, please hold. My current code is not correct.

Whether we should encode a null should not depend on whether there is a null in the dictionary.
An encodeNull flag should be enforced from outside.

To see this, note that if there is a null in the input and there is no null in the dictionary,

  1. if encodeNull = true, we should throw an exception (the null is unexpected)
  2. if encodeNull = false, we should simply set the output element to null.

I will change the code accordingly when I have time.
Thanks a lot.

Contributor

Suppose we do not have the external flag. When the encoder encounters a null, it has only one choice: simply output a null.
However, this behavior can be incorrect: if the encoder has encodeNull set, it should never see a null in the input, so the correct behavior should be to throw an exception.

My argument is that for a general-purpose library we shouldn't be picky at this level. If you have a system invariant built on top of this library, the place to do the validation is at a higher level.

Contributor Author

@emkornfield I see your points and agree with them. Thanks for the clarification.

It seems the above discussion is getting more complicated than necessary, so I want to give a brief summary and clear things up.

The encodeNull flag simply determines the encoding/decoding behavior when a null is encountered: should the encoder simply output a null, or search for the null in the dictionary?

The benefits of turning on the encodeNull flag:

  1. No need to check the validity bit in each iteration.
  2. Removes the if statement in the loop.

The shortcoming of turning on the encodeNull flag:

  1. It searches for the null value in the dictionary, which may lead to a performance overhead.

So what is your suggestion? Should we keep the flag, or remove it?

Contributor

My suggestion is to not include decoding in this code at all, but to centralize it in one place.

Also, as it stands now, it still doesn't seem like the code in the vector package is on the path to deprecation, since @tianchen92 is continuing to put functionality there?

Contributor Author
liyafan82 commented Sep 7, 2019

My suggestion is to not include decoding in this code at all, but to centralize it in one place.

Sounds reasonable. I will revise the code accordingly later.

Also, as it stands now, it still doesn't seem like the code in the vector package is on the path to deprecation, since @tianchen92 is continuing to put functionality there?

Sure. No problem. We will do it only when the timing is right.

Contributor Author
liyafan82 commented Sep 9, 2019

@emkornfield
I have removed the decode API from the encoder class, and the EncodeUtil class is also removed.
In the tests, the decode functionality is based on the DictionaryEncoder#decode static method in the vector module.
Please take a look.

@liyafan82 liyafan82 force-pushed the fly_0712_encode branch 2 times, most recently from 6356468 to e4a5295 on August 16, 2019, 11:53
@liyafan82 liyafan82 closed this Aug 16, 2019
@liyafan82 liyafan82 reopened this Aug 16, 2019
// vectors are identical, so we can improve performance by copying
// the buffer directly.
if (!encodeNull) {
// make sure the output vector is large enough
Contributor

Does this replicate code in setValueCount?

Contributor Author

Good point. Thank you.
setValueCount will do the necessary reallocation.

output.reAlloc();
}

PlatformDependent.copyMemory(input.getValidityBuffer().memoryAddress(),
Contributor

This would already be set in the loop above though, or in setValueCount if all values were null?

Contributor Author

If the output vector is a fresh vector, all its validity bits are clear by default.
If it is reused from a previous run, its validity bits are in an undefined state.

To make it safe, should we copy the validity buffer from input to output?

Contributor

I don't think that is right; we should either require that the input vector be in a new state or call allocateNew on it before starting the decode process (see the docs on ValueVector).

Contributor Author

Thanks for the good points. I agree with your points in general.
Each vector has its own life cycle, and by respecting the life cycle, the application code becomes easier to maintain and reason about.

I have revised the code, and made the requirements explicit in the javadoc.
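As a rough usage sketch of that lifecycle requirement (the decoder call itself is elided; the allocation pattern is the point, and the vector names are my own):

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

class LifecycleSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         IntVector indices = new IntVector("indices", allocator);
         VarCharVector decoded = new VarCharVector("decoded", allocator)) {

      // ... populate the indices vector and build the dictionary here ...

      // The output vector must start from a clean state: either a freshly created
      // vector, or one reset via allocateNew(), so no stale validity bits survive.
      decoded.allocateNew();

      // decoder.decode(indices, decoded);  // hypothetical call into the code under review
      decoded.setValueCount(indices.getValueCount());
    }
  }
}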

@liyafan82 liyafan82 force-pushed the fly_0712_encode branch 3 times, most recently from 7648fdc to c9cb12f on August 20, 2019, 05:45
}

@Test
public void testEncodeLargeVector() {
Contributor

what is this test testing?

Contributor Author

It tests the scenario where the vector to encode is large. In this test, the vector has a length of 10000.
This test was mainly copied from a test in the vector module, per one of your previous suggestions.
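For reference, here is a condensed, JUnit 4 style sketch of such a test. The SearchDictionaryEncoder constructor follows the javadoc in this PR, while the encode(input, output) call, the comparator helper, and the package layout are my assumptions:

import java.nio.charset.StandardCharsets;

import org.apache.arrow.algorithm.dictionary.SearchDictionaryEncoder;  // package assumed
import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.junit.Assert;
import org.junit.Test;

public class LargeVectorEncodeSketchTest {

  private static final int VECTOR_LENGTH = 10000;
  private static final int DICTIONARY_LENGTH = 10;

  @Test
  public void testEncodeLargeVector() {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector dictionary = new VarCharVector("dictionary", allocator);
         VarCharVector input = new VarCharVector("input", allocator);
         IntVector encoded = new IntVector("encoded", allocator)) {

      // The dictionary must be in sorted order for the search based encoder ("0" .. "9").
      dictionary.allocateNew();
      for (int i = 0; i < DICTIONARY_LENGTH; i++) {
        dictionary.setSafe(i, String.valueOf(i).getBytes(StandardCharsets.UTF_8));
      }
      dictionary.setValueCount(DICTIONARY_LENGTH);

      // A large input vector whose values all come from the dictionary.
      input.allocateNew();
      for (int i = 0; i < VECTOR_LENGTH; i++) {
        input.setSafe(i, String.valueOf(i % DICTIONARY_LENGTH).getBytes(StandardCharsets.UTF_8));
      }
      input.setValueCount(VECTOR_LENGTH);

      encoded.allocateNew(VECTOR_LENGTH);
      SearchDictionaryEncoder<IntVector, VarCharVector> encoder =
          new SearchDictionaryEncoder<>(
              dictionary, DefaultVectorComparators.createDefaultComparator(dictionary), false);
      encoder.encode(input, encoded);
      encoded.setValueCount(VECTOR_LENGTH);

      for (int i = 0; i < VECTOR_LENGTH; i++) {
        Assert.assertEquals(i % DICTIONARY_LENGTH, encoded.get(i));
      }
    }
  }
}

The assertions in the real test may differ; this sketch is only meant to convey the shape of the scenario.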

}
}

@Test(expected = IllegalStateException.class)
Contributor

use assertThrows instead of this annotation.

Contributor

Also, I think this should be IllegalArgumentException.

Contributor Author

use assertThrows instead of this annotation.

Revised. Thanks.

Contributor Author

Also, I think this should be IllegalArgumentException.

Sounds good. Thanks.


// the encoder should encode null, but no null in the dictionary
// an exception should be thrown.
SearchDictionaryEncoder<IntVector, VarCharVector> encoder =
Contributor

I don't think it's a problem that the dictionary doesn't have a null in it unless the vector getting encoded does have a null in it.

Contributor Author

Sounds reasonable.
The related logic is removed from the constructor, and if there is a null in the vector to encode, an exception will be thrown when encoding.


int index = VectorSearcher.binarySearch(dictionary, comparator, input, i);
if (index == -1) {
throw new IllegalStateException("The data element is not found in the dictionary: " + i);
Contributor

IllegalArgument is potentially better here.

Contributor Author

Sure. Revised.

Contributor
emkornfield left a comment

Thank you for the changes. I left a few more comments, and then I think this can be merged.

@emkornfield
Contributor

+1, thank you.
