
ARROW-5917: [Java] Redesign the dictionary encoder #4994

Closed
wants to merge 1 commit

Conversation

liyafan82
Contributor

The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice:

  • There are repeated conversions between Java objects and bytes (e.g. vector.getObject).
  • Unnecessary memory copies (the vector data must be copied into the hash table).
  • The hash table cannot be reused for encoding multiple vectors (other data structures & results cannot be reused either).
  • The output vector should not be created/managed by the encoder (just like in the out-of-place sorter).
  • The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module and gradually deprecate the current one.

@codecov-io

codecov-io commented Aug 2, 2019

Codecov Report

Merging #4994 into master will increase coverage by 1.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #4994      +/-   ##
==========================================
+ Coverage   88.72%   89.75%   +1.02%     
==========================================
  Files         945      696     -249     
  Lines      124147   105233   -18914     
  Branches     1437        0    -1437     
==========================================
- Hits       110146    94447   -15699     
+ Misses      13639    10786    -2853     
+ Partials      362        0     -362
Impacted Files
r/src/recordbatch.cpp
r/R/Table.R
go/arrow/math/uint64_amd64.go
r/src/symbols.cpp
r/R/Column.R
go/arrow/ipc/file_reader.go
js/src/builder/index.ts
r/src/arrow_types.h
js/src/enum.ts
r/src/array_from_vector.cpp
... and 241 more


@emkornfield
Contributor

@liyafan82 I think this could be useful in practice, but most of the points you made I don't think are true any longer (using getObject; the hash table only stores indices and hash codes) or are an easy fix to the existing encoder by migrating from static methods to an object wrapper (and we could also pass in a custom hasher).

Would you mind writing up your intended migration plans and sending them to the ML? I think this is worth a discussion because it potentially affects the code organization strategy going forward.

Also, I believe the additional encoder has a few more unit tests; it would be nice to make use of those for this class.

@liyafan82
Contributor Author

liyafan82 commented Aug 9, 2019

@liyafan82 I think this could be useful in practice, but most of the points you made I don't think are true any longer (using getObject; the hash table only stores indices and hash codes) or are an easy fix to the existing encoder by migrating from static methods to an object wrapper (and we could also pass in a custom hasher).

Would you mind writing up your intended migration plans and sending them to the ML? I think this is worth a discussion because it potentially affects the code organization strategy going forward.

Also, I believe the additional encoder has a few more unit tests; it would be nice to make use of those for this class.

@emkornfield
Thanks for the comments. I will use the tests for this class.

IMO, the new design of the dictionary encoder includes at least two types of encoders:

  1. a search based encoder, which encodes each element in O(log n) time but requires no extra memory (see the sketch below);
  2. a hash table based encoder, which encodes each element in O(1) time but requires extra memory.

I think each of the above encoders has its own merit.
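For illustration, here is a minimal sketch of how the search based encoder's inner loop could work. The class name and method signature below are my own assumptions for this sketch (only the VectorSearcher.binarySearch call mirrors the code in this PR), not the final API:

import org.apache.arrow.algorithm.search.VectorSearcher;
import org.apache.arrow.algorithm.sort.VectorValueComparator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.ValueVector;

/** Sketch only: encode each input element by binary searching a sorted dictionary. */
class SearchBasedEncodeSketch {

  /** Writes, for each input element, its index in the sorted dictionary to the output vector. */
  static <D extends ValueVector> void encode(
      D input, IntVector output, D dictionary, VectorValueComparator<D> comparator) {
    for (int i = 0; i < input.getValueCount(); i++) {
      // O(log m) per element, where m is the dictionary length; no extra data structure needed.
      int index = VectorSearcher.binarySearch(dictionary, comparator, input, i);
      if (index == -1) {
        throw new IllegalArgumentException("Element not found in the dictionary: " + i);
      }
      output.setSafe(i, index);
    }
    output.setValueCount(input.getValueCount());
  }
}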

In this PR, however, I only implement the search based encoder, because:

  1. small changes are easier to review
  2. the hash table based encoder was blocked by ARROW-6030.

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

What do you think?

@tianchen92
Contributor

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@liyafan82
Contributor Author

liyafan82 commented Aug 9, 2019

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@tianchen92 Maybe it is interesting to examine the above-mentioned problems now.

It seems problems 3) and 4) still exist.

To see problem 3), note that since the encoder is based on a static method, for each vector to encode, we must create a new DictionaryHashTable and populate it with the dictionary vector. This is redundant work.

In addition, there is another problem with this encoder: the hash algorithm is fixed and cannot be adjusted to the needs of specific scenarios. This may lead to a substantial performance penalty, since the dictionary data distribution has a great influence on the performance of the hash table.

@tianchen92
Contributor

@liyafan82 I think what Micah means is that the current encoder implementation is based on a hash table, so the questions #1 #2 #3 #5 mentioned above are no longer problems :).

@tianchen92 Maybe it is interesting to examine the above-mentioned problems now.

It seems problems 3) and 4) still exist.

To see problem 3), note that since the encoder is based on a static method, for each vector to encode, we must create a new DictionaryHashTable and populate it with the dictionary vector. This is redundant work.

In addition, there is another problem with this encoder: the hash algorithm is fixed and cannot be adjusted to the needs of specific scenarios. This may lead to a substantial performance penalty, since the dictionary data distribution has a great influence on the performance of the hash table.

Thanks for the clarification; you are right on the two points above.

@emkornfield
Contributor

My point for issues 3, 4 and 5 is that I think they are easily addressable by migrating the static method to an object based approach (and changing the APIs slightly).

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

I think a separate PR is a good idea. The state I'm trying to avoid is having duplicated/very similar functionality in two packages without clear guidance on which one to use. If we do decide to migrate away from the existing encoder embedded in the vector module, there should at least be some notice/discussion on the ML.
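For context, here is a minimal sketch of what such an object based wrapper could look like, so that the dictionary lookup structure is built once and reused across many input vectors. This is a hypothetical illustration (it still goes through getObject and ignores nulls and custom hashing), not the existing Arrow API:

import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.util.Text;

/** Hypothetical sketch: build the dictionary lookup table once and reuse it for every encode call. */
class ReusableVarCharDictionaryEncoder {

  private final Map<Text, Integer> valueToIndex = new HashMap<>();

  ReusableVarCharDictionaryEncoder(VarCharVector dictionary) {
    // Populate the lookup table once, instead of rebuilding it on every static encode() call.
    for (int i = 0; i < dictionary.getValueCount(); i++) {
      if (!dictionary.isNull(i)) {
        valueToIndex.put(dictionary.getObject(i), i);
      }
    }
  }

  void encode(VarCharVector input, IntVector output) {
    for (int i = 0; i < input.getValueCount(); i++) {
      if (!input.isNull(i)) {
        Integer index = valueToIndex.get(input.getObject(i));
        if (index == null) {
          throw new IllegalArgumentException("Value at position " + i + " is not in the dictionary");
        }
        output.setSafe(i, index);
      }
    }
    output.setValueCount(input.getValueCount());
  }
}

The actual migration along these lines was done separately in ARROW-6194 (see the commit referenced later in this conversation).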

/**
* A flag indicating if null should be encoded.
*/
protected final boolean encodeNull;
Contributor

private?

Contributor Author

resolved. thanks.

* Constructs a dictionary encoder.
* @param dictionary the dictionary. It must be in sorted order.
* @param comparator the criteria for sorting.
* @param encodeNull if null should be encoded.
Contributor

This needs more clarification: what happens if it is true, and what happens if it is false?

Contributor Author

Good suggestion. Revised. Thanks.


/**
* Encodes an input vector by binary search.
* So the algorithm takes O(nlogm) time, where n is the length of the input vector,
Contributor

Suggested change
* So the algorithm takes O(nlogm) time, where n is the length of the input vector,
* So the algorithm takes O(n*log(m)) time, where n is the length of the input vector,

Contributor Author

Revised. Thanks.

* @param input the input vector.
* @param output the output vector.
*/
public void decode(E input, D output) {
Contributor

Is there a reason to not keep this as a static method on the existing DictionaryEncoder class? If we want to have a separately allocated vector, that seems like an easy change.

Contributor Author
liyafan82 commented Aug 11, 2019

It seems the existing DictionaryEncoder class does not support the encodeNull flag. In addition, for that method, the output vector is used as the return value, not a parameter. Do you want me to add a new static method there?

Contributor Author

In response to this comment, I have extracted a static utility method for the decode functionality.

Contributor

I don't think we need the decode method at all here; it seems like we can rely on the one in vector. Per the above comment, I don't understand what the purpose of the encodeNull flag on decode is.

Contributor Author

The purpose of the encodeNull is discussed above.
You mean we can rely on the one in the static method?

public void decode(E input, D output) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
BitVectorHelper.setValidityBit(output.getValidityBuffer(), i, 0);
Contributor

It seems like this relies on the array being preallocated to the correct size; is that intended behavior?

Contributor Author

Nice catch. Thanks a lot. This is not intended.
I have revised the code by repeatedly calling the realloc method. Maybe this solution is not good enough; we may need to improve it in the future by changing the APIs.

*/
public void decode(E input, D output) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
Contributor

Should this be here? It seems like we always want to set null if the input happens to be null.

I wonder if it is quicker to copy the null buffer if there are any null values instead of doing this one by one.

Contributor Author
liyafan82 commented Aug 11, 2019

IMO, we do not always set null to the output if the input is null. We can treat null as a special element value, which takes 1 bit in the validity buffer and 0 bits in the data buffer. Since it takes so little space, we can use it to encode an extremely frequently used element to improve performance.

I agree that it is much faster to copy the validity buffer directly, if null encoding is disabled.

Thanks for the good suggestion.

@liyafan82
Contributor Author

My point for issues 3, 4 and 5 is that I think they are easily addressable by migrating the static method to an object based approach (and changing the APIs slightly).

I will open a separate issue for the hash table based encoder.
When that is finished, we will have something to start with about the migration plan.

I think a separate PR is a good idea. The state I'm trying to avoid is having duplicated/very similar functionality in two packages without clear guidance on which one to use. If we do decide to migrate away from the existing encoder embedded in the vector module, there should at least be some notice/discussion on the ML.

@emkornfield I am wondering how you can change the static method to an object based approach without significantly changing the API.

I have started a discussion in the ML. Please take a look.

emkornfield pushed a commit that referenced this pull request Aug 14, 2019
ARROW-6194: Make DictionaryEncoder non-static making it easy to extend and reuse

Related to [ARROW-6194](https://issues.apache.org/jira/browse/ARROW-6194).
As discussed in #4994.
The current static DictionaryEncoder has some limitations for extension and reuse.
Slightly change the APIs and migrate the static methods to an object based approach.

Closes #5055 from tianchen92/ARROW-6194 and squashes the following commits:

2354c48 <tianchen> move build table logic to constructor
7b76cba <tianchen> add test
513f417 <tianchen> add static method back
fb2d293 <tianchen> ARROW-6194:  Make DictionaryEncoder non-static making it easy to extend and reuse

Authored-by: tianchen <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
output.copyFromSafe(index, i, dictionary);
}

// make sure the output vector is large enough
Contributor

why is everything after the loop needed?

Contributor Author

This is in response to your previous comment: if the encodeNull flag is disabled, the validity buffers of the input/output vectors are identical, so it is faster to copy the validity buffer entirely.
In addition, we must make sure the validity buffer is large enough, so we have the reallocation calls.

Contributor

Isn't output.copyFromSafe enough to ensure the validity buffer is big enough? Won't skipping over the indices above implicitly encode them as null?

Contributor Author

For example, in the extreme case where the input vector is entirely null, copyFromSafe may never be called, so the output vector's capacity may not be large enough to hold the validity bits.

This code should be placed inside the if statement.

public static <E extends BaseIntVector, D extends ValueVector> void decode(
E input, D output, D dictionary, boolean encodeNull) {
for (int i = 0; i < input.getValueCount(); i++) {
if (!encodeNull && input.isNull(i)) {
Contributor

I still don't think I understand why encodeNull has any part in this logic. If the input vector has a null value, we should honor that.

Contributor Author
liyafan82 commented Aug 15, 2019

Let's have some analysis of the encodeNull flag:

  1. When encodeNull is disabled:

There is no memory saving, since we are using an int vector, for which a null value also takes space.
The output element is set to null directly, so there is no overhead of searching for the key in the dictionary.

  2. When encodeNull is enabled:

We need to search for the null element in the dictionary.
For each iteration of the encode/decode loop, there is no branch statement and no need to check the validity bit. This leads to a performance gain.

What do you think?

For cases where null values are rare (or there are no nulls at all), the performance gain from enabling encodeNull can be significant, because otherwise we need to do the check in each encode/decode iteration.

That being said, it seems weird to ever have to encode a null. So if you insist, we can remove it.

However, for other types of encoding, like Huffman encoding/arithmetic encoding, encoding null can be really beneficial.
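To make the trade-off concrete, here is a minimal sketch of the two loop shapes being discussed. The code is hypothetical (the Lookup interface stands in for the binary search or hash table lookup), not the implementation in this PR:

import org.apache.arrow.vector.BitVectorHelper;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.ValueVector;

class EncodeNullLoopSketch {

  /** Hypothetical lookup abstraction (binary search or hash table underneath). */
  interface Lookup {
    int indexOf(ValueVector vector, int index);
  }

  /** encodeNull disabled: check validity in every iteration and emit a null directly. */
  static void encodeSkippingNulls(ValueVector input, IntVector output, Lookup lookup) {
    for (int i = 0; i < input.getValueCount(); i++) {
      if (input.isNull(i)) {
        // The null goes straight to the output; no dictionary lookup is performed.
        // Assumes the output vector has been allocated large enough (see the discussion above).
        BitVectorHelper.setValidityBit(output.getValidityBuffer(), i, 0);
      } else {
        output.setSafe(i, lookup.indexOf(input, i));
      }
    }
    output.setValueCount(input.getValueCount());
  }

  /** encodeNull enabled: no validity check, nulls are looked up in the dictionary like any other value. */
  static void encodeIncludingNulls(ValueVector input, IntVector output, Lookup lookup) {
    for (int i = 0; i < input.getValueCount(); i++) {
      output.setSafe(i, lookup.indexOf(input, i));
    }
    output.setValueCount(input.getValueCount());
  }
}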

Contributor

My issue is that what encodeNull provides should be intrinsic to the vector (whether any nulls are present). So instead there should be a call to getNullCount, which right now isn't O(1), so it wouldn't be great performance-wise, but it's the best we have. I think it is hard to guarantee that the indices will all be non-null even though you are passing in the flag.

Contributor Author

I see your point. Thanks for the great suggestion!

Contributor Author

Sorry, please hold. My current code is not correct.

Whether we should encode a null should not depend on whether there is a null in the dictionary.
An encodeNull flag should be enforced from outside.

To see this, note that if there is a null in the input and there is no null in the dictionary,

  1. if encodeNull = true, we should throw an exception (the null is unexpected)
  2. if encodeNull = false, we should simply set the output element to null.

I will change the code accordingly when I have time.
Thanks a lot.

Contributor

Suppose we do not have the external flag. When the encoder encounters a null, it has only one choice: simply output a null.
However, this behavior can be incorrect: if the encoder has encodeNull set, it should never see a null in the input, so the correct behavior should be to throw an exception.

My argument is that for a general-purpose library we shouldn't be picky at this level. If you have a system invariant built on top of this library, the place to do the validation is at a higher level.

Contributor Author

@emkornfield I see your points and agree with them. Thanks for the clarification.

It seems the above discussion is getting more complicated than necessary, so I want to give a brief summary and clear things up.

The encodeNull flag simply determines the encoding/decoding behavior when a null is encountered: should the encoder simply output a null, or search for the null in the dictionary?

The benefits of turning on the encodeNull flag:

  1. No need to check the validity bit in each iteration.
  2. Removes the if statement in the loop.

The shortcoming of turning on the encodeNull flag:

  1. It searches for the null value in the dictionary, which may lead to a performance overhead.

So what is your suggestion? Should we keep the flag, or remove it?

Contributor

My suggestion is to not include decoding in this code at all, but to centralize it in one place.

Also, as it stands now, it still doesn't seem like the code in the vector package is on the path to deprecation, since @tianchen92 is continuing to put functionality there?

Contributor Author
liyafan82 commented Sep 7, 2019

My suggestion is to not include decoding in this code at all, but to centralize it in one place.

Sounds reasonable. I will revise the code accordingly later.

Also, as it stands now, it still doesn't seem like the code in the vector package is on the path to deprecation, since @tianchen92 is continuing to put functionality there?

Sure. No problem. We will do it only when the timing is right.

Contributor Author
liyafan82 commented Sep 9, 2019

@emkornfield
I have removed the decode API from the encoder class, and the EncodeUtil class is also removed.
In the tests, the decode functionality is based on the DictionaryEncoder#decode static method in the vector module.
Please take a look.

@liyafan82 liyafan82 force-pushed the fly_0712_encode branch 2 times, most recently from 6356468 to e4a5295 on August 16, 2019, 11:53
@liyafan82 liyafan82 closed this Aug 16, 2019
@liyafan82 liyafan82 reopened this Aug 16, 2019
// vectors are identical, so we can improve performance by copying
// the buffer directly.
if (!encodeNull) {
// make sure the output vector is large enough
Contributor

Does this replicate code in setValueCount?

Contributor Author

Good point. Thank you.
setValueCount will do the necessary reallocation.

output.reAlloc();
}

PlatformDependent.copyMemory(input.getValidityBuffer().memoryAddress(),
Contributor

This would already be set in the loop above though, or in setValueCount if all values were null?

Contributor Author

If the output vector is a fresh vector, all its validity bits are clear by default.
If it is reused from a previous run, its validity bits are in an undefined state.

To make it safe, should we copy the validity buffer from input to output?

Contributor

I don't think that is right; we should either require that the input vector be in a new state or call allocateNew on it before starting the decode process (see the docs on ValueVector).

Contributor Author

Thanks for the good points. I agree with your points in general.
Each vector has its own life cycle, and by respecting the life cycle, the application code becomes easier to maintain and reason about.

I have revised the code, and made the requirements explicit in the javadoc.
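As a rough usage sketch of that lifecycle requirement (the decoder call itself is elided; the allocation pattern is the point, and the vector names are my own):

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

class LifecycleSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         IntVector indices = new IntVector("indices", allocator);
         VarCharVector decoded = new VarCharVector("decoded", allocator)) {

      // ... populate the indices vector and build the dictionary here ...

      // The output vector must start from a clean state: either a freshly created
      // vector, or one reset via allocateNew(), so no stale validity bits survive.
      decoded.allocateNew();

      // decoder.decode(indices, decoded);  // hypothetical call into the code under review
      decoded.setValueCount(indices.getValueCount());
    }
  }
}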

@liyafan82 liyafan82 force-pushed the fly_0712_encode branch 3 times, most recently from 7648fdc to c9cb12f on August 20, 2019, 05:45
}

@Test
public void testEncodeLargeVector() {
Contributor

what is this test testing?

Contributor Author

It tests the scenario where the vector to encode is large. In this test, the vector has a length of 10000.
This test was mainly copied from a test in the vector module, per one of your previous suggestions.
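For reference, here is a condensed, JUnit 4 style sketch of such a test. The SearchDictionaryEncoder constructor follows the javadoc in this PR, while the encode(input, output) call, the comparator helper, and the package layout are my assumptions:

import java.nio.charset.StandardCharsets;

import org.apache.arrow.algorithm.dictionary.SearchDictionaryEncoder;  // package assumed
import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.junit.Assert;
import org.junit.Test;

public class LargeVectorEncodeSketchTest {

  private static final int VECTOR_LENGTH = 10000;
  private static final int DICTIONARY_LENGTH = 10;

  @Test
  public void testEncodeLargeVector() {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector dictionary = new VarCharVector("dictionary", allocator);
         VarCharVector input = new VarCharVector("input", allocator);
         IntVector encoded = new IntVector("encoded", allocator)) {

      // The dictionary must be in sorted order for the search based encoder ("0" .. "9").
      dictionary.allocateNew();
      for (int i = 0; i < DICTIONARY_LENGTH; i++) {
        dictionary.setSafe(i, String.valueOf(i).getBytes(StandardCharsets.UTF_8));
      }
      dictionary.setValueCount(DICTIONARY_LENGTH);

      // A large input vector whose values all come from the dictionary.
      input.allocateNew();
      for (int i = 0; i < VECTOR_LENGTH; i++) {
        input.setSafe(i, String.valueOf(i % DICTIONARY_LENGTH).getBytes(StandardCharsets.UTF_8));
      }
      input.setValueCount(VECTOR_LENGTH);

      encoded.allocateNew(VECTOR_LENGTH);
      SearchDictionaryEncoder<IntVector, VarCharVector> encoder =
          new SearchDictionaryEncoder<>(
              dictionary, DefaultVectorComparators.createDefaultComparator(dictionary), false);
      encoder.encode(input, encoded);
      encoded.setValueCount(VECTOR_LENGTH);

      for (int i = 0; i < VECTOR_LENGTH; i++) {
        Assert.assertEquals(i % DICTIONARY_LENGTH, encoded.get(i));
      }
    }
  }
}

The assertions in the real test may differ; this sketch is only meant to convey the shape of the scenario.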

}
}

@Test(expected = IllegalStateException.class)
Contributor

use assertThrows instead of this annotation.

Contributor

Also, I think this should be IllegalArgumentException.

Contributor Author

use assertThrows instead of this annotation.

Revised. Thanks.

Contributor Author

Also, I think this should be IllegalArgumentException.

Sounds good. Thanks.


// the encoder should encode null, but no null in the dictionary
// an exception should be thrown.
SearchDictionaryEncoder<IntVector, VarCharVector> encoder =
Contributor

I don't think it's a problem that the dictionary doesn't have a null in it unless the vector getting encoded does have a null in it.

Contributor Author

Sounds reasonable.
The related logic is removed from the constructor, and if there is a null in the vector to encode, an exception will be thrown when encoding.


int index = VectorSearcher.binarySearch(dictionary, comparator, input, i);
if (index == -1) {
throw new IllegalStateException("The data element is not found in the dictionary: " + i);
Contributor

IllegalArgument is potentially better here.

Contributor Author

Sure. Revised.

Contributor
emkornfield left a comment

Thank you for the changes. I left a few more comments, and then I think this can be merged.

@emkornfield
Contributor

+1, thank you.
