-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query range fields by doc values when they are expected to be more efficient than points #24823
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good overall. My main concern is that given how the encoding works, we have to deserialize the ranges at query time back to two Comparable
instances. I think we should design it in such a way that queries can work directly on the encoded representation?
@Override | ||
public boolean matches() throws IOException { | ||
BytesRef encodedRanges = values.binaryValue(); | ||
NormalizedRange[] ranges = rangeType.decodeRanges(encodedRanges); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we work directly on the encoded bytes instead of decoding?
|
||
@Override | ||
public int hashCode() { | ||
return classHash() + Objects.hash(fieldName, queryType, expectedRange); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I'd prefer return Objects.hash(getClass(), fieldName, queryType, expectedRange);
which is supposed to be less subject to collisions.
public BytesRef encodeRanges(Set<Range> ranges) throws IOException { | ||
final byte[] encoded = new byte[ByteUtils.MAX_BYTES_VLONG + (16 * 2) * ranges.size()]; | ||
ByteArrayDataOutput out = new ByteArrayDataOutput(encoded); | ||
out.writeVInt(ranges.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we use a fixed length, maybe we do not need to encode the number of values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see it is to be consistent with other types which use vlong
The idea here is that we can save some space by encoding the ranges as two vlongs. Downside is that we can't work on the encoded representation. Do you think that the vlong encoding isn't worth the effort here? Personally I'm ok with changing the encoding here, so that we don't to encode at query time, but just wondering if we should make a tradeoff between space or query speed. |
You convinced me that we need to perform some sort of compression for numeric values. However I think we should use an encoding that doesn't require the query to know whether the field stores longs or ip addresses? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments.
} | ||
if (includeTo) { | ||
to = rangeType.nextUp(to); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit backward to me, we would usually rather do
if (includeFrom == false) {
from = rangeType.nextUp(from);
}
Do I miss something?
ByteArrayDataInput in = new ByteArrayDataInput(encodedRanges.bytes, encodedRanges.offset, encodedRanges.length); | ||
int numRanges = in.readVInt(); | ||
for (int i = 0; i < numRanges; i++) { | ||
int length = in.readVInt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first bytes are not a vint when we encode eg. longs? So I don't think it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In RangeFieldMapper.RangeType#encodeLongRanges(...)
each encoded value is preceded by a vint.
length = in.readVInt(); | ||
bytes = new byte[length]; | ||
in.readBytes(bytes, 0, length); | ||
BytesRef otherTo = new BytesRef(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to avoid ByteArrayDataInput and BytesRef allocations since this method may be called in a tight loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove the ByteRef instance creation and use just plain byte[].
The reason I used ByteArrayDataInput
is because of its readVInt()
and readBytes()
methods. Are there static alternatives that I can use? Otherwise maybe it is sufficient enough if I keep a spare instance around that I keep reusing?
// does not disjoint AND not within: | ||
result = (from.compareTo(otherTo) > 0 || to.compareTo(otherFrom) < 0) == false && | ||
(from.compareTo(otherFrom) <= 0 && to.compareTo(otherTo) >= 0) == false; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we put a method on the queryType enum instead of this switch?
} | ||
return encoded; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering that putting these helpers into a dedicated class would make things easier to read?
if (sign == 1) { | ||
encoded[encoded.length - 1 - b] = (byte) (l >>> (8 * b)); | ||
} else if (sign == 0) { | ||
encoded[encoded.length - 1 - b] = (byte) (0xFF - (l >>> (8 * b))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we miss a mask here: encoded[encoded.length - 1 - b] = (byte) (0xFF - ((l >>> (8 * b)) & 0xFF));
} | ||
long max = Long.MAX_VALUE / 2; | ||
return (max + max) * random().nextLong() - max; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had similar random number generation in eg. TestIntRangeFieldQueries
but this proved inefficient at finding bugs (eg. it makes it very unlikely to get twice the same value in the same test) so we changed it in https://issues.apache.org/jira/browse/LUCENE-7847. Maybe we should do something similar here (as well as for doubles, ip addresses, etc.)
private final QueryType queryType; | ||
private final NormalizedRange expectedRange; | ||
|
||
public BinaryDocValuesRangeQuery(String fieldName, RangeFieldMapper.RangeType rangeType, QueryType queryType, Object from, Object to, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid passing the range type? I think it would be more robust if it does not need to know about the type of values that are stored and works directly on the encoded data?
@jpountz Thanks for reviewing. I've updated the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good overall, I left some comments. I'm wondering whether we should sort the ranges before encoding them as it might allow for optimizations in the multi-valued case eg. if the max value of the query is less than the min value of the first encoded range.
It would be nice to have some benchmarks as well in order to make sure that this actually performs better than point-based ranges when the range does not drive iteration (it might make more sense to do it after LUCENE-7828 is merged). I can help with that.
int numRanges = in.readVInt(); | ||
for (int i = 0; i < numRanges; i++) { | ||
otherFrom.length = in.readVInt(); | ||
in.readBytes(otherFrom.bytes, 0, otherFrom.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could avoid the copy by doing this instead:
otherFrom.bytes = encodedRanges.bytes;
otherFrom.offset = in.position();
in.skip(otherFrom.length);
and similarly for the to value?
return new BytesRef(encoded, 0, out.getPosition()); | ||
} | ||
|
||
static byte[] encode(long l) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add javadocs saying that this is variable-length encoding that preserves ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think we need some unit tests for that method in order to ensure it actually preserves ordering?
l = Double.doubleToRawLongBits(d); | ||
sign = 1; // means positive | ||
} | ||
return encode(l, sign); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doubles should probably use a different encoding, but we can work on it as a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think we should specialize floats too
@jpountz I've updated the pr.
I've added the sorting, but I think it will only make sense when there multiple adjacent ranges and when ranges are overlapping it wouldn't help that much?
Thanks! I guess we will need a dataset that when benchmarked with a specific query forces the range query to be executed in random access mode. |
Right I don't think this is a game changer for queries in general. However it might be the case for aggregations? For instance, say you want to run an histogram aggregation on a range field, it would be easier with sorted ranges, since you can easily track which buckets have already been incremented and which buckets haven't? |
I merged LUCENE-7828 yesterday so that we can better benchmark this change, but upgrading to a new Lucene 7 snapshot is currently blocked on #25028 since the postings highlighter has been removed from Lucene master. It should hopefully be merged in the next few days. |
c8f6f44
to
6aaab01
Compare
(Updated the benchmark to index the docs into arbitrary order and captured the 50th precentile instead of the 99th because it is less noisy) This are the 50th percentile latency results between master (baseline) and this pr (contender) of the new benchmark added to rally:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to do 2 things before merging:
- back compat in the mapper so that the pre-6.0 mappers know they do not have doc values
- sort ranges by from then to
|
||
@Override | ||
public String toString(String field) { | ||
return "BinaryDocValuesRangeQuery(fieldName=" + field + ")"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put from and to in the toSTring() ?
long r2From = ((Number) r2.from).longValue(); | ||
long r2To = ((Number) r2.from).longValue(); | ||
long middle2 = r2From + ((r2To - r2From) / 2); | ||
return Long.compare(middle1, middle2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather sort by from then to, feels like something that is easier to rely on.
@@ -194,7 +201,7 @@ public RangeFieldType(RangeType type) { | |||
super(); | |||
this.rangeType = Objects.requireNonNull(type); | |||
setTokenized(false); | |||
setHasDocValues(false); | |||
setHasDocValues(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to handle backcompat in the parser to make sure we use false
as a default value if the index was created before 6.0?
@jpountz I've updated the pr. |
…ficient than points. * Enable doc values for range fields by default. * Store ranges in a binary format that support multi field fields. * Added BinaryDocValuesRangeQuery that can query ranges that have been encoded into a binary doc values field. * Wrap range queries on a range field in IndexOrDocValuesQuery query. Closes elastic#24314
Closes #24314