-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better types #11713
better types #11713
Conversation
This pull request introduces 2 alerts when merging 7d433cb into 1370fcf - view on LGTM.com new alerts:
|
btw, this PR is going to have lots of conflicts since |
ExprType.STRING_ARRAY, | ||
14, | ||
ExpressionType.STRING_ARRAY, | ||
15, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes in values are because arrays are stored with an extra byte now (one for ARRAY
, one for element type).
I've taken some measurements to check to see if there is any performance regression from the changes in this PR, and it looks ok to me. I chose expressions to measure since non-vectorized expression processing is the most intensive use of type stuff since things are checked every row, the majority of other uses of type information is per segment or per query. before:
after:
|
I've been doing a lot of testing with a mix of historicals, brokers, and routers that do and do not have the changes of this PR to determine how queries will behaving during various upgrade scenarios. So far, the only issues I have run into are when you have updated brokers but old historicals (which is not the recommended upgrade order), and only for queries with subqueries which have array or complex types, e.g. something like:
If you have an updated broker but older historicals, you will observe an error of this form:
because the inline datasource which is pushed down to the historicals is not forgiving with type information. I'm not sure if this is a big deal, since we do not recommend updating in this order, and I haven't run into any query shapes that run into issues when following the recommended order, so I think we might be able to get away without having some configuration option to force serialization of arrays and complex types in the old way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for a nice PR. I'm looking forward to follow-ups 🙂. The benchmark result seems nice as I don't see any regression. The change in this PR looks good overall, I left one comment about documentation.
I'm not sure if this is a big deal, since we do not recommend updating in this order, and I haven't run into any query shapes that run into issues when following the recommended order, so I think we might be able to get away without having some configuration option to force serialization of arrays and complex types in the old way.
Thanks for doing upgrade tests! It sounds good to me. We can just document this as a known issue when the order doesn't match to the recommended one, maybe in the release notes.
import javax.annotation.Nullable; | ||
|
||
@JsonSerialize(using = ToStringSerializer.class) | ||
public class ColumnType extends BaseTypeSignature<ValueType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document somewhere how ColumnType
vs ExpressionType
vs ColumnCapabilities
are different and being used?
PR apache#11882 introduced a type comparison using ==, but while it was in flight, another PR apache#11713 changed the type enum to a class. So the comparison should properly be done with "equals".
Hi, @clintropolis . Thanks a lot for this great patch. May I ask is there any way to solve this compatibility problem, otherwise after adding this patch, all the fields except the |
Hmm, could you give some additional details? I don't observe this behavior when trying to reproduce locally, what is the output of |
@clintropolis Thanks for your comment. The reproduce process is that I first built a cluster with Druid 0.22.0, wrote a |
@clintropolis BTW, the commit id of the Druid I used is f6e6ca2 , which is recent enough I believe. |
Is the cluster fully updated? I think it would be possible for types to be displayed temporarily as unknown complex if the broker is upgraded before the historical servers because we've changed which segment metadata query field populates the types in the druid schema (#11895), but that is not the recommended upgrade order so I did not go out of my way to support it. It would probably be possible to fallback to the old type field on the broker if the upgrade is done out of order, but ideally the brokers should never be upgraded before historicals. |
@clintropolis Yes, the cluster has be updated fully. Cool, I got it. In addition to the upgrade of Broker and Historical nodes, the order of upgrade needs to be considered, what about the other 😅 |
If the order of upgrading is |
Are the types still incorrect in your cluster? The issue I mentioned I think would be temporary, though I guess it could take some time before the issue corrects itself because it populates reactively, but I think could be resolve immediately by restarting the brokers
I did a fair bit of compatibility testing around upgrade order so I believe so, though the bulk of it was before #11895 went in, which is the change that added a new field to the segment metadata query response to populate the types in |
@clintropolis Yes, I just rolling restarted Broker nodes, and the problem disappeared automatically. However, if the Broker is not restarted, the problem will persist. Thanks a lot for your comments. I think that it is useful to make more explanations in relevant documents to remind it, right 😄 |
Yeah, i'll make sure that we call out in the release notes that following suggested upgrade order is going to be important for 0.23 if I don't get around to adding a fallback to try to populate the type from the legacy field before the release. |
I created #12016 to reduce the blast radius of upgrading a cluster out of order after #11895 so it will at least no longer be stuck with |
@clintropolis Thanks a lot for completing the improvement so quickly. This is very useful. 👍 |
This change mimics what was done in PR apache#11917 to fix the incompatibilities produced by apache#11713. apache#11917 fixed it with AggregatorFactory by creating default methods to allow for extensions built against old jars to still work. This does the same for PostAggregator (cherry picked from commit eb0bae4)
This change mimics what was done in PR apache#11917 to fix the incompatibilities produced by apache#11713. apache#11917 fixed it with AggregatorFactory by creating default methods to allow for extensions built against old jars to still work. This does the same for PostAggregator (cherry picked from commit eb0bae4)
Description
This PR enriches the native engine type system to use a more powerful set of classes than the existing enumerations,
ValueType
andExprType
, which are primarily used right now.This is the start of a series of changes that will end up with 'complex metrics' being rebranded 'complex types' and being fully usable throughout the query system, including for expressions, grouping, and filtering.
To share the same structure and utilities when working with both the native engine and expressions,
TypeSignature
has been added to model basic type information, and is implemented by the new
ColumnType
andExpressionType
classes, as well asColumnCapabilities
.TypeDescriptor
is an interface shared by bothExprType
andValueType
,to expose some common facts about both sets of enums.
To aid in the construction of these types:
has also been added. This is interning the type objects so that they are re-used in the engine to prevent the creation of a large amount of garbage (operating on the assumption that there won't be that many types and that they will be frequently used).
Beyond these new structures, the vast majority of changes in this PR are around replacing all occurrences of
ValueType
withColumnType
, andExprType
withExpressionType
, and adjusting type checking code accordingly.JSON Serialization
To be as compatible as possible, serialization is currently done with a string based format.
LONG
,STRING
,FLOAT
,DOUBLE
all remain the same as they were whenValueType
was king, and serde as the appropriateColumnType
(orExpressionType
).Array types have been significantly reworked to take advantage of the new structure: instead of dedicated typed arrays,
LONG_ARRAY
etc,ValueType
andExprType
now have a singleARRAY
type, andelementType
contains a reference to the internal type. In the string serialized form, these now look likeARRAY<LONG>
,ARRAY<STRING>
, etc, but can translate the legacy values if encountered.Finally,
COMPLEX
will read into an unknown complex type, but if the type information is present, will serialize asCOMPLEX<typeName>
.I've added object JSON constructors too, in order to prepare for a future where we might want to use objects instead of this string based serde, but the strings are adequately flexible for now I think, since it will be possible to model all sorts of types that were not possible in the previous system, such as
ARRAY<ARRAY<COMPLEX<typeName>>
if we want to get wild with it.SQL and INFORMATION_SCHEMA
The only change in this PR that will be apparent to most users is that now that complex type information is preserved through-out the engine, the
INFORMATION_SCHEMA
columns table can display the complex type information instead ofOTHER
:The JDBC type for complex is still reported as
OTHER
because i need to do some additional investigation on if there is anything more appropriate (it has to be something injava.sql.Types
, but we could possibly consider setting the classname to the complex type instead ofjava.lang.Object
, I need to better learn the rules about how JDBC is supposed to behave)Future work
I think there is quite a lot of stuff we can do once this PR goes in, here is the short list of most immediate things I can think of.
Complex expressions
This is likely to be one of the first follow-up PRs I do after this, since the new type system will support making complex expressions, which is one of the last feature gaps of the expression system. After this there will be very little that cannot be done with expressions that is supported by the native engine.
Better input validation
There is a lot we could do after this PR to take advantage of the additional information it makes available. SQL planning and aggregator factories could both take advantage of the preserved complex type information to perform better validation on their inputs, and better behavior or error messaging whenever inappropriate types have been used. Similarly, this additional type information also should make it easier to make complex aggregator factories which are not split into separate build and merge factories since we should have adequate type information to be able to build the correct type of aggregator by examining the input type information.
Better arrays
The updated type system in this PR supports modeling nested arrays and arrays of complex types, but the array expressions have not been updated to actually create or support them.
Key changed/added classes in this PR
ValueType
ExprType
TypeDescriptor
TypeFactory
TypeSignature
ColumnType
ColumnTypeFactory
ExpressionType
ExpressionTypeFactory
ColumnCapabilities
RowSignature
Calcites
RowSignatures
This PR has: