Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add optional 'castToType' parameter to 'auto' column schema #15417

Merged
merged 7 commits into from
Nov 29, 2023

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 23, 2023

Description

This PR updates the 'auto' schema to add a new castToType parameter which accepts a ColumnType to allow forcing the 'auto' indexer to cast values to a specific type. One of the main use cases for this is MSQ array ingestion, which can now specify the exact desired type, but also allows known type columns to use all of the new features provided by 'auto' columns such as numeric indexes and array type columns for any other native ingestion. This only works for primitive/primitive array types, and will be ignored if any other type is used. The column part serde stores an extra enforceLogicalType which is used to ensure the correct casting schema is created for stuff like auto-compaction.

Release note

tbd


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Nov 23, 2023
@@ -107,7 +107,7 @@ public Object getObject()
if (expressionType == null) {
return delegateDimensionSelector.getObject();
}
return ExprEval.ofType(expressionType, delegateDimensionSelector.getObject()).value();
return ExprEval.bestEffortOf(delegateDimensionSelector.getObject()).castTo(expressionType).value();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the reason behind this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug i noticed while testing some stuff with MSQ, ofType assumes that the typing information is absolutely correct and so will do what you tell it, which ends up doing some not great things. For example, without bestEffort/explicit cast, ingesting some column specifieid as ARRAY<STRING> in an extern in MSQ that has some rows with arrays of nested objects will do something like this
Screenshot 2023-11-21 at 8 26 03 PM which should instead be a parse exception

@@ -73,7 +73,7 @@ public static DimensionSchema createDimensionSchema(
.getDimensionSchema(capabilities);
}

return new AutoTypeColumnSchema(column);
return new AutoTypeColumnSchema(column, type != null && type.is(ValueType.COMPLEX) ? type : null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be
type != null && !type.is(ValueType.COMPLEX)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you can't cast to complex types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops yea, or should just leave it null

return new EncodedKeyComponent<>(StructuredData.wrap(eval.value()), effectiveSizeBytes);
}
catch (IAE invalidCast) {
throw new ParseException(eval.asString(), invalidCast, "Cannot coerce to requested type [%s]", castToType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we include the field / column name here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable, but will make sure whatever catches these doesn't already do something like that

@@ -148,20 +152,26 @@ public void writeMergedValueDictionary(List<IndexableAdapter> adapters) throws I
final FieldTypeInfo.MutableTypeSet rootTypes = mergedFields.get(NestedPathFinder.JSON_PATH_ROOT);
final boolean rootOnly = mergedFields.size() == 1 && rootTypes != null;

final ColumnType explicitType;
if (castToType != null && (castToType.isPrimitive() || castToType.isPrimitiveArray())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you actually need the (castToType.isPrimitive() || castToType.isPrimitiveArray()) check here or is it just defensive coding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defensive

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clintropolis clintropolis merged commit 97623b4 into apache:master Nov 29, 2023
83 checks passed
@clintropolis clintropolis deleted the explicit-auto-type branch November 29, 2023 01:19
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Area - Segment Format and Ser/De
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants