Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add isJson UDF #12603

Merged
merged 5 commits into from
Mar 12, 2024
Merged

Add isJson UDF #12603

merged 5 commits into from
Mar 12, 2024

Conversation

tibrewalpratik17
Copy link
Contributor

@tibrewalpratik17 tibrewalpratik17 commented Mar 8, 2024

label:
feature

WHY ARE WE ADDING THIS?

  • This will be helpful in validating String-columns with json-indexes during ingestion time. We can use this UDF as a filter-transform config.
  • If we do not use skipInvalidJson in json-index config introduced in skip invalid json string rather than throwing error during json indexing #12238 then the ingestion stops in case of malformed / truncated json.
  • If we use skipInvalidJson in json-index, then ingestion in not stopped but the record becomes un-queryable with json_match queries anyways. We also end up persisting malformed data in the table which can potentially cause that particular primary-key to not come in query response at all (if using upsert). This is again an issue. Additionally, there is no log / metric to track such scenarios where skipInvalidJson got applied.

Adding a json-validator as a UDF so that we can use it in filterTransform config, add alerts on NUMBER_ROWS_FILTERED metric and using this #12602 immediately track the events which are malformed / truncated.

UDFS IN OTHER DBS

Saw few DBs have similar relatable functionalities too.

Did not find anything similar in Postgresql though.

Null value

Putting nulls as a valid json here as during partial-upserts we are okay with having nulls. They will be overwritten.
But open to suggestions on how can we handle nulls differently.

@codecov-commenter
Copy link

codecov-commenter commented Mar 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.77%. Comparing base (59551e4) to head (7d9eb40).
Report is 107 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12603      +/-   ##
============================================
+ Coverage     61.75%   61.77%   +0.02%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2453      +17     
  Lines        133233   133828     +595     
  Branches      20636    20753     +117     
============================================
+ Hits          82274    82676     +402     
- Misses        44911    45043     +132     
- Partials       6048     6109      +61     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.74% <100.00%> (+0.03%) ⬆️
java-21 61.66% <100.00%> (+0.03%) ⬆️
skip-bytebuffers-false 61.76% <100.00%> (+0.02%) ⬆️
skip-bytebuffers-true 61.63% <100.00%> (+33.90%) ⬆️
temurin 61.77% <100.00%> (+0.02%) ⬆️
unittests 61.77% <100.00%> (+0.02%) ⬆️
unittests1 46.91% <100.00%> (+0.02%) ⬆️
unittests2 27.68% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tibrewalpratik17 tibrewalpratik17 force-pushed the isJson branch 2 times, most recently from c2a4f27 to 4a55b7c Compare March 9, 2024 16:34
@Jackie-Jiang Jackie-Jiang merged commit f14f1a4 into apache:master Mar 12, 2024
19 checks passed
Copy link
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tibrewalpratik17 for the PR, left a comment for the clarification.


/**
* Checks whether the input string can be parsed into a json node or not. Useful for scenarios where we want
* to filter out malformed json. In case of nulls we return null itself, as null can be treated as valid json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement does not seem to be aligned with what is implemented and merged here. If the argument is null, it returns false as JsonUtils.stringToJsonNode(inputStr); throws an Exception.

The discussion in the comment thread is also confusing to what is finally added here. It is good to add the conclusion in that thread for readers if it is decided based on offline discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tibrewalpratik17 We may update the javadoc. null values are handled by the caller (function invoker), instead of within this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

* in partial-upsert scenarios. Upto the user to use null response accordingly.
*
* @param inputStr Input string to test for valid json
* @return in case of null value, it returns null. In case of non-null, it returns true in case of valid json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as earlier for a null argument.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants