Detect mismatches in begin and end tokens returned by JSON tokenizer FST #17471

Merged
merged 9 commits into from
Dec 19, 2024

@shrshi shrshi commented Dec 2, 2024

Description

Addresses #15820

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Dec 2, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 2, 2024
@shrshi shrshi added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Dec 2, 2024

shrshi commented Dec 2, 2024

/ok to test

@shrshi shrshi marked this pull request as ready for review December 2, 2024 15:00
@shrshi shrshi requested a review from a team as a code owner December 2, 2024 15:00
@shrshi shrshi requested review from bdice and vuule December 2, 2024 15:00
cuda::proclaim_return_type<std::uint8_t>([] __device__(auto token) -> std::uint8_t {
  std::uint8_t token_bits{0};
  switch (token) {
    case token_t::StructBegin: [[fallthrough]];
Contributor

No need for [[fallthrough]] when the case is empty. I misunderstood that guideline at first so we have a lot of unnecessary [[fallthrough]]s in the code 😬
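
To illustrate the point about empty cases, here is a minimal host-side sketch. The enum and both functions are hypothetical stand-ins that only borrow token names from the diff context; they are not the code under review. Consecutive empty case labels share the statements that follow them, so the attribute is only useful when a case runs a statement and then deliberately continues into the next one.

```cpp
#include <cstdint>

// Hypothetical token set; only the names are borrowed from the diff context.
enum class token_t : std::uint8_t { StructBegin, ListBegin, StructEnd, ListEnd, ValueBegin };

// Returns 1 for begin tokens, 2 for end tokens, 0 otherwise.
constexpr std::uint8_t classify(token_t token)
{
  switch (token) {
    // Empty case labels simply stack up: no [[fallthrough]] needed.
    case token_t::StructBegin:
    case token_t::ListBegin: return 1;
    case token_t::StructEnd:
    case token_t::ListEnd: return 2;
    default: return 0;
  }
}

// Hypothetical example of a case that does need the attribute.
constexpr int nesting_weight(token_t token)
{
  int weight = 0;
  switch (token) {
    case token_t::StructBegin:
      weight += 1;
      [[fallthrough]];  // this case has a statement and deliberately continues
    case token_t::ListBegin:
      weight += 1;
      break;
    default: break;
  }
  return weight;
}
```

Both forms compile as C++17; the first is the shorter way to group cases that share a body.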

  CUDF_EXPECTS(h_tokens[0] == token_t::ValueBegin,
               "Some begin token does not have a matching end token");
} else {
  auto not_ok = thrust::transform_reduce(
Contributor

Does this work when we have two (same) begin tokens without any end tokens?

Contributor Author

This logic has been replaced by a check on the logical stack: if the logical stack is non-empty after the bracket-brace FST, then the last record is incomplete.

shrshi commented Dec 16, 2024

When the last JSONL record in the input is incomplete and error recovery is set to FAIL, no error token is emitted after the entire input is read, which gives rise to the above bug. Even though a newline character is inserted after the last record, error tokens are emitted on a line break only when recover_with_null is enabled. Instead of modifying the translation table for the PDA, a simpler approach for detecting whether the last line is incomplete is to inspect the logical stack after the bracket-brace FST. For valid inputs the stack should be empty, so a non-empty stack indicates that the last line is incomplete.
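
The idea of the stack check can be sketched on the host as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the loop is a sequential stand-in for the bracket-brace pass (the real pass runs as an FST on the GPU), and mismatched closers or unterminated strings are left to other checks.

```cpp
#include <string_view>
#include <vector>

// Hypothetical host-side stand-in for the logical stack left behind by the
// bracket-brace pass. Returns the brackets still open at the end of the input.
std::vector<char> open_brackets_at_end(std::string_view input)
{
  std::vector<char> stack;
  bool in_string = false;
  bool escaped   = false;
  for (char c : input) {
    if (in_string) {
      // Brackets inside string values must not touch the stack.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    switch (c) {
      case '"': in_string = true; break;
      case '{':
      case '[': stack.push_back(c); break;
      case '}':
      case ']':
        // Closer/opener mismatches are out of scope for this sketch.
        if (!stack.empty()) { stack.pop_back(); }
        break;
      default: break;
    }
  }
  return stack;
}

// True when the input ends with brackets still open, i.e. the last record
// never completed.
bool last_record_incomplete(std::string_view input)
{
  return !open_brackets_at_end(input).empty();
}
```

Under these assumptions, an input such as `{"a": 1}` followed by a line containing only `{"b": [1, 2` leaves two entries on the stack and is flagged. Two identical begin tokens with no end tokens, as in `{{`, also leave two entries, which is the case raised in the review.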

shrshi commented Dec 19, 2024

/merge

@rapids-bot rapids-bot bot merged commit dfb7c11 into rapidsai:branch-25.02 Dec 19, 2024
105 checks passed
shrshi added a commit to shrshi/cudf that referenced this pull request Dec 29, 2024