Commit
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574)

Adds type inference and type conversion for leaf-columns to the nested JSON parser.

**Note to the reviewers**: There are two different stages of quote-stripping here:

1. Including/excluding quotes in the tokenizer stage (currently always set to `true` using a `constexpr bool`)
2. Including/excluding quotes in the type conversion stage

Currently, we always include quotes in the tokenizer stage (1), so that the type casting stage (2) can differentiate between string values and literals (e.g. `[true, "true"]`) and, based on the user-provided choice in `json_reader_options::keep_quotes`, can strip off the quotes or keep them in the values returned to the user.

**In addition to adding type inference and type casting:**

- Switches the logic for inferring nested columns: any column with at least one nested item (list or struct) is inferred as that respective nested column, and all other _non-nested_ items of that column become invalid. E.g., `[null,{"a":1},"foo"] => List<Struct<a:int>> with struct col validity: 0, 1, 0`
- Adds the `keep_quotes` option to differentiate between string values and numeric & literal values (like `123.4`, `true`, `false`, `null`).
- Migrates the libcudf test to a cudf test to avoid having large byte BLOBs in the source file.
- Changes the column order to match the behaviour of pandas and the existing JSON lines reader.
  That is, column order corresponds to the order in which the columns were discovered: `[{"b":1, "c":1}, {"a":1}] => order: <b, c, a>`
- Adds support for escape sequences (see below).

## Performance comparison

### Tokenizer

The following is a comparison of the **JSON tokenizer** stage before and after this PR:

#### Before

```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  |
|-------------------|---------|-----------|-------|-----------|-------|----------|
|    2^20 = 1048576 |   2176x |  2.489 ms | 9.62% |  2.480 ms | 9.61% | 422.729M |
|    2^21 = 2097152 |   1936x |  2.501 ms | 7.14% |  2.492 ms | 7.12% | 841.482M |
|    2^22 = 4194304 |   1152x |  2.612 ms | 5.43% |  2.604 ms | 5.42% |   1.611G |
|    2^23 = 8388608 |   1456x |  2.855 ms | 4.26% |  2.847 ms | 4.23% |   2.947G |
|   2^24 = 16777216 |   1104x |  3.395 ms | 5.34% |  3.387 ms | 5.33% |   4.954G |
|   2^25 = 33554432 |    560x |  4.410 ms | 2.25% |  4.402 ms | 2.25% |   7.623G |
|   2^26 = 67108864 |   1552x |  6.482 ms | 2.23% |  6.473 ms | 2.22% |  10.367G |
|  2^27 = 134217728 |   1435x | 10.430 ms | 2.70% | 10.422 ms | 2.70% |  12.879G |
|  2^28 = 268435456 |    815x | 18.396 ms | 1.95% | 18.387 ms | 1.95% |  14.599G |
|  2^29 = 536870912 |     15x | 34.389 ms | 0.42% | 34.381 ms | 0.42% |  15.615G |
| 2^30 = 1073741824 |     11x | 66.097 ms | 0.20% | 66.088 ms | 0.20% |  16.247G |
```

#### After

```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-------------------|---------|------------|--------|------------|--------|----------|
|    2^20 = 1048576 |   1408x |   2.600 ms | 11.28% |   2.592 ms | 11.26% | 404.547M |
|    2^21 = 2097152 |    800x |   2.838 ms |  7.68% |   2.829 ms |  7.67% | 741.243M |
|    2^22 = 4194304 |   2752x |   3.719 ms |  9.24% |   3.710 ms |  9.23% |   1.130G |
|    2^23 = 8388608 |    128x |   4.855 ms |  3.38% |   4.846 ms |  3.37% |   1.731G |
|   2^24 = 16777216 |    720x |   7.029 ms |  4.67% |   7.021 ms |  4.66% |   2.390G |
|   2^25 = 33554432 |    832x |  10.760 ms |  3.83% |  10.751 ms |  3.83% |   3.121G |
|   2^26 = 67108864 |    576x |  17.961 ms |  2.86% |  17.953 ms |  2.86% |   3.738G |
|  2^27 = 134217728 |    461x |  32.550 ms |  2.13% |  32.542 ms |  2.13% |   4.124G |
|  2^28 = 268435456 |    243x |  61.813 ms |  1.60% |  61.805 ms |  1.60% |   4.343G |
|  2^29 = 536870912 |    125x | 120.445 ms |  1.21% | 120.437 ms |  1.21% |   4.458G |
| 2^30 = 1073741824 |     66x | 228.833 ms |  0.75% | 228.825 ms |  0.75% |   4.692G |
```

### JSON Parser

The overall parser performance is impacted, as we're now also doing type conversion instead of just returning string columns.

#### Before

```
# Benchmark Results

## nested_json_gpu_parser

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1040x |   7.361 ms | 5.61% |   7.353 ms | 5.61% | 142.614M |
|    2^21 = 2097152 |    832x |  11.549 ms | 3.63% |  11.541 ms | 3.63% | 181.708M |
|    2^22 = 4194304 |    740x |  20.264 ms | 2.98% |  20.257 ms | 2.98% | 207.054M |
|    2^23 = 8388608 |    407x |  36.844 ms | 2.26% |  36.837 ms | 2.26% | 227.724M |
|   2^24 = 16777216 |     80x |  75.590 ms | 1.95% |  75.582 ms | 1.95% | 221.974M |
|   2^25 = 33554432 |     80x | 179.442 ms | 4.40% | 179.434 ms | 4.40% | 187.001M |
|   2^26 = 67108864 |     40x | 379.821 ms | 0.98% | 379.815 ms | 0.98% | 176.688M |
|  2^27 = 134217728 |     20x | 777.351 ms | 1.72% | 777.347 ms | 1.72% | 172.661M |
|  2^28 = 268435456 |     10x |    1.550 s | 0.99% |    1.550 s | 0.99% | 173.212M |
|  2^29 = 536870912 |      5x |    3.055 s | 0.41% |    3.055 s | 0.41% | 175.749M |
| 2^30 = 1073741824 |      3x |    6.315 s |  inf% |    6.315 s |  inf% | 170.018M |
```

#### After

```
|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1568x |   7.908 ms | 5.24% |   7.900 ms | 5.24% | 132.730M |
|    2^21 = 2097152 |    576x |  12.235 ms | 3.24% |  12.228 ms | 3.24% | 171.509M |
|    2^22 = 4194304 |    192x |  21.171 ms | 2.09% |  21.164 ms | 2.09% | 198.182M |
|    2^23 = 8388608 |     96x |  38.990 ms | 1.96% |  38.983 ms | 1.96% | 215.188M |
|   2^24 = 16777216 |    192x |  78.414 ms | 2.21% |  78.407 ms | 2.21% | 213.977M |
|   2^25 = 33554432 |     81x | 187.007 ms | 6.47% | 187.000 ms | 6.47% | 179.435M |
|   2^26 = 67108864 |     38x | 400.007 ms | 1.59% | 400.000 ms | 1.59% | 167.772M |
|  2^27 = 134217728 |     19x | 801.575 ms | 1.29% | 801.571 ms | 1.29% | 167.443M |
|  2^28 = 268435456 |     10x |    1.590 s | 0.42% |    1.590 s | 0.42% | 168.799M |
|  2^29 = 536870912 |      5x |    3.150 s | 0.40% |    3.150 s | 0.40% | 170.456M |
| 2^30 = 1073741824 |      3x |    6.402 s |  inf% |    6.402 s |  inf% | 167.712M |
```

## Supported escape sequences:

```
\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points on the BMP
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, representing UTF-16 surrogate pairs for remaining unicode code points
```

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11574
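
To make the list under "Supported escape sequences" concrete, the following is a minimal host-side sketch in Python of how those sequences map to characters, including the combination of a UTF-16 surrogate pair into a single code point. It is not the libcudf implementation: the names `SIMPLE_ESCAPES`, `read_hex4`, and `decode_json_escapes` are hypothetical, and the error handling (raising `ValueError`) is an assumption made for illustration rather than the parser's actual behaviour on malformed input.

```python
# Illustrative sketch only; all names here are hypothetical, not libcudf APIs.

# Two-character escapes and the characters they represent.
SIMPLE_ESCAPES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def read_hex4(s, pos):
    """Parse exactly four hex digits starting at s[pos]; return the code unit."""
    digits = s[pos:pos + 4]
    if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
        raise ValueError(f"invalid \\u escape at offset {pos}")
    return int(digits, 16)


def decode_json_escapes(s):
    """Decode escapes in a JSON string body (enclosing quotes already removed)."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch != '\\':
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError("dangling backslash at end of string")
        esc = s[i + 1]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc == 'u':
            unit = read_hex4(s, i + 2)
            i += 6  # consumed backslash, 'u', and four hex digits
            if 0xD800 <= unit <= 0xDBFF:
                # High surrogate: a \uDDDD low surrogate must follow.
                if s[i:i + 2] != '\\u':
                    raise ValueError("unpaired high surrogate")
                low = read_hex4(s, i + 2)
                if not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError("invalid low surrogate")
                i += 6
                # Combine the pair into one code point beyond the BMP.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            out.append(chr(unit))
        else:
            raise ValueError(f"unsupported escape \\{esc}")
    return ''.join(out)
```

For example, `decode_json_escapes(r'\uD83D\uDE00')` combines the two code units into the single code point U+1F600, while `decode_json_escapes(r'\u00e9')` yields a single BMP character directly.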