Add JSONL read support #1447
Flow PHP - Benchmarks
Results of the benchmarks from this PR are compared with the results from 1.x branch.
Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.792mb +0.11% | 562.003ms -3.08% | ±1.01% +773.57% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 4.865mb +0.11% | 1.055s -1.84% | ±0.86% -66.37% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 86.306mb +0.01% | 900.726ms -1.75% | ±0.93% +43.58% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.522mb +0.12% | 35.156ms -1.72% | ±1.40% +955.62% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.497mb +0.12% | 605.239ms -0.28% | ±0.42% -55.10% |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 127.319mb +0.00% | 71.041ms -4.28% | ±0.42% -58.09% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 63.990mb +0.01% | 103.139ms -2.06% | ±0.42% -81.31% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 84.338mb +0.00% | 96.462ms -3.94% | ±0.23% -71.63% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 161.177mb +0.00% | 20.587s -2.37% | ±0.34% -3.86% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.989mb +0.03% | 31.396ms -2.27% | ±0.43% -19.01% |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 105.961mb +0.01% | 455.617ms -2.06% | ±1.11% +104.70% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 55.152mb +0.01% | 235.324ms +2.19% | ±1.24% +62.18% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 14.674mb +0.04% | 51.174ms -0.01% | ±1.27% +116.60% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 97.005mb +0.01% | 3.624ms +13.41% | ±0.89% +7.92% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 114.287mb +0.00% | 186.569ms -1.72% | ±0.20% -71.42% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 97.007mb +0.01% | 19.384ms +0.53% | ±0.80% +29.48% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 97.880mb +0.01% | 1.843ms +15.80% | ±2.56% +19.53% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 97.880mb +0.01% | 1.814ms +17.74% | ±1.61% -40.00% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 96.041mb +0.01% | 5.035ms +12.47% | ±0.38% -85.97% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 96.570mb +0.01% | 17.315ms +0.61% | ±1.11% -67.32% |
| RowsBench | bench_find_on_10k | 2 | 3 | 96.570mb +0.01% | 17.084ms -2.06% | ±1.06% +119.01% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 95.261mb +0.01% | 1.906μs -4.70% | ±2.44% +0.00% |
| RowsBench | bench_first_on_10k | 10 | 3 | 95.261mb +0.01% | 0.400μs 0.00% | ±0.00% 0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 104.479mb +0.01% | 15.142ms +3.71% | ±1.10% +32.10% |
| RowsBench | bench_map_on_10k | 2 | 3 | 134.546mb +0.00% | 72.673ms -7.04% | ±0.33% -14.42% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 97.089mb +0.01% | 1.488ms +1.20% | ±2.59% +85.76% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 100.386mb +0.01% | 64.687ms -4.46% | ±0.57% +52.50% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 98.142mb +0.01% | 3.929ms -4.32% | ±3.46% +12.34% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 95.549mb +0.01% | 42.636ms -4.20% | ±0.39% -18.33% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 95.549mb +0.01% | 42.850ms -1.17% | ±0.48% +12.09% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 95.549mb +0.01% | 42.807ms -3.04% | ±2.10% +124.46% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 97.701mb +0.01% | 8.370ms -0.74% | ±1.60% +143.23% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 95.451mb +0.01% | 30.413ms +1.10% | ±0.92% -46.49% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 95.261mb +0.01% | 14.146μs -17.18% | ±3.20% +222.88% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 95.261mb +0.01% | 15.900μs -15.80% | ±0.00% -100.00% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 114.288mb +0.00% | 192.255ms +1.06% | ±1.09% +243.26% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 43.797mb +0.01% | 365.167ms +1.32% | ±0.85% +118.40% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 11.607mb +0.05% | 73.279ms +0.23% | ±1.62% -13.37% |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
For the tests I just replicated the existing ones. It might be cleaner to use a DataProvider so there is not so much duplication; if you want, I can switch them over. As for the implementation, I thought it simpler to change the reading of the JSON lines to match what the original JSON behaviour does, that is, it iterates out one complete object at a time. When a pointer is used, it's assumed (as currently) that it points at an array subentry.
I actually don't mind that duplication,
From my point of view, the pointer should either be skipped when we are working with JSONL, or we can extract separate JsonLinesExtractor and JsonLinesLoader classes. The only way I can see to make the pointer work here is to still read the JSONL line by line and then apply the pointer to each line. However, the JsonMachine library that we use under the hood does not seem to support JSONL (though I only briefly checked their README), so we might be better off implementing JSONL without support for pointers (they also solve pretty much the same problem, just in a different way). So the question is: how do you feel about extracting the JSON Lines logic into new classes? It would be something like this:
just without the pointer option on JsonLinesExtractor.
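The snippet from the comment above did not survive on this page. As an illustration only, the line-by-line reading it talks about could be sketched with PHP built-ins as below. This is not Flow's JsonLinesExtractor and it does not use JsonMachine; `read_json_lines` is a hypothetical helper and the sample data is made up.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Flow): yields one decoded array per
 * non-empty line of a JSONL stream.
 *
 * @param resource $stream an open, readable stream
 */
function read_json_lines($stream): \Generator
{
    while (($line = fgets($stream)) !== false) {
        $line = trim($line);

        if ($line === '') {
            continue; // blank lines carry no row
        }

        // Each line is a complete JSON document on its own.
        yield json_decode($line, true, 512, JSON_THROW_ON_ERROR);
    }
}

// Made-up sample data written to an in-memory stream.
$stream = fopen('php://memory', 'w+');
fwrite($stream, implode("\n", ['{"id":1,"name":"a"}', '', '{"id":2,"name":"b"}']) . "\n");
rewind($stream);

$rows = iterator_to_array(read_json_lines($stream), false);
fclose($stream);
```

Because every line is decoded in isolation, memory use stays bounded by the longest line rather than by the file size, which is the property a streaming extractor needs.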
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 1.x #1447 +/- ##
==========================================
+ Coverage 83.00% 83.02% +0.02%
==========================================
Files 659 661 +2
Lines 17727 17805 +78
==========================================
+ Hits 14714 14783 +69
- Misses 3013 3022 +9
Yup, I like the sound of that; I was thinking along the same lines.
Actually, I do remember why I thought a pointer might still be useful for JSONL: it could still save one level of transformation. For example, say you have a JSONL file with lines like { data: [], pagination: [] }. You might just want to transform the "data" elements into your dataframe rows. The main difference from a JSONMachine point of view is iterating over an object vs. an array, as mentioned above.
You can even push it directly here, it would help us to keep the conversation history in one place in case anyone would like to know why we decided to separate them 😁
Yeah, that's a good point. We can keep it but apply it independently on each row.
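To make "apply it independently on each row" concrete, here is a rough sketch in plain PHP (8.0+). It is an assumption-laden illustration, not the code merged in this PR: `apply_pointer` and `rows_from_jsonl` are hypothetical names, the pointer handling covers only simple paths such as `/data` (no `~0`/`~1` escapes), and the pointer is assumed to target an array.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: resolves a simple JSON Pointer (e.g. "/data")
 * against one decoded JSONL line. Escapes (~0, ~1) are not handled.
 */
function apply_pointer(array $decoded, string $pointer): mixed
{
    $current = $decoded;
    $segments = array_filter(explode('/', $pointer), static fn (string $s): bool => $s !== '');

    foreach ($segments as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            throw new \InvalidArgumentException("Pointer {$pointer} not found in line");
        }

        $current = $current[$segment];
    }

    return $current;
}

/**
 * Hypothetical helper: decodes each line separately, resolves the pointer
 * on that line alone, and yields the elements it points at as rows.
 */
function rows_from_jsonl(string $jsonl, string $pointer): \Generator
{
    foreach (preg_split('/\R/', $jsonl, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $decoded = json_decode($line, true, 512, JSON_THROW_ON_ERROR);

        // Assumes the pointer targets an array, as in the "data" example.
        foreach (apply_pointer($decoded, $pointer) as $row) {
            yield $row;
        }
    }
}

// Made-up input shaped like the { data: [], pagination: [] } example.
$jsonl = '{"data":[{"id":1},{"id":2}],"pagination":[]}' . "\n"
    . '{"data":[{"id":3}],"pagination":[]}';

$rows = iterator_to_array(rows_from_jsonl($jsonl, '/data'), false);
```

With the pointer resolved per line, the "pagination" key is dropped and only the "data" elements become rows, which saves the extra transformation step mentioned above.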
Alright, I've separated out the JSONLExtractor; see what you think. Once you're happy, I'll do the loader separation.
I like it a lot! I left a few minor comments, but nothing critical, just irrelevant details.
I think I have addressed all the feedback, and I also extracted the JsonLinesLoader. Thanks for the prompt feedback, it's handy.
Thank you @jmortlock, looks awesome!! 🤩
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description
Closes #1443
Support reading from JSONL based data sources.