Add JSONL read support #1447

Merged
merged 9 commits into from
Feb 5, 2025

Conversation

jmortlock
Contributor

@jmortlock jmortlock commented Feb 5, 2025

Change Log

Added

  • Add read support for JSONL files

Fixed

Changed

Removed

Deprecated

Security


Description

Closes #1443

Support reading from JSONL based data sources.

Contributor

github-actions bot commented Feb 5, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.792mb +0.11%  | 562.003ms -3.08% | ±1.01% +773.57% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.865mb +0.11%  | 1.055s -1.84%    | ±0.86% -66.37%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 86.306mb +0.01% | 900.726ms -1.75% | ±0.93% +43.58%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.522mb +0.12%  | 35.156ms -1.72%  | ±1.40% +955.62% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.497mb +0.12%  | 605.239ms -0.28% | ±0.42% -55.10%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 127.319mb +0.00% | 71.041ms -4.28% | ±0.42% -58.09% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev         |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 63.990mb +0.01%  | 103.139ms -2.06% | ±0.42% -81.31% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 84.338mb +0.00%  | 96.462ms -3.94%  | ±0.23% -71.63% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 161.177mb +0.00% | 20.587s -2.37%   | ±0.34% -3.86%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.989mb +0.03%  | 31.396ms -2.27%  | ±0.43% -19.01% |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 105.961mb +0.01% | 455.617ms -2.06% | ±1.11% +104.70% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.152mb +0.01%  | 235.324ms +2.19% | ±1.24% +62.18%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.674mb +0.04%  | 51.174ms -0.01%  | ±1.27% +116.60% |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 97.005mb +0.01%  | 3.624ms +13.41%  | ±0.89% +7.92%   |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 114.287mb +0.00% | 186.569ms -1.72% | ±0.20% -71.42%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 97.007mb +0.01%  | 19.384ms +0.53%  | ±0.80% +29.48%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 97.880mb +0.01%  | 1.843ms +15.80%  | ±2.56% +19.53%  |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 97.880mb +0.01%  | 1.814ms +17.74%  | ±1.61% -40.00%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 96.041mb +0.01%  | 5.035ms +12.47%  | ±0.38% -85.97%  |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 96.570mb +0.01%  | 17.315ms +0.61%  | ±1.11% -67.32%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 96.570mb +0.01%  | 17.084ms -2.06%  | ±1.06% +119.01% |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 95.261mb +0.01%  | 1.906μs -4.70%   | ±2.44% +0.00%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 95.261mb +0.01%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 104.479mb +0.01% | 15.142ms +3.71%  | ±1.10% +32.10%  |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 134.546mb +0.00% | 72.673ms -7.04%  | ±0.33% -14.42%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 97.089mb +0.01%  | 1.488ms +1.20%   | ±2.59% +85.76%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 100.386mb +0.01% | 64.687ms -4.46%  | ±0.57% +52.50%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 98.142mb +0.01%  | 3.929ms -4.32%   | ±3.46% +12.34%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 95.549mb +0.01%  | 42.636ms -4.20%  | ±0.39% -18.33%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 95.549mb +0.01%  | 42.850ms -1.17%  | ±0.48% +12.09%  |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 95.549mb +0.01%  | 42.807ms -3.04%  | ±2.10% +124.46% |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 97.701mb +0.01%  | 8.370ms -0.74%   | ±1.60% +143.23% |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 95.451mb +0.01%  | 30.413ms +1.10%  | ±0.92% -46.49%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 95.261mb +0.01%  | 14.146μs -17.18% | ±3.20% +222.88% |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 95.261mb +0.01%  | 15.900μs -15.80% | ±0.00% -100.00% |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 114.288mb +0.00% | 192.255ms +1.06% | ±1.09% +243.26% |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 43.797mb +0.01%  | 365.167ms +1.32% | ±0.85% +118.40% |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.607mb +0.05%  | 73.279ms +0.23%  | ±1.62% -13.37%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@jmortlock
Contributor Author

For the tests I just replicated the existing ones. It might be cleaner to use a DataProvider so there is not so much duplication; if you want, I can switch it over.

As for the implementation, I thought it simpler to change the reading of the JSON lines to match what the original JSON behaviour does, that is, it iterates out one complete object at a time. When you use a pointer, it's assumed (as it is currently) that it's pointing at an array subentry.

@norberttech
Member

> For the tests I just replicated the existing ones. It might be cleaner to use a DataProvider so there is not so much duplication; if you want, I can switch it over.

I actually don't mind that duplication.

> As for the implementation, I thought it simpler to change the reading of the JSON lines to match what the original JSON behaviour does, that is, it iterates out one complete object at a time. When you use a pointer, it's assumed (as it is currently) that it's pointing at an array subentry.

From my point of view, the pointer should either be skipped when we are working with JSONL, or we can extract separate JsonLinesExtractor and JsonLinesLoader classes.

The only way to make the pointer work here, from my point of view, is to still read JSONL line by line and then apply the pointer to each line. However, the JsonMachine library that we use under the hood does not seem to support JSONL (though I only briefly checked their readme), so we might be better off making JSONL without support for pointers (they also pretty much solve the same problem, just in a different way).

So the question is, how do you feel about extracting Json Lines logic to new classes?

It would be something like this:

  • JsonLinesExtractor - from_json_lines()
  • JsonLinesLoader - to_json_lines()

just without the pointer option on JsonLinesExtractor.
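
For readers following along, the proposed split can be sketched in plain PHP. This is only an illustration of line-by-line JSONL reading and writing, not the code merged in this PR: `read_json_lines()` and `write_json_lines()` are hypothetical helper names, and the sketch uses only the PHP standard library rather than Flow's extractor and loader classes.

```php
<?php

/**
 * Hypothetical helper, not Flow's JsonLinesExtractor: yields one decoded
 * value per non-empty line of a JSONL file.
 */
function read_json_lines(string $path): \Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open {$path}");
    }

    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // tolerate blank lines
            }
            // Each line is a complete JSON document; invalid JSON throws.
            yield json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    } finally {
        fclose($handle);
    }
}

/**
 * Hypothetical helper, not Flow's JsonLinesLoader: writes one JSON
 * document per line.
 */
function write_json_lines(string $path, iterable $rows): void
{
    $handle = fopen($path, 'wb');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open {$path}");
    }

    try {
        foreach ($rows as $row) {
            fwrite($handle, json_encode($row, JSON_THROW_ON_ERROR) . "\n");
        }
    } finally {
        fclose($handle);
    }
}
```

Because every line is a complete document, the reader never needs to hold more than one line in memory, which is why no streaming parser is required in this sketch.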


codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 88.50575% with 10 lines in your changes missing coverage. Please review.

Project coverage is 83.02%. Comparing base (0d06018) to head (edea446).
Report is 4 commits behind head on 1.x.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1447      +/-   ##
==========================================
+ Coverage   83.00%   83.02%   +0.02%     
==========================================
  Files         659      661       +2     
  Lines       17727    17805      +78     
==========================================
+ Hits        14714    14783      +69     
- Misses       3013     3022       +9     
| Components | Coverage Δ |
|---|---|
| etl | 85.75% <ø> (ø) |
| cli | 86.73% <ø> (ø) |
| lib-array-dot | 94.53% <ø> (ø) |
| lib-azure-sdk | 62.56% <ø> (ø) |
| lib-doctrine-dbal-bulk | 97.36% <ø> (ø) |
| lib-filesystem | 76.75% <ø> (ø) |
| lib-parquet | 84.33% <ø> (ø) |
| lib-parquet-viewer | 82.02% <ø> (ø) |
| lib-rdsl | 87.09% <ø> (ø) |
| lib-snappy | 91.16% <ø> (+0.46%) ⬆️ |
| bridge-filesystem-async-aws | 90.38% <ø> (ø) |
| bridge-filesystem-azure | 89.92% <ø> (ø) |
| bridge-monolog-http | 96.38% <ø> (ø) |
| symfony-http-foundation | 77.10% <ø> (ø) |
| adapter-chartjs | 86.45% <ø> (ø) |
| adapter-csv | 89.49% <ø> (ø) |
| adapter-doctrine | 88.54% <ø> (ø) |
| adapter-elasticsearch | 97.19% <ø> (ø) |
| adapter-google-sheet | 78.04% <ø> (ø) |
| adapter-http | 59.15% <ø> (ø) |
| adapter-json | 90.81% <88.50%> (-2.41%) ⬇️ |
| adapter-logger | 53.84% <ø> (ø) |
| adapter-meilisearch | 97.75% <ø> (ø) |
| adapter-parquet | 80.85% <ø> (ø) |
| adapter-text | 84.44% <ø> (ø) |
| adapter-xml | 83.15% <ø> (ø) |

@jmortlock
Contributor Author

> > For the tests I just replicated the existing ones. It might be cleaner to use a DataProvider so there is not so much duplication; if you want, I can switch it over.
>
> I actually don't mind that duplication.
>
> > As for the implementation, I thought it simpler to change the reading of the JSON lines to match what the original JSON behaviour does, that is, it iterates out one complete object at a time. When you use a pointer, it's assumed (as it is currently) that it's pointing at an array subentry.
>
> From my point of view, the pointer should either be skipped when we are working with JSONL, or we can extract separate JsonLinesExtractor and JsonLinesLoader classes.
>
> The only way to make the pointer work here, from my point of view, is to still read JSONL line by line and then apply the pointer to each line. However, the JsonMachine library that we use under the hood does not seem to support JSONL (though I only briefly checked their readme), so we might be better off making JSONL without support for pointers (they also pretty much solve the same problem, just in a different way).
>
> So the question is, how do you feel about extracting Json Lines logic to new classes?
>
> It would be something like this:
>
> * JsonLinesExtractor - `from_json_lines()`
> * JsonLinesLoader - `to_json_lines()`
>
> just without the pointer option on JsonLinesExtractor.

Yup, I like the sound of that, and in my head I was thinking along the same lines.
I'll push up an alternate PR with your recommendations and we can go from there.

@jmortlock
Contributor Author

Actually, I do remember why I thought a pointer might still be useful for JSONL: it could still save one level of transformation. For example, say you have a JSONL file:

{"data": [], "pagination": []}
{"data": [], "pagination": []}
{"data": [], "pagination": []}

You might just want to transform the "data" elements into your dataframe rows.

The main difference from a JSONMachine point of view is iterating over an object vs. an array, as mentioned above.

@norberttech
Member

> I'll push up an alternate PR with your recommendations and we can go from there.

You can even push it directly here, it would help us to keep the conversation history in one place in case anyone would like to know why we decided to separate them 😁

> You might just want to transform the "data" elements into your dataframe rows.
>
> The main difference from a JSONMachine point of view is iterating over an object vs. an array, as mentioned above.

Yeah, that's a good point. We can keep it but apply it independently on each row.
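
A rough sketch of what "apply it independently on each row" could look like, using the `{"data": ..., "pagination": ...}` shape from the example above. This is an assumption-laden illustration rather than the merged implementation: `resolve_pointer()` and `rows_from_json_lines()` are hypothetical names, the pointer syntax follows JSON Pointer (RFC 6901), the sample values are made up, and the code requires PHP 8.0 or later for the `mixed` type.

```php
<?php

/**
 * Hypothetical helper: resolves a JSON Pointer (RFC 6901) such as "/data"
 * against an already decoded value. Returns null when the path is missing.
 */
function resolve_pointer(mixed $document, string $pointer): mixed
{
    if ($pointer === '') {
        return $document; // the empty pointer refers to the whole document
    }

    foreach (explode('/', substr($pointer, 1)) as $token) {
        // Unescape per RFC 6901: "~1" is "/", then "~0" is "~".
        $token = str_replace(['~1', '~0'], ['/', '~'], $token);
        if (!is_array($document) || !array_key_exists($token, $document)) {
            return null;
        }
        $document = $document[$token];
    }

    return $document;
}

/**
 * Hypothetical helper: applies the pointer to every line on its own and
 * yields each element found there as a separate row.
 */
function rows_from_json_lines(iterable $lines, string $pointer): \Generator
{
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $decoded = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        $target = resolve_pointer($decoded, $pointer);
        if (!is_array($target)) {
            continue; // pointer missing on this line, or not pointing at an array
        }
        foreach ($target as $row) {
            yield $row;
        }
    }
}

// Made-up sample lines in the shape discussed above.
$lines = [
    '{"data": [{"id": 1}, {"id": 2}], "pagination": {"page": 1}}',
    '{"data": [{"id": 3}], "pagination": {"page": 2}}',
];

foreach (rows_from_json_lines($lines, '/data') as $row) {
    echo json_encode($row), PHP_EOL;
}
```

Each line is decoded and resolved on its own, so the pointer never spans lines and the "pagination" entries are simply skipped.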

@github-actions github-actions bot added size: L and removed size: M labels Feb 5, 2025
@jmortlock
Contributor Author

Alright, I've separated out the JSONLExtractor; see what you think. Once you're happy, I'll do the loader separation.

@norberttech
Member

> Alright, I've separated out the JSONLExtractor; see what you think. Once you're happy, I'll do the loader separation.

I like it a lot! I left a few minor comments, but nothing critical, just irrelevant details.

@github-actions github-actions bot added size: XL and removed size: L labels Feb 5, 2025
@jmortlock
Contributor Author

I think I have addressed all the feedback, also extracted out the JsonLinesLoader.

Thanks for the prompt feedback, it's handy.

@norberttech
Member

Thank you @jmortlock looks awesome!! 🤩

@norberttech norberttech merged commit 3e67c29 into flow-php:1.x Feb 5, 2025
21 of 22 checks passed

Successfully merging this pull request may close these issues.

Support reading from datasets saved as jsonl
2 participants