PERF: json_normalize, for basic use case #40035

smpurkis · 2021-02-24T23:10:37Z

closes PERF: json_normalize #15621
tests passed
Ensure all linting tests pass, see here for how to run them

Proposed speed up for very simple use cases using the pd.json_normalize function. E.g. pd.json_normalize(data)

The speed up can be seen in this example:

import datetime
from time import time

import pandas as pd

# example json data
data = {"hello": ["thisisatest", 999898, datetime.date.today()],
        "nest1": {"nest2": {"nest3": "nest3_value", "nest3_int": 3445}},
        "nest1_list": {"nest2": ["blah", 32423, 546456.876, 92030234]},
        "hello2": "string"}

hundred_thousand_rows = [data for i in range(100000)]

s = time()
pd.json_normalize(hundred_thousand_rows)
pandas_json_normalize_time_taken = time() - s
print(f"\npandas time taken for a 100,000 rows: {pandas_json_normalize_time_taken} seconds")

With output from Pandas 1.2.2: pandas time taken for a 100,000 rows: 3.0518009662628174 seconds
From this branch: pandas time taken for a 100,000 rows: 0.632451057434082 seconds
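For intuition about what the simple case has to compute, here is a minimal sketch of flattening nested dicts into dotted column names. It is not the code in this branch: `flatten_record` and its parameters are hypothetical names used only for illustration, and it assumes that only nested dicts are expanded while lists and scalars are kept as values.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], sep: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict with joined keys.

    Hypothetical helper for illustration; lists and scalars are kept as-is.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Build the joined key, e.g. "nest1" + "." + "nest2".
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys.
            flat.update(flatten_record(value, sep=sep, prefix=full_key))
        else:
            flat[full_key] = value
    return flat


row = {
    "hello2": "string",
    "nest1": {"nest2": {"nest3": "nest3_value", "nest3_int": 3445}},
    "nest1_list": {"nest2": ["blah", 32423]},
}
flat_row = flatten_record(row)
print(sorted(flat_row))
```

Each flattened row is a plain dict, so a list of them can be passed straight to a DataFrame constructor.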

To show the tests pass for the relevant file, I ran pytest pandas/tests/io/json/test_normalize.py -v (log: test_normalize.py_pytest.log).

To show pre-commit passes on the file, I ran pre-commit run --files pandas/io/json/_normalize.py (log: _normalize.py_pre-commit.log).

One check failed when running ./ci/code_checks.sh (log: code_checks.log):

pandas/io/json/_normalize.py:208: error: Incompatible types in assignment (expression has type "List[Union[List[Dict[Any, Any]], Dict[Any, Any]]]", variable has type "Dict[Any, Any]")  [assignment]

As it is a type hint issue, I decided to still open the pull request. Can you advise on how to fix it? I'm fairly new to type hints.

Kind regards,
Sam

smpurkis · 2021-02-24T23:15:56Z

Just remembered I forgot to update the changelog, whoops.

pandas/io/json/_normalize.py

jreback

is it possible to simply dispatch to this if the basic case is selected?

why is ordering not preserved?

do we have sufficient asv's for this? e.g. pls add the cases you are measuring.

smpurkis · 2021-02-25T14:06:15Z

@jreback

is it possible to simply dispatch to this if the basic case is selected?

If possible that would be best, but I am not familiar enough with the pandas codebase to know where to look. I have looked around pandas/io/json/ but can't find where the dispatch configuration is set up. Any advice on where to look or how to get started would be appreciated.
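One possible reading of "dispatch" here is a plain parameter check at the top of the public function rather than any separate configuration. The sketch below assumes that reading; `normalize`, `_fast_path` and `_general_path` are hypothetical stand-ins that return a label instead of doing real work, and only some of the real optional arguments are shown.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def _fast_path(records: List[Record], sep: str) -> str:
    # Stand-in for a specialised implementation that only handles defaults.
    return "fast"


def _general_path(records: List[Record], sep: str, **options: Any) -> str:
    # Stand-in for the existing, fully featured implementation.
    return "general"


def normalize(
    records: List[Record],
    record_path: Optional[List[str]] = None,
    meta: Optional[List[str]] = None,
    max_level: Optional[int] = None,
    sep: str = ".",
) -> str:
    """Use the fast path only when every option is left at its default."""
    if record_path is None and meta is None and max_level is None:
        return _fast_path(records, sep)
    return _general_path(
        records, sep, record_path=record_path, meta=meta, max_level=max_level
    )


print(normalize([{"a": 1}]))
print(normalize([{"a": 1}], max_level=1))
```

Checking each argument explicitly against `None` keeps the condition easy to audit when new parameters are added later.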

why is ordering not preserved?

Oh it is, I need to update that part of the comment.

do we have sufficient asv's for this? e.g. pls add the cases you are measuring.

I do, I will add those in at the next opportunity and think of a few more cases.

What would you recommend to fix the type hint issue I have?

pandas/io/json/_normalize.py:208: error: Incompatible types in assignment (expression has type "List[Union[List[Dict[Any, Any]], Dict[Any, Any]]]", variable has type "Dict[Any, Any]")  [assignment]
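A common cause of this kind of mypy error is reusing one name for both a single dict and a list of dicts. The log does not show what line 208 looks like, so the following is only a sketch of that general pattern and one way around it; `JSONRecord` and `as_record_list` are hypothetical names, not part of pandas.

```python
from typing import Any, Dict, List, Union

JSONRecord = Dict[str, Any]


def as_record_list(data: Union[JSONRecord, List[JSONRecord]]) -> List[JSONRecord]:
    """Always return a list of records, whether given one record or many."""
    # Writing `data = [data]` on a name that mypy treats as a dict triggers
    # "Incompatible types in assignment". Assigning to a new name with its
    # own annotation avoids the conflict.
    records: List[JSONRecord]
    if isinstance(data, dict):
        records = [data]
    else:
        records = list(data)
    return records


print(as_record_list({"a": 1}))
print(as_record_list([{"a": 1}, {"b": 2}]))
```

The alternative is to widen the annotation of the original variable to a `Union`, at the cost of needing `isinstance` checks wherever it is used later.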

pep8speaks · 2021-02-25T20:10:26Z

Hello @smpurkis! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-03 19:07:29 UTC

smpurkis · 2021-02-25T20:13:41Z

Added an asv benchmark, as there were none for the json_normalize function, although I can't run it locally as my machine isn't powerful enough.
I have also touched up a comment and a condition statement.
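For readers who have not written an asv benchmark before, this is roughly the shape one takes. It is a sketch, not the benchmark added in this PR: the class name `NormalizeSimpleJSON`, the `n_rows` parameter and the sample record are assumptions made for illustration. asv calls `setup` before timing and runs each `time_*` method once per value in `params`.

```python
import pandas as pd


class NormalizeSimpleJSON:
    # Hypothetical benchmark class; asv passes each value in `params`
    # to both `setup` and the `time_*` methods.
    params = [1_000, 10_000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # Built in setup so that constructing the input is not timed.
        record = {
            "hello": ["thisisatest", 999898],
            "nest1": {"nest2": {"nest3": "nest3_value", "nest3_int": 3445}},
            "hello2": "string",
        }
        self.data = [record for _ in range(n_rows)]

    def time_json_normalize_defaults(self, n_rows):
        # Only the call under test belongs in the timed method.
        pd.json_normalize(self.data)
```

Methods are discovered by their `time_` prefix, so the method name determines what shows up in the asv report.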

smpurkis · 2021-02-25T21:06:27Z

The asv benchmark is failing. I've not written any before and my laptop is too slow to run them.

smpurkis · 2021-02-26T10:32:40Z

asv benchmark should be working now

WillAyd · 2021-02-26T17:23:52Z

Can you run the related JSON benchmarks and post the output of them here?

smpurkis · 2021-02-26T18:18:40Z

Running

asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b json

gave this log:
asv-json-benchmark.log

jreback · 2021-02-26T23:59:35Z

so net effect is this

     <issue-15621-improve-json-normalize-perf>       <master>  
         320±2ms        285±0.5ms     0.89  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_floats')
        326±20ms          277±9ms     0.85  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_floats')

?

jreback · 2021-02-26T23:59:53Z

this doesn't seem to match your results.

smpurkis · 2021-02-27T00:10:31Z

this doesn't seem to match your results.

Can you please explain a bit more about how you are comparing them? The test is the same, but asv sets up its own environment, which surely would change the times.
Unless asv runs the same benchmark on pandas master, I'm not sure how the results are useful.

jreback · 2021-02-27T00:21:21Z

if you added an asv then we could see

this doesn't match your timings from the top (the ratio, not the absolute time)

smpurkis · 2021-02-27T00:36:19Z

Assuming asv is comparing my forked master to main master then my addition in asv might be wrong.
Will have another look, and do some reading up on asv

smpurkis · 2021-02-27T09:58:18Z

Found the issue, my checking of the parameters was incorrect. Reran the benchmark.

       before           after         ratio
     [a241cfc6]       [5959eaab]
     <issue-15621-improve-json-normalize-perf>       <master>  
-         317±1ms         65.5±2ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df_date_idx')
-       317±0.5ms       65.4±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_date_idx')
-         317±2ms       65.3±0.9ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df_td_int_ts')
-         316±2ms       65.2±0.5ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_date_idx')
-         317±1ms       65.4±0.4ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_int_floats')
-       316±0.9ms       65.1±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df')
-       315±0.8ms       64.9±0.1ms     0.21  io.json.NormalizeJSON.time_normalize_json('columns', 'df_td_int_ts')
-       316±0.4ms       65.0±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df')
-       316±0.6ms       64.9±0.4ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_td_int_ts')
-         316±2ms       64.9±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_int_floats')
-       316±0.6ms       64.8±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('records', 'df')
-         317±1ms       65.0±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('records', 'df_date_idx')
-         317±1ms       65.0±0.5ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_int_float_str')
-       317±0.4ms       65.1±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_floats')
-       316±0.7ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_floats')
-       316±0.2ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_int_floats')
-         315±1ms       64.5±0.3ms     0.20  io.json.NormalizeJSON.time_normalize_json('split', 'df_int_float_str')
-         318±1ms       64.9±0.1ms     0.20  io.json.NormalizeJSON.time_normalize_json('split', 'df')
-       317±0.9ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_date_idx')
-         317±1ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_float_str')
-         318±1ms       64.8±0.3ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_td_int_ts')
-       318±0.8ms       64.8±0.4ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_float_str')
-         319±5ms       65.1±0.5ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_int_float_str')
-         317±2ms       64.6±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df')
-         319±2ms       64.9±0.4ms     0.20  io.json.NormalizeJSON.time_normalize_json('index', 'df_td_int_ts')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
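As a quick check that the rerun agrees with the timings in the PR description, the ratio column can be reproduced by dividing the second timing column by the first. The sketch below uses only numbers quoted in this thread; `asv_ratio` is a hypothetical helper, not part of asv.

```python
def asv_ratio(first_ms: float, second_ms: float) -> float:
    """Ratio as shown in the tables above: second column over first column."""
    return round(second_ms / first_ms, 2)


# Script timing from the PR description, converted to milliseconds.
script_ratio = asv_ratio(3051.8, 632.45)
# First row of the rerun asv table.
rerun_ratio = asv_ratio(317.0, 65.5)
# First row of the earlier asv run that did not match.
earlier_ratio = asv_ratio(320.0, 285.0)

print(script_ratio, rerun_ratio, earlier_ratio)  # prints: 0.21 0.21 0.89
```

The rerun ratio of 0.21 lines up with the roughly fivefold speedup measured by the standalone script, while the earlier 0.89 did not.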

jreback

looks pretty good. can you add a whatsnew note in the 1.3 Perf section.

pandas/io/json/_normalize.py

smpurkis · 2021-02-28T00:15:02Z

I have added the whatsnew note and made the changes you advised.
The type hint issue originally picked up is still there; I haven't found a way to correct it.

pandas/io/json/_normalize.py

jreback · 2021-03-05T00:39:15Z

thanks @smpurkis very nice!

smpurkis added 5 commits February 24, 2021 21:19

[json_normalize] added fast json normalize function and the logic to …

03ccc22

…use it for simple use cases

[json_normalize] clean up code

9b25fff

[json_normalize] clean up code

e5b32d7

PERF [json_normalize] clean up added function for simple use cases

588aed2

PERF [json_normalize] fixed up the condition to check parameters

650f839

luckydenis reviewed Feb 25, 2021

jreback requested changes Feb 25, 2021

jreback added the IO JSON read_json, to_json, json_normalize label Feb 25, 2021

smpurkis added 3 commits February 25, 2021 19:11

PERF [json_normalize] refined parameter condition and main recursive …

64dfd7c

…function comment

PERF add ASV benchmark

5c21709

PERF [json_normalize] remove unnecessary pd

f859f71

PERF removed too many lines

6bbea70

PERF [asv] correct asv format, new benchmark

ce34172

corrected format

0c9a56b

PERF [json_normalize] fix if statement condition

5959eaa

PERF [json_normalize] reverted condition to original, as new is faili…

85d3e7a

…ng tests

jreback requested changes Feb 27, 2021

jreback added this to the 1.3 milestone Feb 27, 2021

jreback added the Performance Memory or execution speed performance label Feb 27, 2021

smpurkis added 4 commits February 27, 2021 23:32

PERF [json_normalize] reverted formatting done by my IDE, made parame…

1162e7e

…ter check explicit, moved nested functions to module level

PERF [json_normalize] minor formatting

0e01060

PERF [json_normalize] add more detail to comment

9db8af9

PERF [v1.3.0.rst] added note to whatsnew perf section

44779b9

jreback requested changes Mar 1, 2021

PERF [json normalize] added docstrings for top level functions and im…

5fcbe1d

…proved type hints

jreback approved these changes Mar 5, 2021

jreback merged commit c7d3e9b into pandas-dev:master Mar 5, 2021

OneMoreSecond mentioned this pull request Dec 23, 2021

BUG: pd.json_normalize doesn't work well if len(sep) != 1 #45021

Closed

3 tasks
