Some feedback #1

Dr-Irv · 2020-12-18T19:53:40Z

I saw your note to the pandas-dev list and have some feedback:

I don't agree with your statement about querying data. The .query method is easier to read and understand. Using the pandas expressions can be difficult to parse when you have complex expressions and long DataFrame names, for example:

my_big_dataframe[(my_big_dataframe['order_date'] >= "20201001")
                 & (my_big_dataframe['order_date'] <= "20201031")
                 & (my_big_dataframe['customer'] == "Apple")]

versus

my_big_dataframe.query('order_date >= "20201001" and order_date <= "20201031" and customer == "Apple"')

In addition, "query" statements can be dynamically formatted.
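As a rough illustration of what "dynamically formatted" could mean here, the sketch below shows two common patterns: referencing local variables with the `@` prefix, and building the expression string at runtime. The sample data and the helper names `orders_between` and `in_range` are invented for this example and are not part of the style guide.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
orders = pd.DataFrame(
    {
        "order_date": ["20200915", "20201005", "20201020", "20201101"],
        "customer": ["Apple", "Apple", "Acme", "Apple"],
    }
)


def orders_between(df, start, end, customer):
    """Filter using local variables, referenced in the query with '@'."""
    return df.query(
        "order_date >= @start and order_date <= @end and customer == @customer"
    )


def in_range(df, column, low, high):
    """Build the query string at runtime so the column name can vary."""
    expression = f"{column} >= @low and {column} <= @high"
    return df.query(expression)


october_apple = orders_between(orders, "20201001", "20201031", "Apple")
october_all = in_range(orders, "order_date", "20201001", "20201031")
```

Passing values through `@` rather than pasting them into the string avoids quoting problems; only the column name is interpolated in the second helper.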

If you believe otherwise, could you add text explaining why you don't prefer .query?

You might want to start using the new nullable types (string, Int64, etc.) and pd.NA in your examples.
In the "column selection" section, one advantage of using something like df.column is that if you are in a notebook, you can get autocompletion, which can help with long column names. But your point that all names might not work is also correct.
You might want to take a look at Tom Augspurger's "Modern Pandas" for more ideas: https://tomaugspurger.github.io/modern-1-intro.html
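For readers unfamiliar with the nullable types mentioned above, here is a minimal sketch of how they behave with missing values. The series contents are invented for illustration.

```python
import pandas as pd

# Hypothetical data with missing values, invented for illustration.
ages = pd.Series([31, None, 47], dtype="Int64")           # nullable integer
names = pd.Series(["Ann", None, "Raj"], dtype="string")   # nullable string

# The integer column keeps an integer dtype instead of falling back to float64.
ages_dtype = str(ages.dtype)

# Missing entries are represented by pd.NA.
missing_age = ages[1]
missing_names = int(names.isna().sum())

# Comparisons propagate pd.NA, giving a nullable "boolean" result.
over_40 = ages > 40

# Decide explicitly how missing values should be treated before filtering.
known_over_40 = ages[over_40.fillna(False)]
```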

Hope this helps.


joshlk · 2020-12-28T12:51:38Z

Hi @Dr-Irv, thank you for reviewing the document - much appreciated 😃. Here are my replies to your comments:

Querying data: I agree that .query can be better when dataframe names are long. However, you can easily use a shorthand variable name in these circumstances e.g.

df = very_long_dataframe_name
even_longer_dataframe_name = df[df['col1'] > df['col2']]

Using Python expressions also encourages naming and reusing query expressions, for example:

df = car_ownership_2010_to_2020
has_car = df['driving_car'] & (df['miles_drived'] > 0) & (df['age'] > 15)
is_insured = df['is_insured'] & df['no_claim_bonus_2020']
drivers_mask = has_car & is_insured

drivers = df[drivers_mask]
non_drivers = df[~drivers_mask]

Would you agree?
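To make the comparison concrete, here is a runnable sketch of the named-mask approach next to an equivalent .query call. The sample rows are made up for illustration; the column names follow the example above.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration.
df = pd.DataFrame(
    {
        "driving_car": [True, True, False, True],
        "miles_drived": [1200, 0, 0, 800],
        "age": [34, 52, 29, 15],
        "is_insured": [True, True, False, True],
        "no_claim_bonus_2020": [True, False, False, True],
    }
)

# Named boolean masks can be combined and reused.
has_car = df["driving_car"] & (df["miles_drived"] > 0) & (df["age"] > 15)
is_insured = df["is_insured"] & df["no_claim_bonus_2020"]
drivers_mask = has_car & is_insured

drivers = df[drivers_mask]
non_drivers = df[~drivers_mask]  # the complement reuses the same mask

# The same filter written as a single query string.
drivers_via_query = df.query(
    "driving_car and miles_drived > 0 and age > 15 "
    "and is_insured and no_claim_bonus_2020"
)
```

With masks, the complement is a one-character change; with .query, the negated condition would have to be written out again.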

Null types: Yes, good point. Thanks.
Column selection: I agree this feature is useful in notebooks, as it provides autocompletion. However, this style guide is intended for production code, which is more likely to be written in plain Python files, where this advantage is lost.
Thanks for the link. I've added a section about "Avoid chained indexing" prompted by these blog posts.
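On the column selection point, a small sketch of the cases where attribute access does not behave like bracket access. The column names are invented to trigger the problem.

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute access breaks down.
df = pd.DataFrame({"price": [10, 20], "count": [1, 2], "unit cost": [5, 8]})

# For a simple name, both styles return the same column.
same_column = df.price.equals(df["price"])

# "count" clashes with the DataFrame.count method, so df.count is the method,
# not the column.
count_is_method = callable(df.count)
count_column = df["count"]

# "unit cost" is not a valid Python identifier, so it cannot be written as
# df.<name>; bracket access is the only option.
unit_cost = df["unit cost"]
```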

To track the changes inspired by your comments, I have added them to a separate PR: #2.

Thanks again for your contribution,
Josh

joshlk mentioned this issue Dec 28, 2020

Changes inspired by Dr-Irv #2

Open
