Some feedback #1

Open
Dr-Irv opened this issue Dec 18, 2020 · 1 comment

Dr-Irv commented Dec 18, 2020

I saw your note to the pandas-dev list and have some feedback:

  1. I don't agree with your statement about querying data. The .query method is easier to read and understand. Using the pandas expressions can be difficult to parse when you have complex expressions and long DataFrame names, for example:
my_big_dataframe[(my_big_dataframe['order_date'] >= "20201001")
                 & (my_big_dataframe['order_date'] <= "20201031")
                 & (my_big_dataframe['customer'] == "Apple")]

versus

my_big_dataframe.query('order_date >= "20201001" and order_date <= "20201031" and customer == "Apple"')

In addition, query strings can be built dynamically, for example with string formatting.

If you believe otherwise, could you add text explaining why you prefer not to use .query?
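As a rough illustration of the "dynamically formatted" point, here is a minimal sketch. The toy data and the helper name `orders_between` are hypothetical, not from the style guide. It fills in the column name with an f-string and passes the values through `@` references to local variables, which `DataFrame.query` supports.

```python
import pandas as pd

# Hypothetical toy data; column names follow the example above.
orders = pd.DataFrame({
    "order_date": ["20200915", "20201005", "20201020", "20201101"],
    "customer": ["Apple", "Apple", "Google", "Apple"],
})


def orders_between(df, date_column, start, end, customer):
    """Filter df with a query string assembled at run time (illustrative helper)."""
    # The column name is formatted into the string; `@name` refers to a
    # local Python variable, so the values need no manual quoting.
    expression = (
        f"{date_column} >= @start and {date_column} <= @end "
        "and customer == @customer"
    )
    return df.query(expression)


october_apple = orders_between(orders, "order_date", "20201001", "20201031", "Apple")
print(october_apple)
```

Formatting values directly into the string also works, but string values then need their own quotes inside the expression, which is why this sketch prefers `@` references.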

  2. You might want to start using the new nullable types (String, Int64, etc.) and pd.NA in your examples.
  3. In the "column selection" section, one advantage of using something like df.column is that if you are in a notebook, you can get autocompletion, which can help with long column names. But your point that not all names work this way is also correct.
  4. You might want to take a look at Tom Augspurger's "Modern Pandas" for more ideas: https://tomaugspurger.github.io/modern-1-intro.html
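For the nullable types mentioned in the list above, a small sketch of how they differ from the default inference may help. The series names and values are made up for illustration.

```python
import pandas as pd

# Nullable extension dtypes keep integers as integers and mark
# missing entries with pd.NA.
ages = pd.Series([31, None, 47], dtype="Int64")
names = pd.Series(["Ann", None, "Cy"], dtype="string")

# For comparison: default inference upcasts the same values to float64 with NaN.
ages_default = pd.Series([31, None, 47])

print(ages.dtype, ages_default.dtype)  # Int64 float64
print(ages.iloc[1] is pd.NA)           # True
print(names.isna().tolist())           # [False, True, False]
```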

Hope this helps.

Owner

joshlk commented Dec 28, 2020

Hi @Dr-Irv, thank you for reviewing the document - much appreciated 😃. Here are my replies to your comments:

  1. Querying data: I agree that .query can be better when dataframe names are long. However, you can easily use a shorthand variable name in these circumstances e.g.
df = very_long_dataframe_name
even_longer_dataframe_name = df[df['col1'] > df['col2']]

Using Python expressions also encourages naming and reusing query expressions, for example:

df = car_ownership_2010_to_2020
has_car = df['driving_car'] & (df['miles_drived'] > 0) & (df['age'] > 15)
is_insured = df['is_insured'] & df['no_claim_bonus_2020']
drivers_mask = has_car & is_insured

drivers = df[drivers_mask]
non_drivers = df[~drivers_mask]
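To make the reuse argument concrete, the following sketch runs the same masks on a small, invented DataFrame. The data is hypothetical; the column names are copied from the snippet above. The last selection, `uninsured_car_users`, is an added illustration of recombining named masks.

```python
import pandas as pd

# Hypothetical toy data using the column names from the snippet above.
df = pd.DataFrame({
    "driving_car": [True, True, False, True],
    "miles_drived": [1200, 300, 0, 800],
    "age": [34, 52, 29, 15],
    "is_insured": [True, True, False, True],
    "no_claim_bonus_2020": [True, False, False, True],
})

# Each mask is a named boolean Series that can be combined later.
has_car = df["driving_car"] & (df["miles_drived"] > 0) & (df["age"] > 15)
is_insured = df["is_insured"] & df["no_claim_bonus_2020"]
drivers_mask = has_car & is_insured

drivers = df[drivers_mask]
non_drivers = df[~drivers_mask]

# The same building blocks can be recombined for a different question.
uninsured_car_users = df[has_car & ~is_insured]
```

Because the masks are ordinary Series, `drivers` and `non_drivers` always partition the rows of `df` between them.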

Would you agree?

  2. Null types: yes, good point. Thanks.
  3. Column selection: I agree this feature is useful in notebooks, as it provides autocompletion. However, this style guide is intended for production code, which is more likely to be written in plain Python files, where this advantage is lost.
  4. Thanks for the link. I've added a section about "Avoid chained indexing" prompted by these blog posts.
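On the column selection point, a short sketch of where attribute access falls short. The column names here are invented for illustration: one clashes with a DataFrame method and one is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical columns chosen to show the two failure modes.
df = pd.DataFrame({"count": [3, 5], "unit price": [1.5, 2.0], "sku": ["a", "b"]})

# Attribute access works for simple names.
print(df.sku.tolist())

# `df.count` resolves to the DataFrame.count method, not the column.
print(callable(df.count))

# Bracket access returns the column regardless of its name.
print(df["count"].tolist())
print(df["unit price"].tolist())
```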

To track the changes inspired by your comments, I have collected them in a separate PR: #2.

Thanks again for your contribution,
Josh
