Returning functions in place #76

szuckerman · 2018-12-02T13:02:52Z

After seeing issue #67, was curious as to what people think as to adding this capability to all functions?

Some of them, like df.limit_column_characters(), already return in place. I don't think it will be hard to extend to others.

The text was updated successfully, but these errors were encountered:

ericmjl · 2018-12-02T13:10:29Z

🤔 🤔 🤔

This one is a tough one for me to respond to.

On one hand, I can see why that might be desirable by some users who are used to the inplace=True kwarg.

On the other hand, I think adding it in as an option means an additional thing to check for when any new function is submitted in. Conceptually, it's "just one less thing" to maintain if we just leave that argument out.

With a fluent API, I don't think in-place modifications are a thing.

That said, I'm willing to be persuaded by a counter-example to my general slant against an "in-place" interface, if one can provide an example where in-place modification accomplishes something substantial that a method chaining interface cannot accomplish.

szuckerman · 2018-12-02T15:40:07Z

I’m actually more inclined have the default be to return functions in place for a few reasons:

It doesn’t change the method chaning syntax
If people want to apply separate functions outside of the initial janitor section it’s merely ‘df.add_column()’ rather than ‘df=df.add_column’. I think the former is cleaner.
Since Pandas is used for a variety of reasons, with EDA being a big one, it could be a big problem to change dataframes in place since one’s work could be inadvertently clobbered. My assumption is that the janitor functions should run once at the beginning and then later people will work with Pandas after the data is cleaned. If someone did clobber a dataframe, I would assume he would just “start over”

ericmjl · 2018-12-02T19:19:00Z

@szuckerman, let me see if I can clarify my understanding of the definition of in-place vs. method chaining.

Taking the example from the README, regular pandas syntax looks as follows:

df = pd.DataFrame(...)  # create a pandas DataFrame somehow.
del df['column1']  # delete a column from the dataframe.
df = df.dropna(subset=['column2', 'column3'])  # drop rows that have empty values in column 2 and 3.
df = df.rename({'column2': 'unicorns', 'column3': 'dragons'})  # rename column2 and column3
df['newcolumn'] = ['iterable', 'of', 'items']  # add a new column.

First off, with in-place modifications at each line, we do:

df = pd.DataFrame(...)
df.remove_column('column1', inplace=True)
df.dropna(subset=['column2', 'column3'], inplace=True)
df.add_column('newcolumn', ['iterable', 'of', 'items'], inplace=True)
df.rename({'column2': 'unicorns',  'column3': 'dragons'}, inplace=True)

To clarify, inplace=True at each function means there is no return statement. Each function's definition must then add a check for inplace=True:

def function_name(df, kwargs, inplace=False):
    # code above
    if not inplace:
        return df

With method chaining, the syntax looks as such:

df = (
    pd.DataFrame(...)
    .remove_column('column1')
    .dropna(subset=['column2', 'column3'])
    .add_column('newcolumn', ['iterable', 'of', 'items'])
    .rename({'column2': 'unicorns',  'column3': 'dragons'})
)

With method chaining, each function must return the modified dataframe.

Without inplace as an optional kwarg, each function need not specify a conditional checking for the inplace kwarg value. Maintenance is simpler, and it's just one less thing for a newcomer contributor to worry about. I would like to keep things as simple as possible for newcomers to make contributions.

Also, without an inplace=True kwarg, we can default to the verb-based, method chaining interface. As it stands, there are already 3 ways that we can use pyjanitor - the functional interface, the pipe interface, and the method chaining interface. I'm reluctant to add a fourth (I had previously removed a different interface, one based on inherited dataframes).

szuckerman · 2018-12-02T20:22:54Z

Ok, I hear what you’re saying, and agree that things should be kept simple to allow for more contributions.

That being the case, some of the functions already work in place.

For example, limit_column_characters does, and I think add_column does too (but I’m not near a computer at the moment to check).

Should functions be modified so they specifically do not work in place?

zbarry · 2018-12-02T20:29:05Z

@szuckerman yup, add_column is in place. I personally prefer that for my datasets, because I'm dealing with pretty big frames & don't want pandas to copy all the data every time it spits out a new DataFrame after each operation in a chain. You could imagine the case where you have many operations chained and get a massive performance degradation because of all the copying that may be useless. Could always do my_dataframe.copy().op1().op2()... if you don't want it to mutate the original DataFrame.

You can see here that pandas .assign function copies the data:
#49 (comment)

ericmjl · 2018-12-02T21:21:07Z

That being the case, some of the functions already work in place.

@szuckerman yes, that's a valid concern, especially for consistency purposes.

I think the hack that we've implicitly used, but not explicitly stated, is to piggy back off pandas operations. If modification takes place in-place anyways (to prevent memory copying), then it happens in-place; if not, then we don't really care. For a pareto-distributed set of users, 80% will have datasets that fit in-memory, while 20% will have @zbarry's kind of needs, where dataframes are too large for a single computer's RAM.

At the moment, I don't think we're necessarily at a point where too many people will complain about the memory-usage architecture of pyjanitor. Perhaps when a few related issues come together, then it might be an issue? (I will defer to your experience, @szuckerman, as the more seasoned software engineer here, as to whether we should enforce consistency w.r.t. memory usage at this early point in the project.)

zbarry · 2018-12-02T21:29:08Z

Should mention that it's not a fit in memory problem or usage really, though that's definitely relevant... it's more performance-wise. Copying all that data every time you hit an op in a chain is really inefficient in time. Since the whole concept of chaining here means that can't get the intermediate dataframes out for each step anyway, having them mutate the original frame each time doesn't actually matter (provided you copy the original one before you begin to chain if it's necessary to keep it).

Also thinking that after a decision is made w.r.t in place / not, it would be difficult to change it (in the case of going from not in place to in place, impossible to not be very disruptive to users' code, I think).

szuckerman · 2018-12-05T04:56:25Z

@zbarry makes a good point that "since you can't get the intermediate dataframes out for each step anyway" returning in place is more performant.

However, after doing some research it doesn't look like returning objects in place makes too much of a difference. I think the method-chaining syntax saves most people from the concern of making mistakes (i.e. by accidentally running a method that works in place by default), as given as one of the examples on stackoverflow.

As for future considerations of the package, I think that to really determine whether methods should default to being in place or not depends on "usability". Will users be confused when dataframes are mutated in place? Later we can run tests, profile memory, etc. to determine if there's a difference between a method and its in place counterpart.

zbarry · 2018-12-05T15:31:53Z

Yeah, usability is probably core to the idea of pyjanitor even. That being said, I think having a (short) paragraph in the readme + in the notebook example I will (eventually) create as well as having a clear boilerplate message in all the function docs would convey this pretty well. If you make it such that all functions that are relevant w.r.t. this discussion operate on the same principle of in place, I think this gives a clear expection to users as to what approach pyjanitor takes for operations on DataFrames (that being that they should expect mutation).

Because I'm procrastinating, I ran a simple speed test for adding a column to a DataFrame with pandas' .assign vs. using the df[my_col] = data explicitly (which is what pyjanitors .add_column does): [edit: num_bytes is mislabeled and is the length of the array, instead. whoops]

The slowdown with the data copy in this particular operation is a pretty consistent 1.5x. This could definetely add up when users chain multiple operations (I think it's possible that some ops could have even worse slowdown multipliers).

zbarry · 2018-12-05T15:44:38Z

Slowdown multiplier when adding to a DataFrame that has multiple columns already is pretty horrible:

ericmjl · 2018-12-05T16:07:10Z

@zbarry those are great tests. Yes, I think it's worth adding to the docstrings (but without user warnings popping up - those are too easily abused) that for the relevant functions, underneath the hood, we do change the original dataframe in place for performance reasons.

zbarry · 2018-12-05T16:10:18Z

without user warnings popping up

Hmm... I was thinking of adding a pyqt5 dependency & popping up a message box warning you upon pyjanitor import of the mutation operations. Shame.

zbarry · 2018-12-08T20:05:03Z

Random idea: what about having colorful badges in the docs that you can see at a glance whether there is mutation / something else critical to know that a function does to your dataframe?

Like:

kind of stuff but with appropriate labels.

jcmkk3 · 2018-12-16T16:24:56Z

Not sure if anyone is following Pandas development, but there is currently a plan to deprecate the inplace parameter. That may be worth taking into consideration for this discussion.

pandas-dev/pandas#16529

ericmjl added the question Further information is requested label Dec 4, 2018

ericmjl added help wanted Extra attention is needed docfix Documentation fixes needed labels Dec 5, 2018

zbarry mentioned this issue Dec 9, 2018

Inconsistencies in original-dataframe mutation #79

Open

ericmjl mentioned this issue Dec 9, 2018

Select_Columns Function Added Suggestion... #77

Closed

asmirnov69 mentioned this issue Mar 21, 2023

document the arguments in favor of 'set_columns' inplace method pyjanitor-devs/pyjviz#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Returning functions in place #76

Returning functions in place #76

szuckerman commented Dec 2, 2018

ericmjl commented Dec 2, 2018 •

edited

Loading

szuckerman commented Dec 2, 2018 •

edited

Loading

ericmjl commented Dec 2, 2018 •

edited

Loading

szuckerman commented Dec 2, 2018

zbarry commented Dec 2, 2018

ericmjl commented Dec 2, 2018

zbarry commented Dec 2, 2018 •

edited

Loading

szuckerman commented Dec 5, 2018

zbarry commented Dec 5, 2018 •

edited

Loading

zbarry commented Dec 5, 2018

ericmjl commented Dec 5, 2018

zbarry commented Dec 5, 2018

zbarry commented Dec 8, 2018

jcmkk3 commented Dec 16, 2018

Returning functions in place #76

Returning functions in place #76

Comments

szuckerman commented Dec 2, 2018

ericmjl commented Dec 2, 2018 • edited Loading

szuckerman commented Dec 2, 2018 • edited Loading

ericmjl commented Dec 2, 2018 • edited Loading

szuckerman commented Dec 2, 2018

zbarry commented Dec 2, 2018

ericmjl commented Dec 2, 2018

zbarry commented Dec 2, 2018 • edited Loading

szuckerman commented Dec 5, 2018

zbarry commented Dec 5, 2018 • edited Loading

zbarry commented Dec 5, 2018

ericmjl commented Dec 5, 2018

zbarry commented Dec 5, 2018

zbarry commented Dec 8, 2018

jcmkk3 commented Dec 16, 2018

ericmjl commented Dec 2, 2018 •

edited

Loading

szuckerman commented Dec 2, 2018 •

edited

Loading

ericmjl commented Dec 2, 2018 •

edited

Loading

zbarry commented Dec 2, 2018 •

edited

Loading

zbarry commented Dec 5, 2018 •

edited

Loading