Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: 35977 Adding regex support for ExtensionBlock replace method #36038

Closed
wants to merge 14 commits into from

Conversation

cgangwar11
Copy link
Contributor

Overrides replace method to add support of regex in ExtensionBlock . If input to_replace is a regex we convert data type of block to Object and calls object replace method which supports regex otherwise we call replace method of super class Block.

In [2]: b.replace(r'^\s*$', pd.NA, regex=True, inplace=True)
   ...: print(b)
   a     b
0  a     d
1  b  <NA>
2  c  <NA>

In [3]: b.replace(r'^\s*$', pd.NA, regex=True, inplace=False)
Out[3]: 
   a     b
0  a     d
1  b  <NA>
2  c  <NA>


@cgangwar11 cgangwar11 changed the title Gh35977 Bug: 35977 Adding regex support for ExtensionBlock replace method Sep 1, 2020
@cgangwar11 cgangwar11 changed the title Bug: 35977 Adding regex support for ExtensionBlock replace method BUG: 35977 Adding regex support for ExtensionBlock replace method Sep 1, 2020
@jreback
Copy link
Contributor

jreback commented Sep 2, 2020

pls merge master as we just merged a change in how replace works in internals.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation labels Sep 2, 2020
@cgangwar11
Copy link
Contributor Author

@jreback synced with upstream master also verified that this bug is still present in upstream master

@simonjayhawkins
Copy link
Member

@cgangwar11

Conflicting files
pandas/tests/arrays/categorical/test_replace.py

regex = validate_bool_kwarg(regex, "regex")
if regex:
dtype = self.values.dtype
blocks = self.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give this a non-plural name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgangwar11 can you address?

- Fixed Bug where :class:`DataFrame` column set to scalar extension type via a dict instantion was considered an object type rather than the extension type (:issue:`35965`)
- Fixed bug where ``astype()`` with equal dtype and ``copy=False`` would return a new object (:issue:`284881`)
- Fixed bug when applying a NumPy ufunc with multiple outputs to a :class:`pandas.arrays.IntegerArray` returning None (:issue:`36913`)
- Fixed bug in :meth:`Dataframe.replace` not working for ``regex=True`` and ``dtype='string'`` Now working as expected (:issue:`35977`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this to:

- Fixed bug in :meth:`DataFrame.replace` now returns correct result for `regex=True`` with ``string`` dtype (:issue:`35977`)

)
def test_replace_string(to_replace, value, input_data, expected_data):
# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this df

# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
expected_df = pd.DataFrame({"col1": expected_data}, dtype="string")
input_df = input_df.replace(to_replace, value, inplace=False, regex=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this result

def test_replace_string(to_replace, value, input_data, expected_data):
# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
expected_df = pd.DataFrame({"col1": expected_data}, dtype="string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this expected

("f", "replace", ["d", "ee", "f", ""], ["d", "ee", "replace", ""]),
],
)
def test_replace_string(to_replace, value, input_data, expected_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the idiom:

result = ...
expected = ...
tm.assert_frame_equal(result, frame)

)
def test_replace_regex(to_replace, value, input_data, expected_data, inplace):
# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this df

def test_replace_regex(to_replace, value, input_data, expected_data, inplace):
# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
expected_df = pd.DataFrame({"col1": expected_data}, dtype="string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this expected

# GH35977
input_df = pd.DataFrame({"col1": input_data}, dtype="string")
expected_df = pd.DataFrame({"col1": expected_data}, dtype="string")
converted = input_df.replace(to_replace, value, inplace=inplace, regex=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this df_replaced

],
)
def test_replace_regex(to_replace, value, input_data, expected_data, inplace):
# GH35977
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the idiom:

result = ...
expected = ...
tm.assert_frame_equal(result, frame)

regex = validate_bool_kwarg(regex, "regex")
if regex:
dtype = self.values.dtype
blocks = self.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgangwar11 can you address?

@arw2019
Copy link
Member

arw2019 commented Oct 30, 2020

Also will need to merge master

(r"f", "replace", ["d", "ee", "f", ""], ["d", "ee", "replace", ""], True),
],
)
def test_replace_regex(to_replace, value, input_data, expected_data, inplace):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is testing DataFrame.replace, these belong in tests.frame.methods.test_replace

@pep8speaks
Copy link

pep8speaks commented Nov 12, 2020

Hello @cgangwar11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-15 07:14:13 UTC

("f", "replace", ["d", "ee", "f", ""], ["d", "ee", "replace", ""]),
],
)
def test_replace_string(self, to_replace, value, input_data, expected_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parametrize these using frame_or_series

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this addressed?

for b in block.replace(to_replace, value, inplace, regex)
]
block.replace(to_replace, value, inplace, regex)
return block.astype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently implement Block._replace_regex. can you see if we can avoid the object-dtype cast (ideally even this whole method) by using that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace_regex is using np.vectorize function internally which gives an error when using not object dtype hence using replace_regex after typecasting.
f = np.vectorize(re_replacer, otypes=[values.dtype])

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgangwar11 what's the status of this? it sounds like you addressed most (possibly all) comments so if you merge master and move the whatsnew to 1.3 we'll take another look

One question I have is whether this affects the ExtensionDtype Series.replace case? If yes we'd like to test with series

Another question is whether we have asvs for this? (If not that doesn't have to be part of this PR IMO)

("f", "replace", ["d", "ee", "f", ""], ["d", "ee", "replace", ""]),
],
)
def test_replace_string(self, to_replace, value, input_data, expected_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this addressed?

@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 8, 2021
@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

closing as stale. if you want to continue, pls ping and can re-open.

@jreback jreback closed this Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation replace replace method Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: replacement works for object but not string dtype
7 participants