
When tolerance is specified, dedup() returns larger dataset than original, with many exact duplicates #27

Closed
rctatman opened this issue May 4, 2018 · 3 comments


rctatman commented May 4, 2018

I've run into a strange bug: when a tolerance is specified for dedup(), the number of rows returned is greater than the number of rows in the original dataset:

dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows

These additional rows are exact duplicates and can be removed with distinct(), but this seems to be unintended behavior.
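
For anyone hitting this before a fix, a minimal workaround sketch using dplyr::distinct() (assuming, as described above, that the extra rows are exact duplicates):

library(dplyr)

deduped <- iris %>%
  dedup(tolerance = 0.2) %>%
  distinct()  # drop the exact duplicate rows dedup() currently returns

dim(deduped)  # should be back at or below the original 150 rows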

@sckott sckott added the bug label May 8, 2018

sckott commented May 8, 2018

👋 @rctatman - sorry about the delay, I was on vacation and then the email notification sank down my inbox.

sckott added a commit that referenced this issue May 8, 2018
bump dev version
it's definitely a hacked solution, think of something better later

sckott commented May 8, 2018

Can you reinstall and try again?
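
A sketch of reinstalling the development version (the GitHub repo path below is a placeholder, not confirmed in this thread):

# remotes::install_github() is the standard way to install a dev version;
# "sckott/dedup" is a hypothetical repo path
install.packages("remotes")
remotes::install_github("sckott/dedup")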

@rctatman (Author)

Looks like it's fixed in version 0.1.3.9321! 👍
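
A quick way to double-check after reinstalling (the package name below is a placeholder; the expectation is simply that no tolerance value inflates the row count past the original 150):

packageVersion("dedup")  # placeholder package name; expect 0.1.3.9321 or later

# rerun the original examples; none should exceed iris's 150 rows
for (tol in c(0, 0.2, 0.4, 0.6, 0.8, 1)) {
  print(nrow(dedup(iris, tolerance = tol)))
}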

@sckott sckott added this to the v0.3 milestone May 14, 2018