
When tolerance is specified, dedup() returns larger dataset than original, with many exact duplicates #27

Closed
rctatman opened this issue May 4, 2018 · 3 comments


rctatman commented May 4, 2018

I've run into a strange bug: when a tolerance is specified for dedup(), the number of rows returned is greater than the number of rows in the original dataset:

dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows

These additional rows are exact duplicates and can be removed with distinct(), but this seems to be unintended behavior.
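
For anyone hitting this before a fix, a minimal workaround sketch using dplyr::distinct() (assuming, as described above, that the extra rows are exact duplicates):

library(dplyr)

deduped <- iris %>%
  dedup(tolerance = 0.2) %>%
  distinct()  # drop the exact duplicate rows dedup() currently returns

dim(deduped)  # should be back at or below the original 150 rows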

@sckott sckott added the bug label May 8, 2018

sckott commented May 8, 2018

👋 @rctatman - sorry about the delay, I was on vacation and then the email notification sank down my inbox.

sckott added a commit that referenced this issue May 8, 2018
bump dev version
it's definitely a hacked solution, think of something better later

sckott commented May 8, 2018

Can you reinstall and try again?
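
A sketch of reinstalling the development version (the GitHub repo path below is a placeholder, not confirmed in this thread):

# remotes::install_github() is the standard way to install a dev version;
# "sckott/dedup" is a hypothetical repo path
install.packages("remotes")
remotes::install_github("sckott/dedup")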

@rctatman (Author)

Looks like it's fixed in version 0.1.3.9321! 👍
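
A quick way to double-check after reinstalling (the package name below is a placeholder; the expectation is simply that no tolerance value inflates the row count past the original 150):

packageVersion("dedup")  # placeholder package name; expect 0.1.3.9321 or later

# rerun the original examples; none should exceed iris's 150 rows
for (tol in c(0, 0.2, 0.4, 0.6, 0.8, 1)) {
  print(nrow(dedup(iris, tolerance = tol)))
}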

@sckott sckott added this to the v0.3 milestone May 14, 2018