inner_join() reorders output #684
Comments
I see similar behavior with left_join--the ordering is not necessarily preserved nor are the levels themselves:
|
Duplicate of #730. |
Not duplicate, this is about data frames |
I confirm similar problems with I noticed this updating a Data Carpentry lesson that I last touched on August 8th. The new behaviour of |
For
That's what I expect. |
Right, about the ordering of In that case because This was added based on some advice that we should train based on the smaller data set. If we want to preserve ordering based invariably on the order of the second data set, then this supersedes the hack and we should just drop the corresponding branch.
|
@hadley should I just drop the The alternative is to ensure that |
@martinlenardon which version did you use? With the current devel version, I get:
|
@jennybc can you spare a reprex please ? |
That was dplyr_0.3.0.1, here's what I get after installing the latest from github:
So I still get the reduced factor levels. On the other hand, I have not updated my R version, so I can give that a try, too:
|
@martinlenardon well I get:
with
|
Yep. I get the same behavior after upgrading:
|
I will post a reprex later today. I stripped my real example way down and the factor level mangling went away (!). Then I re-ran actual example and problem persists. It will take a little work to find simplest version that demonstrates the undesirable behaviour with |
Here is my actual example as a Gist with an R script and a compiled Markdown notebook: https://gist.github.com/jennybc/bd1626735dd221c30f1a I am The output of If I get the time, I'll try to strip it down further. |
UPDATED. OK here is a much smaller example that demonstrates the re-alphabetizing of factor levels.

m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     n_days = c(31, 28, 30))
m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
                                  c("jan", "feb", "mar", "apr")),
                     has_equinox = c(FALSE, FALSE, TRUE, FALSE))
m_joined <- left_join(m_equi, m_days)
levels(m_days$mon)
levels(m_equi$mon)
levels(m_joined$mon)

Here's what that looks like when run:

> library(dplyr)
> m_days <- data_frame(mon = factor(c("jan", "feb", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      n_days = c(31, 28, 30))
> m_equi <- data_frame(mon = factor(c("jan", "feb", "mar", "apr"),
+                                   c("jan", "feb", "mar", "apr")),
+                      has_equinox = c(FALSE, FALSE, TRUE, FALSE))
> m_joined <- left_join(m_equi, m_days)
Joining by: "mon"
> levels(m_days$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_equi$mon)
[1] "jan" "feb" "mar" "apr"
> levels(m_joined$mon)
[1] "apr" "feb" "jan" "mar"
|
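Until the join itself preserves levels, one possible workaround is to re-impose the intended level order on the joined column. The following is only a sketch in base R: `mon_joined` is a hand-built stand-in for `m_joined$mon` with the alphabetized levels from the transcript, and `wanted` is an illustrative name for the level order of the inputs.

```r
# Stand-in for m_joined$mon: same values, but with the alphabetized
# levels reported in the transcript (an assumption for illustration).
mon_joined <- factor(c("jan", "feb", "mar", "apr"),
                     levels = c("apr", "feb", "jan", "mar"))

# The level order both input factors shared.
wanted <- c("jan", "feb", "mar", "apr")

# Rebuild the factor from its labels so the values stay put and
# only the level order changes.
mon_fixed <- factor(as.character(mon_joined), levels = wanted)

levels(mon_fixed)
as.character(mon_fixed)
```

Going through `as.character()` makes it explicit that the values are matched by label and not by the underlying integer codes.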
@jennybc oh hmmmm, that's a more complicated problem because you have two factors with different levels. Currently I think we're only treating factors as equivalent if the levels are exactly equal, but this suggests we should also consider factors to be equivalent if the levels of one are a subset of the other. (But if the levels aren't equal, you're supposed to get a string, not a factor (with a warning maybe?) so that's another bug) |
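The three cases mentioned here (levels exactly equal, levels of one contained in the other, anything else) can be told apart with a few lines of base R. This is a rough sketch only: `compare_levels()` is a hypothetical helper written for illustration and is not part of dplyr or a description of its internals.

```r
# Hypothetical helper (not a dplyr function): classify how the levels
# of two factors relate to each other.
compare_levels <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  lx <- levels(x)
  ly <- levels(y)
  if (identical(lx, ly)) {
    "identical"                      # same levels in the same order
  } else if (all(lx %in% ly) || all(ly %in% lx)) {
    "subset"                         # one set of levels contains the other
  } else {
    "different"                      # neither contains the other
  }
}

a <- factor(c("jan", "feb"), levels = c("jan", "feb", "mar", "apr"))
b <- factor(c("jan", "apr"), levels = c("jan", "feb", "mar", "apr"))
s <- factor(c("jan", "feb"), levels = c("jan", "feb"))
d <- factor("dec", levels = "dec")

compare_levels(a, b)
compare_levels(a, s)
compare_levels(a, d)
```

Note that in this sketch two factors with the same set of levels in a different order also fall into the "subset" case, since `identical()` is order-sensitive.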
Sorry @hadley my first post of the simpler example had a mistake -- the different levels. I then updated so they have the same levels and the problem still happens. I understand that all bets are off if the levels are different for the input factors. |
@hadley I did not realize that factors with different levels should produce a character vector. @jennybc I get:
OK, so the implementation I have so far is this:
So perhaps the second case should produce a character vector. I guess when the levels of one are a subset of the other's, they could indeed represent the same concept, so perhaps we should deal with that and, e.g., keep the ordering of the bigger one? |
This is the behaviour I expect:
I think it would be nice to have special behaviour if the levels of one are a subset of the other, but I don't think it's that important. (Just as long as there's a message so the user knows what they need to do to fix the problem). I don't think we should ever create a factor whose levels differ (in value or order) from those of the input. |
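The rule described here (keep the factor only when the levels match exactly, otherwise fall back to character and tell the user) could be applied by hand before joining. A minimal sketch, assuming a single key column: `harmonise_key()` is a hypothetical helper invented for this example, not dplyr's implementation, and the data reuses the month example from earlier in the thread with deliberately mismatched levels.

```r
# Hypothetical pre-join helper (not part of dplyr): if both key columns
# are factors whose levels differ, coerce both to character and say so.
harmonise_key <- function(x, y, key) {
  fx <- x[[key]]
  fy <- y[[key]]
  if (is.factor(fx) && is.factor(fy) && !identical(levels(fx), levels(fy))) {
    message("Factor levels of `", key, "` differ; coercing to character.")
    x[[key]] <- as.character(fx)
    y[[key]] <- as.character(fy)
  }
  list(x = x, y = y)
}

# m_days uses default (alphabetical) levels, so they differ from m_equi's.
m_days <- data.frame(mon = factor(c("jan", "feb", "apr")),
                     n_days = c(31, 28, 30))
m_equi <- data.frame(mon = factor(c("jan", "feb", "mar", "apr"),
                                  levels = c("jan", "feb", "mar", "apr")),
                     has_equinox = c(FALSE, FALSE, TRUE, FALSE))

h <- harmonise_key(m_equi, m_days, "mon")
class(h$x$mon)
class(h$y$mon)
```

The two harmonised tables can then be passed to the join, for example `left_join(h$x, h$y, by = "mon")`, so the key comes back as a character vector instead of a factor with surprising levels.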
I think it's all taken care of now. This simplified the code for factor/factor joins with different levels. Please reopen if I forgot something. |
`inner_join` no longer reorders (#684).
As discussed on StackOverflow
http://stackoverflow.com/questions/26279548/how-to-replace-lookup-based-on-row-names-when-using-dplyr
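For anyone on a version where `inner_join()` still reorders rows, one way to get the original order back is to carry a row index through the join and sort on it afterwards. This is a sketch with made-up data: `x`, `y`, `id`, and the temporary `.row` column are illustrative names, and the approach does not depend on how the join orders its output.

```r
library(dplyr)

# Made-up example tables; `x` is deliberately not sorted by `id`.
x <- data.frame(id = c("c", "a", "b"), val = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(id = c("a", "b", "c"), grp = c("A", "B", "C"),
                stringsAsFactors = FALSE)

joined <- x %>%
  mutate(.row = row_number()) %>%  # remember each row's original position
  inner_join(y, by = "id") %>%
  arrange(.row) %>%                # restore the order of `x`
  select(-.row)                    # drop the temporary index

joined$id
```

Comparing `joined$id` with `x$id` is a quick check that the rows of the left table kept their order.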