Adding marginal plots for grouped data #61

Closed
kassambara opened this issue Jun 16, 2017 · 25 comments

Comments

@kassambara

Hi Dean,

Thank you for your work in ggExtra package, which makes it really easy to add marginal plots to ggplots.

It would be highly appreciated if marginal plots could be added for grouped data, as illustrated here and here

Suggestion: Improve ggMarginal() so that it reacts to the mapping arguments.

sp <- ggplot(iris, aes(Sepal.Length, Sepal.Width))+
  geom_point(aes(color = Species))

ggMarginal(sp, data = iris, mapping = aes(color = Species) )

[Image: scatter plot with marginal plots for grouped data]

Best regards,

@daattali
Owner

You are right, this would be useful to others as well. I unfortunately will probably not have time to look into this feature myself, but I would be happy to accept a pull request if someone wants to take the lead on this feature.

@crew102
Contributor

crew102 commented Jun 16, 2017

This would be do-able...It would be a little awkward given that we currently use geom_line for creating the density plots, which we would have to move over to geom_density so that we could fill the distributions with color (i.e., specify a fill param).

I actually think the API suggested by @kassambara is good. I.e., the call would look something like:

p <- ggplot(data = mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear))
ggMarginal(p = p, margMapping = aes(colour = gear))

I think we'll want to require that the user specifies a color or fill mapping for the scatterplot if they also specify one for the marginal plots. We could rely on the xParams and yParams arguments for passing in alpha values of the filled marginal plots, too.

I'll take a stab at it sometime next week. @daattali , we should think about submitting a new version to CRAN after this as well, no?

@daattali
Owner

Yep, I already emailed the authors of packages using ggextra and told them about an upcoming cran release and to check the package for any regression bugs. We're good to go for CRAN. If you're thinking to have a go at this within the next few weeks then the cran release can wait for that.

API: is the idea that the user can also specify a different mapping than the one in the plot? And would using the aes() function be required? In ggplot2, aes() is needed because without it the value is taken literally rather than as a mapping. Would that be needed here as well?

@crew102
Contributor

crew102 commented Jun 16, 2017

> If you're thinking to have a go at this within the next few weeks then the cran release can wait for that.

Yeah, let's wait until I take a stab at implementing this feature

> API: is the idea that the user can also specify a different mapping than the one in the plot?

Technically, yes, but the mapping should use the same variable. For example, this would be OK:

p <- ggplot(data = mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear))
ggMarginal(p = p, margMapping = aes(fill = gear))

But we would not be supporting this:

p <- ggplot(data = mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear))
ggMarginal(p = p, margMapping = aes(colour = cyl))

> And would using the aes() function be required?

We wouldn't have to use aes() . I was planning on parsing the aes call and going from there, instead of using it directly (so it would actually be easier to not use it). Do you think using aes would be confusing, given that we won't actually be making a call to it? I was actually originally thinking we should do something like this:

ggMarginal(p = p, margMapping = list(colour= cyl))

But I came around on the use of aes because we are doing something conceptually similar to using aes directly.

@daattali
Owner

I think if we're not actually using aes() then we shouldn't require the user to use it because they might assume that they can write anything that works for aes() in there. Just like ggMarginal() already has x and y params that accept a variable name, and it's not wrapped in aes().

Would there be a technical limitation or any extra code to make something like

p <- ggplot(data = mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear))
ggMarginal(p = p, margMapping = list(colour = cyl))

work? From an implementation point of view, does it matter that the grouping in the plot and in the margin is not the same?

@crew102
Contributor

crew102 commented Jun 16, 2017

There would be two things that would make it awkward/more difficult if we tried to allow that:

  1. We would have to find a place to put the second legend
  2. We would have to add another param for manually mapping the colors of cyl to whatever the user wants to use. If we just have colour = gear in the call to ggMarginal, ggplot figures out the colors from any potential call to, say, scale_color_manual and puts them in the dataframe that we are using.
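
The second point can be illustrated with a minimal sketch. It assumes only ggplot2; the colour values and the `mtcars2` and `built` names are made up for the illustration and are not taken from ggExtra's code. After the plot is built, the layer data already carries the colours resolved by `scale_colour_manual()`:

```r
library(ggplot2)

# gear must be discrete for a manual colour scale
mtcars2 <- transform(mtcars, gear = factor(gear))

p <- ggplot(mtcars2, aes(x = mpg, y = wt, colour = gear)) +
  geom_point() +
  scale_colour_manual(values = c("3" = "red", "4" = "blue", "5" = "darkgreen"))

# layer_data() builds the plot and returns the data frame used to draw layer 1;
# its colour column holds the scale-resolved colour for each point
built <- layer_data(p, 1)
unique(built$colour)
```

Because the resolved colours travel with the built data, a marginal plot grouped on the same variable needs no separate colour specification.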

@daattali
Owner

Good point re: legend.

Would the only allowed values be "colour" and "fill", or would it allow any kind of mapping? And what exactly would the enforcement on the variables be - would it only allow variables that already have some mapping in the original plot?

@crew102
Contributor

crew102 commented Jun 16, 2017

I think the only relevant values for this would be colour or fill...Can you think of any others? The enforcement would basically just check that the variable specified in margMapping is mapped to either colour or fill in the scatter plot. Also note that we wouldn't be supporting the data param for this feature (i.e., if the user wants to use margMapping, they have to pass in p instead of passing in data, x, and y).

@daattali
Owner

daattali commented Jun 16, 2017

If it's just colour and fill, then it feels wrong to me to have a parameter that claims to take a list of mappings when there are only 2 allowed elements.

What do you think of one of these two options instead? Which would be best for end users?

  • Having two params like colourGroup and fillGroup (these might be terrible names - maybe colourVar and fillVar? I'm bad with naming things)
  • Simply having a single boolean param such as marginalGroup = TRUE/FALSE with FALSE as default. When TRUE, the colour and fill mappings that exist in the original plot get copied over to the marginal plot. This sounds like it could be simpler code and simpler for users perhaps?
  • (option 3: what you were suggesting above)

Let me know your thoughts.

@crew102
Contributor

crew102 commented Jun 16, 2017

My first instinct was to do a combo of choices 1 and 2, so something like:

ggMarginal(p = p, marginalGroup = list(colourGroup = TRUE, colourAlpha = .4, fillGroup = FALSE, fillAlpha = NA))

With the reason being that, I think people will want to use different values for the alpha of the points vs the fill of the distributions. I don't have any strong feelings for whether we just have one marginalGroup argument (which would be a list with 4 elements) or two arguments (colourGroup and fillGroup, each with 2 elements). I think it's going to be awkward any way we do it, to be honest. What do you think is the most intuitive for users?

@crew102
Contributor

crew102 commented Jun 16, 2017

Nvm, I forgot what I was planning to do for alpha, which was to just suggest that users specify it in the xParams or yParams argument...So I guess your option 2 would also work....I think I actually like that option the most, come to think of it!

But we should separate colour and fill...So either a single marginalGroup argument which takes a list of two bools, or two arguments (colourGroup and fillGroup, both of which take a single bool)

@daattali
Owner

I don't follow the whole alpha thing. Why is alpha needed for the marginal plots? I think alpha should always be 1 for the marginal density/histogram.

In the marginal plot, would it make sense to have mappings for both colour and fill into different variables? I don't even know what that would look like

@crew102
Contributor

crew102 commented Jun 17, 2017

Alpha is needed (at least for fill) for the marginal plots because alpha = 1 will result in you not being able to see the distributions when they overlap. For example, in the example that kassambara posted, you get to see what the distributions look like across their entire support, even when there is another distribution that is overlapping. So we would want to set a default value for alpha somewhere around .5, I think.

I think we should just allow one variable to be mapped to fill or colour (or potentially both)...Using two different variables in the marginal map would bring up the two issues I mentioned above (e.g., adding an extra legend).

@daattali
Owner

Right, alpha <1 definitely needed. But let's just fix it at a value, doesn't need to be customized. You're right.

My second question was: would both colour AND fill be able to get a mapping? What would it look like when they both are used?

@crew102
Contributor

crew102 commented Jun 17, 2017

I think we should allow users to specify the alpha level, given that it will be difficult to choose a default that looks good for all different scenarios (i.e., many vs few groups, lighter vs darker cols, etc.).

Regarding your second question, that's what I thought you meant...We could potentially map a single variable to both fill and colour (but again, there would be no support for two different variables mapped to fill and colour). When you specify a fill param but no colour, the distribution(s) is outlined in black:

library(ggplot2)
mtcars$gear <- as.factor(mtcars$gear)
ggplot(data = mtcars) + 
  geom_density(aes(x = mpg, fill = gear), alpha = .3)

[Image: rplot, density of mpg filled by gear, outlined in black]

When you specify colour as well, the outline shares the same colour as the fill, and you only get one legend (at least for the current version of ggplot2 that I'm at):

ggplot(data = mtcars) + 
  geom_density(aes(x = mpg, fill = gear, colour = gear), alpha = .3)

[Image: rplot01, density of mpg with fill and outline both coloured by gear]

I just checked out the case for histograms, and fill looks pretty bad. It's too difficult to tell which bins refer to which groups:

ggplot(data = mtcars) + 
  geom_histogram(aes(x = mpg, fill = gear), alpha = .3, 
                 position = position_identity(), bins = 10)

[Image: rplot02, overlapping histograms of mpg filled by gear]

The case for boxplot looks reasonable, though:

ggplot(data = mtcars) + 
  geom_boxplot(aes(x = mpg, y = mpg, fill = gear, colour = gear), alpha = .3, 
                 position = position_identity())

[Image: rplot03, box plots of mpg filled and coloured by gear]

I think we should support all three but just suggest that the user choose type to be either density or boxplot when they want to specify a marginal mapping.

@kassambara
Author

I think that fixing the default alpha = 0.5 is a good option. Having the possibility to use colourGroup = TRUE and/or fillGroup = TRUE would also be appreciated.

You might have also noted that, when type = "boxplot", the color/fill variable should be used as the x axis variable in the marginal box plot.

[Image: rplot02, marginal box plots with the colour/fill variable on the x axis]

Thank you :-)!

@kassambara
Author

I'm wondering whether it wouldn't be better if the final format of ggMarginal looked like this:

# Basic usage
ggMarginal(p)

# Grouped data
# (Only) color by groups
ggMarginal(p, colourGroup = TRUE)

# or 
# (Only) fill by groups
ggMarginal(p, fillGroup = TRUE, alpha = 0.5)

# or
# color and fill by groups
ggMarginal(p, colourGroup = TRUE, fillGroup = TRUE, alpha = 0.5)

Instead of this (more typing):

# Basic usage
ggMarginal(p)

# Grouped data
ggMarginal(p, margMapping = list(colourGroup = TRUE))
# or 
ggMarginal(p, margMapping = list(fillGroup = TRUE, alpha = 0.5))
# or
ggMarginal(p, margMapping = list(colourGroup = TRUE, fillGroup = TRUE, alpha = 0.5))

@daattali
Owner

@kassambara thank you for your input

@crew102 and I are discussing this, and it seems like the likely API will indeed be without a list.

A few more items we agreed on:

  • alpha will default to 1, as there already is an implicit alpha in every ggMarginal call. You can pass alpha into the ... argument, and it should also work for the case of grouped data. The documentation for this feature should make it clear that the user may find it useful to explicitly set alpha to a different fraction, but it does not need to be an enumerated argument
  • since this feature will allow grouped data to have a "fill" colour for density plots, we should also add support for "fill" in non-grouped data (currently, "fill" is not supported in density plots)

We did not settle on whether the colourGroup/fillGroup will be boolean flags or the name of a variable, though leaning towards the former. Need to ensure whatever we choose is not too restrictive and will support these scenarios:

  • ggplot(mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear)) with fillGroup based on gear (even though original plot doesn't have a fill)
  • ggplot(mtcars) + geom_point(aes(x = mpg, y = wt, colour = gear, fill = carb)) with both fillGroup and colourGroup based on gear (even though original plot has a different fill from colour)

@daattali
Owner

@crew102 I think we left this unresolved - do you have time/would like to come back to this?

@crew102
Contributor

crew102 commented Sep 13, 2017

Yeah, I've been meaning to get to it. I'll probably push something in the next 1-2 weeks.

This was referenced Sep 25, 2017
@crew102
Contributor

crew102 commented Feb 15, 2018

Closed?

@daattali
Owner

Indeed! @kassambara this exists now
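
For readers landing here, a minimal sketch of how the shipped feature is typically called. It assumes a ggExtra version that provides the `groupColour` and `groupFill` arguments of `ggMarginal()` and a scatter plot whose `colour` aesthetic is mapped to a factor; the `iris` example and the object names are illustrative only:

```r
library(ggplot2)
library(ggExtra)

# Scatter plot with colour mapped to a factor (Species)
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# Group the outline and the fill of the marginal densities
# by the variable mapped to colour in the scatter plot
m <- ggMarginal(p, type = "density", groupColour = TRUE, groupFill = TRUE)
m
```

The arguments are boolean flags rather than variable names, which matches the direction the discussion above was leaning. Extra parameters such as `alpha` can be passed through `...` if the default transparency does not suit the data.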

@nmasto

nmasto commented Jun 17, 2023

Way late to this but perhaps worthwhile - I cannot figure out how to combine the functionality of fviz_pca_ind with ggMarginal. Even after adding a grouping variable outside of the fviz_pca_ind() call using geom_point, ggMarginal doesn't appear to recognize the grouping variable. See code below - kind of ugly. Is this a communication breakdown between `fviz_pca_ind` to `ggplot` to `ggMarginal`? It recognizes that there are 3 groups but not the color or fill.

state <- fviz_pca_ind(move_pca,
  # Individuals
  fill.ind = dat$state,
  # col.ind = "black",
  # pointshape = 21,
  # col = "black",
  # fill = movevars1$state,
  # pointsize = 2,
  # labelsize = 5,
  alpha = 0.5,
  palette = cols,
  addEllipses = TRUE,
  ellipse.type = "confidence",
  ellipse.level = 0.95,
  mean.point = FALSE,
  label = "var",
  col.var = "black",
  repel = TRUE,
  legend.title = "",
  ggtheme = theme_minimal(base_size = 16)) + # Close fviz_pca_ind
  labs(title = "",
       x = "Time (PC1)",
       y = "Energy (PC2)"
  ) +
  geom_point(aes(dat$pc1, dat$pc2, fill = dat$state), color = "black", size = 2, shape = 21) + # rewrite points
  scale_fill_manual(values = cols) + # rewrite colors
  theme_bw(base_size = 16) +
  theme(aspect.ratio = 1,
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(color = "black", size = 14),
        legend.position = c(.35, .95),
        legend.justification = c("right", "top"),
        legend.box.background = element_blank()
        #axis.title.y = element_text(color="black", size = 20),
        #axis.title.x = element_text(color="black", size = 20)
  )

state1 <- ggMarginal(state, type = "density", col = "black", groupFill = TRUE)

[Image: PCA plot with marginal density plots]

@nmasto

nmasto commented Jun 17, 2023

Sorry. For context, even when I try to specify the data, it says the row counts are misaligned, even though the ggplot object stores all of the data, including a fill variable. So specifying my own data, x/y coordinates, and fill object still throws an error, and I'm not sure why:

tail(state$data) # 143 observations with x, y, and fill variables

    name            x           y       coord        cos2      contrib     Fill.
138  138  0.029920576 -0.04905992 0.003302116 0.002063812 0.0005767548 Tennessee
139  139  0.469097338 -0.15273796 0.243381196 0.257227431 0.0425094847 Tennessee
140  140  0.384657694  0.19810512 0.187207182 0.319287372 0.0326980103 Tennessee
141  141 -2.198499773 -0.72048493 5.352499779 0.620500493 0.9348791579 Tennessee
142  142 -0.926122738  0.44988083 1.060096089 0.806648433 0.1851586697 Tennessee
143  143 -0.009075775 -0.56402287 0.318204172 0.205827671 0.0555782271 Tennessee

state1 <- ggMarginal(data = state$data, x = state$data$x, y = state$data$y, fill = state$data$Fill., type = "density")

Error in `ggplot2::geom_density()`:
! Problem while setting up geom aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (512)
✖ Fix the following mappings: `fill`
Backtrace:
  1. ggExtra::ggMarginal(...)
  5. ggExtra:::addTopMargPlot(pGrob, top, size)
  6. ggExtra:::getMargGrob(top)
  7. ggplot2::ggplotGrob(margPlot)
 12. ggplot2:::ggplot_build.ggplot(x)
     ...
 21. l$compute_geom_2(d)
 22. ggplot2 (local) compute_geom_2(..., self = self)
 23. self$geom$use_defaults(data, self$aes_params, modifiers)
 24. ggplot2 (local) use_defaults(..., self = self)
 25. ggplot2:::check_aesthetics(params[aes_params], nrow(data))

Not sure where ggMarginal is pulling the data to get 512 for geom_density() when the data is clearly only 143 observations long. Thanks for any advice if/when convenient.
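
A likely explanation for the 512, offered as an inference rather than something confirmed in this thread: `geom_density()` evaluates each density curve on a grid of `n = 512` points by default, so the data that reaches the geom has 512 rows per group no matter how many observations went in. A `fill` vector of length 143 passed as a fixed parameter then fails the length check. The sketch below uses `mtcars` as a stand-in dataset:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) + geom_density()

# layer_data() returns the data after the density stat has been computed
computed <- layer_data(p, 1)

nrow(mtcars)    # number of input observations
nrow(computed)  # number of grid points the density was evaluated at

# Base R's density() uses the same default grid size
length(density(mtcars$mpg)$x)
```

This is why grouping has to come from a mapped variable (for example through `groupFill = TRUE`) rather than from a vector passed as a constant aesthetic.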

@nmasto

nmasto commented Jun 17, 2023

I got it to work by adding the argument habillage to fviz_pca_ind(). Apologies for any inconvenience. Man do those plots look good. Great work with ggMarginal.

[Image: figure_5_ordination_dens, ordination plot with grouped marginal density plots]
