wordcloud for categorical features #129
Comments
Hey, this is a cool idea! I would recommend using a graphic that shows the information in a more accessible way than a word cloud, such as a histogram of the TF-IDF values by word.
I guess it depends on what you're trying to communicate. Wordclouds are great for showing a bunch of information in a pleasing way, and often for communicating an 'obvious' but not exactly quantified message (e.g. for a wordcloud of country: "your customers live in 5 different countries, but mostly in Australia", or for gender: "there are two genders, split roughly evenly"). For some purposes, they're a faster method of communication - e.g. I can skim over a wordcloud of gender in well under a second, extract all I care about (are M/F there, are they roughly even, etc.), and move on. Reading a normal graph takes a little more work, and it's not obvious from skimming it (until you read e.g. the bar graph labels) what you're looking at. Wordclouds are also pretty easy - nearly anyone (non-technical people included) will understand what they're trying to convey, without needing previous experience to interpret the graph (which the one you sent would require - as a technical person, after a 5 second skim I wasn't sure what it was saying, or what the assumptions were, etc.). However, they're not good for anything more nuanced (i.e. where you want to do more than just skim over it, or where you've got much more complicated data). For that, I agree that there are better solutions (including the one you mentioned). But, as above, it also depends on whether you're handling short or long strings, etc. Ultimately, it depends on the users of pandas-profiling.
Word clouds look nice, but I'm not a fan of them, to be honest. Sure, it is easy to spot the word with the largest frequency, but it is more difficult to discern the 2nd largest, 3rd largest, etc. I'd rather see a simple horizontal Pareto bar chart. Function over aesthetics, especially for a library that already has to do a lot of computation.
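To make the Pareto bar chart suggestion concrete, here is a minimal sketch using only the Python standard library. It is not pandas-profiling code: the function names `pareto_rows` and `render_bars` and the sample `countries` data are hypothetical, and the bars are drawn as plain text rather than with a plotting library.

```python
from collections import Counter


def pareto_rows(values):
    """Return (value, count, cumulative_share) tuples, most frequent first."""
    counts = Counter(values).most_common()  # sorted by descending count
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for value, count in counts:
        running += count
        rows.append((value, count, running / total))
    return rows


def render_bars(rows, width=30):
    """Draw one text bar per value, scaled so the top value fills `width`."""
    top = rows[0][1] if rows else 1
    lines = []
    for value, count, cumulative in rows:
        bar = "#" * max(1, round(width * count / top))
        lines.append(f"{str(value):<12} {bar} {count} ({cumulative:.0%} cumulative)")
    return "\n".join(lines)


# Hypothetical categorical column, for illustration only.
countries = ["Australia"] * 6 + ["New Zealand"] * 2 + ["Canada", "Japan"]
print(render_bars(pareto_rows(countries)))
```

Because the rows are sorted and carry a cumulative share, the 2nd and 3rd most frequent values can be read off directly, which is the property the comment above is asking for.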
So, this needs a bit of care.
Also, to confirm - I'm not overly sold on them. If I had to pick one (for all text types), I'd probably go with a horizontal bar chart over a wordcloud.
I guess this depends on what you're looking at - for me, if I'm looking at a bunch of e.g. first names, all I really care about is that a) they all look like names and b) there aren't any massively over-represented values like "John Doe". I don't think I've ever had any use for "John" being slightly more popular than "Peter". Often the columns are industry-specific codes, so I don't even know what they mean, and I wouldn't want to be digging that deep at a profiling stage. Anyway, I guess where I'm coming from is that I don't want "function over aesthetics" to shut down productive discussion. (For example, if someone really wanted wordclouds enough - not me - they could turn some of the above points into a more structured proposal, give examples of wordclouds against bar charts in different scenarios, and give a concrete plan for how this could be incorporated into the UI, etc.)
Stale issue
I've just come across this project, and it looks great - better than things we've built (internally) in the past. That said, one thing we found quite useful was having a wordcloud to display frequencies of unique categorical values (i.e. the size of the text corresponds to the frequency). For example:
For numbers people, it'll be of lesser value - though you could argue it's easier to communicate the key results quickly and in less space. For non-technical people (e.g. clients) it's often a lot more useful. Other comments:
I'm not recommending it per se - just an idea, and we'll see what others think.
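As a rough illustration of "the size of the text corresponds to the frequency", the sketch below maps each unique categorical value to a font size proportional to its count. This is only an assumption about how such a mapping could work, not how any particular wordcloud library or pandas-profiling does it; `font_sizes`, `min_size`, `max_size` and the sample data are hypothetical names chosen for the example.

```python
from collections import Counter


def font_sizes(values, min_size=10, max_size=48):
    """Map each unique value to a font size proportional to its frequency.

    The most frequent value gets `max_size`; rarer values shrink in
    proportion to their count, but never below `min_size`.
    """
    counts = Counter(values)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        value: max(min_size, round(max_size * count / top))
        for value, count in counts.items()
    }


# Hypothetical gender column with a roughly even split.
genders = ["F"] * 52 + ["M"] * 48
print(font_sizes(genders))
```

With proportional scaling, a roughly even split produces roughly equal text sizes, so the "are they roughly even" question can still be answered at a glance. Layout (placing the words without overlap) is left out of this sketch.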