This repository contains R code to munge, analyze, and publish (as a Shiny web app) summaries of word counts in National Science Foundation (NSF) grant abstracts from the Division of Mathematical Sciences (DMS). See mathtrends.ssk.im for an example.
The relevant files are:
- xml_munge.R
- gen_tdm.R
- tdm_dms.R
- ui.R
- server.R
For any questions, bug reports, etc., contact Steven S. Kim via e-mail at [email protected].
Required R packages include:
- XML
- plyr
- data.table
- tm
- RWeka
- ggplot2
- stringr
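As a convenience, the snippet below is a minimal sketch for checking which of the packages listed above are already installed. The variable names (`required_pkgs`, `is_installed`, `missing_pkgs`) are illustrative and do not come from the repository.

```r
# Packages listed above as required by this repository.
required_pkgs <- c("XML", "plyr", "data.table", "tm", "RWeka", "ggplot2", "stringr")

# TRUE for each package that can be loaded, FALSE otherwise.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```

Note that RWeka depends on a working Java installation, so it may need extra setup beyond `install.packages`.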
The XML files containing abstract data were downloaded from the NSF website.
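To give a rough idea of what the munging step involves, here is a minimal sketch of pulling the year and abstract out of a single award record with the XML package. The record is a made-up, heavily trimmed example, and the element names (`AwardEffectiveDate`, `AbstractNarration`), the date format, and the helper `get_field` are assumptions for illustration; this is not necessarily how xml_munge.R does it.

```r
library(XML)

# Hypothetical, trimmed award record; element names and date format are assumptions.
award_xml <- '<rootTag>
  <Award>
    <AwardTitle>Example Award</AwardTitle>
    <AwardEffectiveDate>07/01/2010</AwardEffectiveDate>
    <AbstractNarration>This project will study partial differential equations.</AbstractNarration>
  </Award>
</rootTag>'

doc <- xmlParse(award_xml, asText = TRUE)

# Illustrative helper: return the text of every element with the given tag name.
get_field <- function(doc, tag) {
  xpathSApply(doc, paste0("//", tag), xmlValue)
}

# One row per award: the four-digit year (characters 7 to 10 of MM/DD/YYYY) and the abstract.
award <- data.frame(
  year = as.integer(substr(get_field(doc, "AwardEffectiveDate"), 7, 10)),
  abstract = get_field(doc, "AbstractNarration"),
  stringsAsFactors = FALSE
)
```

For the real data, the same extraction would be applied to each downloaded file (passing the file path to `xmlParse` instead of a string) and the rows combined.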
- This project was heavily influenced by the Google Ngram viewer.
- By default, the code covers the years 1990 to 2015. This range is an arbitrary choice and can be changed by updating the YEARS constant in the code. Note, however, that many XML files from earlier years do not contain abstract data.
- Key functionality is provided by the tm text-mining package in R.
- The file tdm_dms.R sparsifies the TermDocumentMatrix to only include terms which occur in at least 20% of the years analyzed.
- A few example terms with interesting trends:
  - machine learning vs. data vs. statistics + statistical
  - biology + biological
  - underrepresented, minority + minorities
  - outreach
  - young researchers and undergraduate, graduate
  - develop, advance + advances
  - the project will
  - network + networks
  - control, partial differential
- Some eventual TODOs:
  - smoothing the time series
  - a "shuffle" option incorporating a list of sample queries
  - make the plot interactive with tool-tips on hover
  - look at all divisions and make comparisons across NSF
  - compare to NIH/DOD/NSERC funding priorities
  - use a Markov model to generate a "sample" abstract
  - map textual differences across corpora
  - a "dollar-weighted" count (weighting gram proportion in a given grant by dollars in grant)
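
The 20% sparsification rule mentioned above for tdm_dms.R can be illustrated with a small base-R sketch. It uses a toy term-by-year count matrix rather than a real TermDocumentMatrix, and the names `counts`, `year_share`, and `MIN_SHARE` are hypothetical; the actual script may implement the filter differently.

```r
# Toy example: rows are terms, columns are years (one "document" per year).
YEARS <- 1990:1999
counts <- matrix(0, nrow = 3, ncol = length(YEARS),
                 dimnames = list(c("network", "outreach", "topos"),
                                 as.character(YEARS)))
counts["network", ] <- 5                     # appears in all 10 years
counts["outreach", c("1998", "1999")] <- 2   # appears in 2 of 10 years (exactly 20%)
counts["topos", "1995"] <- 1                 # appears in 1 of 10 years (10%)

# Fraction of years in which each term appears at least once.
year_share <- rowSums(counts > 0) / ncol(counts)

# Keep only terms that appear in at least 20% of the years.
MIN_SHARE <- 0.2  # assumed name for the threshold
kept <- counts[year_share >= MIN_SHARE, , drop = FALSE]
rownames(kept)
```

Here "network" and "outreach" survive and "topos" is dropped. On a real TermDocumentMatrix, `tm::removeSparseTerms(tdm, sparse = 0.8)` applies a similar filter, although its treatment of terms sitting exactly at the threshold may differ.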