Final project for the Data Science Specialization Capstone Course
- Shiny application that demos the predictor
- Shiny Demo source
- Presentation Slides
- Slides source
- 1st Milestone Report
- Link to raw data
- Link to ngram models with MLE probabilities
analysis.R : code to generate ngrams
- fetch_capstone_data: fetch the Capstone data
- preprocess_entries: perform text preprocessing and data cleanup
- get_docterm_matrix: function used to generate an ngram model and compute the Maximum Likelihood Estimate for each ngram.
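
As a rough illustration of the Maximum Likelihood Estimate mentioned above, the sketch below counts bigrams in a toy corpus and divides each count by the total count of bigrams sharing the same history (the preceding word). It is not the code in analysis.R: the corpus, the `make_ngrams` helper, and the column names are all made up for this example, and only base R is used.

```r
# Toy corpus standing in for the preprocessed Capstone text (illustrative only).
corpus <- c("i love data science", "i love r", "i like data")

# Hypothetical helper: build word n-grams of order n from one sentence.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count every bigram in the corpus.
bigrams <- unlist(lapply(corpus, make_ngrams, n = 2))
counts <- as.data.frame(table(ngram = bigrams), stringsAsFactors = FALSE)

# Split each bigram into its history (all but the last word) and its last word.
counts$history <- sub(" [^ ]+$", "", counts$ngram)
counts$word <- sub("^.* ", "", counts$ngram)

# MLE: count(history, word) / sum of counts of all ngrams with that history.
history_totals <- tapply(counts$Freq, counts$history, sum)
counts$mle <- counts$Freq / as.numeric(history_totals[counts$history])

print(counts[order(counts$history, -counts$mle), c("history", "word", "mle")])
```

In this toy corpus "i" is followed by "love" twice and by "like" once, so the estimate for "love" given "i" is 2/3. The same approach extends to 3-grams and 4-grams by changing `n`.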
search_with_dataframes.R : code to build ngram models and perform search using the models
- ngram_language_modeling_with_data_frames: train models on 2-grams, 3-grams and 4-grams from sampled data
- multi_search_tree_with_data_frames: predict function to estimate the next word for an input. Performs stupid backoff from the 4-gram model to the 3-gram or 2-gram model if a model yields no results.
- predict_test_data: measure the model accuracy on test data
- generate_queries_and_answers, generate_queries_and_answers_from_csv, generate_quiz_1_data, generate_quiz_2_data: methods to generate test data for predict_test_data
- build_ngram_4_partition: experimental code to build a model with 100% of the data
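
The backoff lookup described above can be sketched as follows. This is a minimal illustration, not the implementation in search_with_dataframes.R: the toy models, the `predict_next_word` function, and the `history`/`word`/`score` columns are assumptions made for the example. Full stupid backoff also multiplies lower-order scores by a fixed discount; that factor is left out here because the sketch only falls through when a higher-order model has no match, so the ranking within a level is unaffected.

```r
# Toy models, one data frame per n-gram order (values made up for illustration).
models <- list(
  "4" = data.frame(history = "thanks for the", word = "follow", score = 0.6,
                   stringsAsFactors = FALSE),
  "3" = data.frame(history = c("for the", "for the"),
                   word = c("first", "follow"), score = c(0.5, 0.3),
                   stringsAsFactors = FALSE),
  "2" = data.frame(history = c("the", "the"),
                   word = c("same", "first"), score = c(0.4, 0.2),
                   stringsAsFactors = FALSE)
)

# Hypothetical predictor: try the 4-gram model first, then back off to the
# 3-gram and 2-gram models when the longer history has no match.
predict_next_word <- function(input, models, top_n = 1) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in c(4, 3, 2)) {
    k <- n - 1                      # number of history words for this order
    if (length(words) < k) next     # input too short for this model
    history <- paste(tail(words, k), collapse = " ")
    model <- models[[as.character(n)]]
    hits <- model[model$history == history, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$score), ]
      return(head(hits$word, top_n))
    }
  }
  character(0)                      # no model produced a candidate
}

print(predict_next_word("thanks for the", models))   # answered by the 4-gram model
print(predict_next_word("waiting for the", models))  # backs off to the 3-gram model
print(predict_next_word("over the", models))         # backs off to the 2-gram model
```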
grid_search.R
- grid_search: attempt to find the optimal value of "ngram coverage", the threshold used to prune the ngram models, using a grid search.
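
A grid search of this kind might look like the sketch below. It rests on an assumption about what "ngram coverage" means: here it is taken to be the share of all ngram occurrences retained when only the most frequent ngrams are kept. The toy model, the test set, and the helpers `prune_by_coverage` and `top_prediction` are hypothetical and are not taken from grid_search.R.

```r
# Toy bigram counts (made up for illustration).
model <- data.frame(
  history = c("the", "the", "the", "a", "a"),
  word    = c("cat", "dog", "emu", "cat", "owl"),
  count   = c(50, 30, 2, 15, 3),
  stringsAsFactors = FALSE
)

# Assumed meaning of "coverage": keep the most frequent ngrams until they
# account for at least `coverage` of all ngram occurrences.
prune_by_coverage <- function(model, coverage) {
  model <- model[order(-model$count), ]
  share <- cumsum(model$count) / sum(model$count)
  model[seq_len(which(share >= coverage)[1]), ]
}

# Most frequent next word for a history, or NA if the history was pruned away.
top_prediction <- function(model, history) {
  hits <- model[model$history == history, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word[which.max(hits$count)]
}

# Toy held-out queries with their expected next words.
test_set <- data.frame(history = c("the", "a", "the"),
                       answer  = c("cat", "cat", "dog"),
                       stringsAsFactors = FALSE)

# Evaluate model size and accuracy for each candidate coverage value.
grid <- c(0.5, 0.9, 1.0)
results <- data.frame(coverage = grid, rows = NA_integer_, accuracy = NA_real_)
for (i in seq_along(grid)) {
  pruned <- prune_by_coverage(model, grid[i])
  predicted <- vapply(test_set$history, top_prediction, character(1),
                      model = pruned)
  results$rows[i] <- nrow(pruned)
  results$accuracy[i] <- mean(!is.na(predicted) & predicted == test_set$answer)
}

# Pick the smallest model among those with the best accuracy.
best <- results[results$accuracy == max(results$accuracy), ][1, ]
print(results)
print(best)
```

The trade-off being explored is model size against accuracy: in this toy case, pruning to 90% coverage keeps 3 of the 5 rows without lowering accuracy relative to the unpruned model.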
sample_data.R
- sample_capstone_data, generate_sample_files: methods to generate samples of raw training data.
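
For readers unfamiliar with this step, drawing a reproducible random sample of lines can be sketched as below. The generated lines stand in for the raw data files, and the 5% fraction, the seed, and the variable names are assumptions chosen for illustration rather than values used in sample_data.R.

```r
set.seed(42)  # fixed seed so the same sample is drawn on every run

# Stand-in for the lines of a raw data file (normally read with readLines()).
raw_lines <- sprintf("line %d of the raw corpus", 1:1000)

# Assumed sampling fraction; chosen only for this example.
sample_fraction <- 0.05
n_sample <- round(length(raw_lines) * sample_fraction)

# Sample line indices without replacement, keeping the original line order.
picked <- sort(sample.int(length(raw_lines), n_sample))
sampled <- raw_lines[picked]

# Write the sample to a temporary file, one line per entry.
sample_path <- tempfile(fileext = ".txt")
writeLines(sampled, sample_path)
print(length(readLines(sample_path)))
```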
See steps.md for steps to build the ngram models.