The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are completed individually, but you should talk frequently with your instructors and classmates about them.
Address a data-related problem in your professional field or a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll produce a better project!
To stimulate your thinking, here is an excellent list of public data sources. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release. You are also welcome to compete in a Kaggle competition as your project, in which case the data will be provided to you.
You should also take a look at past projects from other GA Data Science students, to get a sense of the variety and scope of projects.
You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final week of class, though you are also encouraged to present it to other audiences.
Here are the components you should aim to cover in your paper:
- Problem statement and hypothesis
- Description of your data set and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your presentation should cover these components with less breadth and less depth. Focus on creating an engaging, clear, and informative presentation that tells the story of your project.
You should create a GitHub repository for your project that contains the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
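One way to start the data dictionary listed above is to draft a skeleton from the data file and then fill in the descriptions and units by hand. The following is a minimal sketch, not a required approach: the function name `draft_data_dictionary`, the output fields, and the toy rows are all hypothetical, and the type guess is deliberately crude (anything that parses as a float counts as numeric).

```python
import csv
import io


def draft_data_dictionary(csv_file):
    """Return one dict per column: name, guessed type, missing count, example."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    stats = {
        name: {"missing": 0, "example": "", "numeric": True} for name in columns
    }

    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value == "":
                # Empty cells are counted as missing values.
                stats[name]["missing"] += 1
                continue
            if not stats[name]["example"]:
                stats[name]["example"] = value
            try:
                float(value)
            except ValueError:
                stats[name]["numeric"] = False

    return [
        {
            "variable": name,
            "type": "numeric" if s["numeric"] and s["example"] else "text",
            "missing": s["missing"],
            "example": s["example"],
            "units": "",        # fill in by hand
            "description": "",  # fill in by hand
        }
        for name, s in stats.items()
    ]


# Toy rows made up for illustration only; this is not real data.
toy_csv = io.StringIO("age,fare,port\n22,7.25,S\n,71.28,C\n35,,S\n")
dictionary = draft_data_dictionary(toy_csv)
for entry in dictionary:
    print(entry)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("your_data.csv", newline="")`, where the path is a placeholder.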
If it's not possible or practical to include your entire dataset, you should link to your data source and provide a sample of the data. (GitHub has a size limit of 100 MB per file and 1 GB per repository.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
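The sampling, size-limit, and anonymization points above could be sketched roughly as follows. Everything here is an assumption made for illustration: the helper names, the `id` column, the salt string, and the reading of "100 MB" as 100 * 1024 * 1024 bytes. Replacing identifiers with a salted hash is only pseudonymization and may not be enough for sensitive data, so treat it as a starting point.

```python
import csv
import hashlib
import os
import random
import tempfile

# Per-file limit quoted in the text, interpreted here as 100 * 1024 * 1024 bytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def fits_on_github(path):
    """Return True if the file at `path` is within the per-file limit."""
    return os.path.getsize(path) <= GITHUB_FILE_LIMIT_BYTES


def sample_rows(rows, k, seed=0):
    """Draw up to k rows without replacement; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def pseudonymize(value, salt):
    """Replace an identifier with a short salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


# Toy rows made up for illustration only; this is not real data.
rows = [{"id": f"user{i}", "value": str(i)} for i in range(100)]

sample = sample_rows(rows, 10)
shareable = [
    {"id": pseudonymize(row["id"], salt="replace-with-a-secret-salt"),
     "value": row["value"]}
    for row in sample
]

# Write the sample to a temporary CSV and confirm it is small enough to commit.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", newline="", delete=False
) as handle:
    writer = csv.DictWriter(handle, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(shareable)
    sample_path = handle.name

small_enough = fits_on_github(sample_path)
print(len(shareable), "rows written; within limit:", small_enough)
os.remove(sample_path)
```

Keeping the seed fixed means reviewers who rerun the script get the same sample, which makes feedback on the code easier to compare.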
What is the question you hope to answer? What data are you planning to use to answer that question? What do you know about the data so far? Why did you choose this topic? If you have several ideas, please lay them all out! If you have data sources (even if you haven't scraped them yet), please write those down as well. Upload this as a Markdown file in your sfdat22_work repo.
Example:
- I'm planning to predict passenger survival on the Titanic.
- I have Kaggle's Titanic dataset with 10 passenger characteristics.
- I know that many of the fields have missing values, that some of the text fields are messy and will require cleaning, and that about 38% of the passengers in the training set survived.
- I chose this topic because I'm fascinated by the history of the Titanic.
Zip up all files relevant to your project, and send a link to Mars on Slack. Your peers and instructors will provide feedback, according to these guidelines.
At a minimum, you should include:
- Narrative of what you have done so far and what you are still planning to do
- Code, with lots of comments
Ideally, you would also include:
- Visualizations you have done
- Slides (if you have started making them)
- Data and data dictionary
Tips for success:
- The work should stand "on its own", and should not depend upon the reader remembering anything you might have previously said in class about your project.
- Organize your narrative and files so that the reader can easily follow along.
- The better you explain your project, and the easier it is to follow, the more useful feedback you will receive!
- If your reviewers can actually run your code on the provided data, they will be able to give you more useful feedback on your code. (It can be very hard to make useful code suggestions on code that can't be run!)
If you would like additional feedback on your project, submit a revised version of your project. Your instructors will provide feedback. (There is no peer review for the second draft.)
Deliver your project presentation in class, and submit all required deliverables (paper, slides, code, data, and data dictionary).