Spring 2018; Mondays: 2:10-4pm
E3B Department, Columbia University
Instructor: Dr. Deren Eaton ([email protected])
Teaching Assistant: Patrick McKenzie ([email protected])
Professor Eaton: Fridays, 10am-12pm in 1007 Schermerhorn Ext.
Patrick McKenzie: Tuesdays 2-4pm in student room, Schermerhorn Ext.
Programming and Data Science for Biologists (PDSB) will introduce students to fundamental computational skills and concepts for working with large biological data sets. This will include an introduction to several programming languages (Python, R, Julia), and in-depth training in one language in particular (Python). We will cover tools for collaboration and version control (git, GitHub), and how these tools can be used to host and share code, data, and websites. A core focus throughout the course will be reproducibility, including the tools (jupyter) and practices that support it. We will learn to organize and structure data for statistical analyses (DataFrames, arrays, datatypes), and explore tools for scientific analyses (scipy, pymc3, scikit-learn, keras) and visualization (matplotlib, toyplot, bokeh). Exercises and assignments will introduce students to large empirical datasets used in the biological sciences, from studies of genomics to biodiversity. The latter half of the class is organized around individual projects, in which students will be guided to design a command-line program and/or API for performing a specific type of analysis. Computer programs are ubiquitous in biology, but few biologists receive formal training in designing and writing software. This course offers a deeper introduction to computational techniques and algorithms commonly applied to biological datasets.
The course meets on Mondays from 2:10-4pm. Each meeting will be a mix of lecture, in-class "active" learning, and group activities. In addition, there will be substantial work outside of class to complete individual and group assignments, as well as a final project. In-class activities will include group problem solving and code comparisons. All software and materials for the course, including assigned readings, are open access (available online for free). An example session would include a lecture introducing a general concept with examples from biological research, followed by a group active-learning exercise in which students implement the method and apply it to real datasets.
There will be code-based assignments for nearly every class period. These will require completing assignments outside of class in addition to performing "code reviews" of submissions by classmates based on assigned criteria. For example, a reviewer will check that the code works and follows proper style, and describe how the implementation differs from their own code. Both code and code reviews will be graded. Each student will develop a course project, defend it through a formal proposal process, and later present it as a final project. Projects will be developed and published as open source code on GitHub and evaluated on the basis of documentation, task-completion, and examples.
Grades will be based on 13 assignments (35%), code reviews (20%), a project proposal (5%), a project presentation (5%), the project itself (20%), and class participation (15%).
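As a rough illustration of how the grading breakdown above combines into a final grade, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and averaged within its category beforehand; the names `WEIGHTS`, `final_grade`, and the example scores are hypothetical and are not part of any official course tooling.

```python
# Category weights (percent of the final grade), taken from the breakdown above.
WEIGHTS = {
    "assignments": 35,
    "code_reviews": 20,
    "project_proposal": 5,
    "project_presentation": 5,
    "project": 20,
    "participation": 15,
}


def final_grade(scores):
    """Return the weighted final grade (0-100).

    `scores` maps each category in WEIGHTS to a score on a 0-100 scale
    (an assumption made for this sketch).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "assignments": 90,
    "code_reviews": 80,
    "project_proposal": 100,
    "project_presentation": 100,
    "project": 85,
    "participation": 100,
}

print(final_grade(example_scores))  # prints 89.5
```

The weights sum to 100, so a student with full marks in every category receives 100.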
Academic dishonesty is a serious offense and will not be tolerated in the class. Students are expected to reference sources appropriately in any work, including references to open source code found online, forum discussions, or third-party software tools used in assignments or projects. Violation of the academic integrity rules of Columbia College or the Graduate School of the Arts and Sciences (e.g., plagiarism) will result in automatic failure of the course. Rules and consequences are outlined in Columbia College's Faculty Statement on Academic Integrity: http://www.college.columbia.edu/faculty/resourcesforinstructors/academicintegrity/statement.
http://www.college.columbia.edu/rightsandresponsibilities
Session 1: 1/22/2018
Lecture: introduction, syllabus, unix, bash, markdown, github
Assigned tasks: Link to session 1 repo.
Reading due: None
Assignment due: None
Code Review due: None
Session 2: 1/29/2018
Lecture: git, GitHub, installation, conda
Assigned tasks: Link to session 2 repo.
Reading due: Chapter 1-2: https://git-scm.com/book/en/v2
Assignment due: Link to session 1 repo.
Code Review due: None
Session 3: 2/5/2018
Lecture: git, GitHub, conda, jupyter, Python strings, lists.
Assigned tasks: Link to session 3 repo.
Reading due: Python basics chapters 1-4.
Assignment due: Link to session 2 repo.
Code Review due: None
Session 4: 2/12/2018
Lecture: Python dicts, hashing, mapping, '.py' files, importing
Assigned tasks: Link to session 4 repo
Reading due: Python basics chapters 5-13
Assignment due: 3-Python-Basics.
Code Review due: 3-Python-Basics.
Session 5: 2/19/2018
Lecture: Packaging, API, CLI, argparse, sublimetext.
Assigned tasks: 5-packaging/
Reading due: Python basics chapters 14-18 and Pythonista style guide.
Assignment due: 4-Python-advanced/Assignment
Code Review due: 4-Python-advanced/Code-Review.
Session 6: 2/26/2018
Lecture: Scientific Python, numpy, scipy, pandas.
Assigned tasks: Link to session 6 repo
Reading due: Numpy & Pandas user guide
Assignment due: 5-Python-Packaging
Code Review due: 5-Python-Packaging
Session 7: 3/5/2018
Lecture: Python as glue; subprocess, jupyter-tunneling, SSH
Assigned tasks: Link to session 7 repo
Reading due: None
Assignment due: 6-scientific-python
Code Review due: 6-scientific-python (revised)
SPRING BREAK
see project proposal guide here
PROJECT PROPOSALS DUE: 3/21/2018
Session 8: 3/19/2018
Lecture: web-scraping, HTML, requests, REST
Assigned tasks: Link to session 8 repo
Reading due: None
Assignment due: Project proposals due 3/21
Code Review due: None
Session 9: 3/26/2018
Lecture: Analysis I: intro to machine learning
Reading due: Chapters 4 & 5 of Data Science Handbook
Assigned tasks: Notebooks 9.1-9.2
Assignment due: 'Records' repository from notebook 8.2 due 3/23
Code Review due: Fixed up 'Records' repository by 3/26
Session 10: 4/2/2018
Lecture: Analysis II: intro to maximum likelihood and Bayesian statistics
Assigned tasks: Notebooks 10.1-10.6
Reading due: gentle intro to Bayesian stats; Getting started page of pymc3 tutorial.
Assignment due: machine learning applied to records dataframes.
Code Review due: None.
Session 11: 4/9/2018
Lecture: plotting in Python, plotting for the web, and plotting in general.
Assigned tasks: Notebooks 11.1-11.5
Reading due:
- Chapter 4 of Data Science Handbook
- Toyplot Tutorial & User Guide
- Bokeh example app guide: Part I, Part II, Part III.
Assignment due: Notebook 10.7 in assignment directory of the repo.
Code Review due: Notebook 10.7
Session 12: 4/16/2018
Lecture: genomics and parallel programming
Assigned tasks: Notebooks 12.1-12.5
Reading due:
Assignment due: Notebook 11.5
Code Review due: Code-review 11.5
Session 13: 4/23/2018
Lecture: Project Presentations and genomics part II
Assigned tasks: Notebooks 13.2-13.3
Reading due:
Assignment due: Notebooks 12.4 and 12.5
Code Review due: Code review 12.5
Session 14: 4/30/2018
Lecture: Project Presentations and genomics part III
Assigned tasks: Note
Reading due: None
Assignment due: None
Code Review due: 13-applied-Python-machine-learning
No Classes: 5/7/2018
No Classes: 5/11/2018
Final Projects due online: 5/11/2018