In 2017 I gave a talk called "Tricks for cleaning your data in R" at the Data+Narrative workshop at Boston University. The repo with the code and data I used for the talk was pretty well received, so I figured I'd try to do some of the same stuff in Python using pandas.
Disclaimer: when it comes to data stuff, I'm much better with R, especially the tidyverse set of packages, than with Python, but in my last job I used Python's pandas library to do a lot of data processing since Python was the dominant language there. Please feel free to let me know if there are better ways to do things!
- Python: website for Python
- pandas: website for the pandas library
- Jupyter: website for Project Jupyter, whose interactive notebook this tutorial was written in
- Pandas data cleaning tricks.ipynb: Jupyter notebook file (for viewing on the web - Desktop only)
- pandas-data-cleaning-tricks.pdf: PDF file (for printing out)
- pandas-data-cleaning-tricks.py: the straight-up Python code, with annotations commented out
- employee-earnings-report-2016.csv: data on earnings for Boston's municipal employees, from the city's open data portal
- unemployment.xlsx: data on global unemployment rates from 2012 to 2016, from the International Monetary Fund
- attendees.csv: data on some attendees of the 2017 Data+Narrative workshop, with names and identifying information removed
- You can clone or download this repository by clicking on the green button above, "Clone or download"
- Follow along by reading the .ipynb file online or printing out the .pdf file, using the GitHub links above
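
If you'd rather poke at the data yourself after downloading the repo, here is a minimal sketch of loading the three data files with pandas. The file names come from the list above; the `load_datasets` helper and the assumption that the files sit in your working directory are mine, not part of the notebook. The real files may also need extra arguments (encoding, sheet name, header rows) that this sketch doesn't try to guess.

```python
from pathlib import Path

import pandas as pd

# File names as listed in this README; their location is an assumption.
DATA_FILES = [
    "employee-earnings-report-2016.csv",
    "unemployment.xlsx",
    "attendees.csv",
]


def load_datasets(data_dir="."):
    """Hypothetical helper: return {file name: DataFrame} for each data file found in data_dir."""
    frames = {}
    for name in DATA_FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # skip anything that hasn't been downloaded yet
        if path.suffix == ".xlsx":
            # Reading .xlsx needs an Excel engine such as openpyxl installed.
            frames[name] = pd.read_excel(path)
        else:
            frames[name] = pd.read_csv(path)
    return frames


# Print the shape of whatever was found in the current directory.
for name, df in load_datasets().items():
    print(name, df.shape)
```

Skipping missing files instead of raising an error means the loop still works if you only grabbed one of the datasets.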
Questions or comments? Reach me at ychristinezhang at gmail dot com or on Twitter.
This work is licensed under a Creative Commons Attribution 4.0 International License.