Exploring the CDC birth data files using Python.
The work can by broken into two parts:
-
Data prep and consolidation of all the records into several summary tables. The summary tables may be of use to others.
-
Analysis of the data, including production of several data visualizations
Note: Still a work in progress! I'll be updating as I go along. Here's a sample of the visualizations:
The data set is from the National Bureau of Economic Research (NBER), on their Vital Statistics Natality Birth Data page. NBER has provided csv's of the birth data, by year, from 1968 to 2020. The data is originally sourced from the CDC Vital Statistics, and totals ~19 GB, uncompressed.
I'm interested in seeing which months of the year are most popular for births. Answer: the summer!
You can reproduce the charts below by running the make figures
command. See the interactive chart of the violin plot on my blog, and run it in Colab.
Figure showing the percentage change in births by month. Note that the summer months are the most popular for giving birth.
Plotting the numbers by year and month is less useful, like in the figure below, but still interesting.
I've also created choropleth maps of the birth data, by state, for the years 1981-2020. I just need to fix some things and put them up....
Project will work well in a Linux and on a HPC environment. It will also work with MacOS (although not tested). Windows may require some minor changes by the user.
The following tables are currently generated:
-
births_simple.csv
- number of births, by year and month, from 1968 to 2020. Included in GitHub repo. -
births_simple_with_apgar.csv
- number of births, by year and month, including the APGAR score, from 1968 to 2020. Also included in repo. -
births_with_geo_apgar_consolidated.csv.gz
- number of births, with geography data and APGAR scores. Birth totals are grouped by geography, APGAR score, and date. Data is only from 1982 to 2004 (years where geo data is still publicly accessible). This file is included in repo (only 7 MB). -
births_with_geo_apgar.csv.gz
- number of births, with geography data and APGAR scores, but not grouped by. Not included in this repo cause of size (~90 MB).
All the figures can be generated from the first three csv's listed above. (I will include links to the Colab notebooks in the future)
To reproduce results:
- Clone this repo -
clone https://github.com/tvhahn/cdc-birth-data.git
- Create virtual environment. Assumes that Conda is installed.
- Linux/MacOS: use command from the Makefile in the root directory -
make create_environment
- Windows: from root directory -
conda env create -f envcdcbirth.yml
- HPC:
make create_environment
will detect HPC environment and automatically create environment frommake_hpc_venv.sh
. Tested on Compute Canada. Modifymake_hpc_venv.sh
for your own HPC cluster.
- Linux/MacOS: use command from the Makefile in the root directory -
...
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources. You'll find geocoding data here.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
│
├── models <- Models (don't have any at the moment)
│
├── notebooks <- Jupyter notebooks.
│ └── scratch <- Jupyter notebooks of questionable quality.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │
│ ├── features <- Not used at present
│ │
│ ├── models <- Not used at present
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
Ideas to consider:
- Identify anomalies by geography (although geo data may not be fine enough)
- Allow user to select any number of columns, and spit out new consolidated data file <-- this is my number one priority.
Any other suggestions, let me know! Happy to hack away at this with others. Cheers!