This notebook is "inspired" by (i.e., mostly copied from) Cameron Davidson-Pilon's screencast on Predicting Ages from First Names, which walks through his package demographica. One thing the screencast doesn't do is walk through the data wrangling, so I'll do that here.

The point of this package is to take a list of first names and show their distribution by "Age Bin" (image below).

[Image: sample]

  • Dataset is from catalog.data.gov, specifically Baby Names from Social Security Card Applications

  • We use the Law of Total Probability to determine the age distribution of customers from their first names. This works because names tend to go in and out of style. For example (granted, it is an extreme one), 70% of girls named "Brittany" are between 25 and 34 years old (as of April 27, 2020).

  • The Law of Total Probability is defined as the equation below, where:

    $$P(\text{Age Bin}) = \sum_{\text{name}} P(\text{Age Bin} \mid \text{name}) \, P(\text{name})$$

    • P(name) is the probability of a given name x (i.e., sum(quantity of people with name x) / sum(total quantity))
    • P(Age Bin | name) is the probability of an age bin z given the name x (i.e., sum(quantity of name x in age bin z) / sum(quantity of people with name x))
    • P(Age Bin | name) P(name) = sum(quantity of name x in age bin z) / sum(total quantity)
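The three quantities above can be sketched with a few lines of plain Python. This is only an illustration: the counts are made-up toy numbers (not the Social Security data), and the variable names are hypothetical rather than taken from the notebook.

```python
from collections import defaultdict

# Hypothetical toy counts: (name, age_bin) -> number of people.
# These numbers are invented for illustration, not real SSA data.
counts = {
    ("Brittany", "25-34"): 70,
    ("Brittany", "35-44"): 30,
    ("Mildred", "25-34"): 10,
    ("Mildred", "65+"): 90,
}

total = sum(counts.values())

# Quantity of people per name, summed over all age bins.
name_totals = defaultdict(int)
for (name, _age_bin), n in counts.items():
    name_totals[name] += n

# P(name) = quantity with name x / total quantity
p_name = {name: n / total for name, n in name_totals.items()}

# P(Age Bin | name) = quantity of name x in bin z / quantity with name x
p_bin_given_name = {
    (name, age_bin): n / name_totals[name]
    for (name, age_bin), n in counts.items()
}

# P(Age Bin | name) * P(name) = quantity of name x in bin z / total quantity
joint = {key: p_bin_given_name[key] * p_name[key[0]] for key in counts}

# Law of Total Probability: sum the joint terms over all names.
p_bin = defaultdict(float)
for (_name, age_bin), p in joint.items():
    p_bin[age_bin] += p
```

Summing the joint terms over names gives the overall share of each age bin, and those shares add up to 1.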

How to use

  1. If you have cloned / copied this and don't have any data, first run python DataLoader.py to download and unzip the necessary data
  2. Once you have the data, open the age_by_name notebook and run the cells.
  3. The ./data directory has a first_names.csv file that can be used to test the age_calculator function
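For intuition about what happens with a list of first names, here is a minimal sketch of the idea. It is not the notebook's age_calculator function: `estimate_age_distribution`, the lookup table, and the sample names are all hypothetical, and the sketch assumes names missing from the lookup are simply skipped.

```python
from collections import Counter, defaultdict


def estimate_age_distribution(names, p_bin_given_name):
    """Mix P(age bin | name) over a list of first names.

    `p_bin_given_name` maps name -> {age_bin: probability}.
    Names not present in the lookup are ignored (an assumption of this sketch).
    """
    name_counts = Counter(n.strip().title() for n in names)
    known = {n: c for n, c in name_counts.items() if n in p_bin_given_name}
    total = sum(known.values())
    if total == 0:
        return {}

    dist = defaultdict(float)
    for name, count in known.items():
        weight = count / total  # P(name) within this customer list
        for age_bin, p in p_bin_given_name[name].items():
            dist[age_bin] += weight * p  # P(age bin | name) * P(name)
    return dict(dist)


# Hypothetical lookup with invented probabilities, for illustration only.
lookup = {
    "Brittany": {"25-34": 0.7, "35-44": 0.3},
    "Mildred": {"25-34": 0.1, "65+": 0.9},
}
customers = ["Brittany", "brittany", "Mildred", "Zorp"]
result = estimate_age_distribution(customers, lookup)
```

Weighting each name by how often it appears in the list is the same Law of Total Probability sum, just with P(name) taken from the customer list instead of the whole population.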