We explore and analyze Uber pickup data for New York City. The data is from FiveThirtyEight and includes the following for each pickup:
- Date/Time
- Latitude
- Longitude
The notebooks use `statsmodels`, which depends on `scipy` 1.2.0, so we need to use a virtual environment.
Make sure you have `virtualenv` installed; if not, run `pip install virtualenv`.
First, we install the proper version of `scipy` in the virtual environment and install a Jupyter kernel for the virtual environment. From the project directory, do the following:
- Run `virtualenv --system-site-packages venv`.
- Activate the virtual environment. For example, on Windows, run `venv\Scripts\activate`.
- From the virtual environment, run `pip install scipy==1.2.0`.
- From the virtual environment, run `python -m ipykernel install --user --name=uber_data`.
- Exit the virtual environment by running `deactivate`.
Now, when we run `jupyter notebook` (inside or outside the virtual environment), just make sure you are using the `uber_data` kernel (and NOT the default kernel, e.g. `Python 3`). For example, using the menus in the notebook, go to `Kernel > Change kernel > uber_data`.
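
To confirm that a notebook is really running on the `uber_data` kernel, a quick check in the first cell can help. This is a minimal sketch; the helper name `installed_version` is ours and is not part of the project.

```python
import importlib
import sys


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# Under the uber_data kernel, the interpreter path should point inside venv/.
print("interpreter:", sys.executable)
# Under the uber_data kernel, this should report 1.2.0.
print("scipy:", installed_version("scipy"))
```

If the reported `scipy` version is not 1.2.0, the notebook is most likely still attached to the default kernel.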
We have the following notebooks:
Group the pickup data to get counts of how many pickups occur each hour (so 24 counts in a single day), and model these counts based on the day of the week. The model can be used to make strategic decisions related to how busy drivers are at different points in the week (e.g. finding the most popular times of the week).
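
The grouping step can be sketched with the standard library alone, so it runs without the project's dependencies. The sample rows, the timestamp format, and the function names below are assumptions for illustration; the notebook itself may do this differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the real data has one row per pickup.
pickup_times = [
    "4/1/2014 0:11:00",
    "4/1/2014 0:17:00",
    "4/1/2014 1:05:00",
    "4/2/2014 0:30:00",
]

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed format of the Date/Time field


def hourly_counts(timestamps):
    """Count pickups per (date, hour) bucket."""
    counts = Counter()
    for text in timestamps:
        t = datetime.strptime(text, TIME_FORMAT)
        counts[(t.date(), t.hour)] += 1
    return counts


def counts_by_weekday(counts):
    """Collect the hourly counts under their day of week (0 is Monday)."""
    grouped = {}
    for (day, _hour), n in counts.items():
        grouped.setdefault(day.weekday(), []).append(n)
    return grouped


counts = hourly_counts(pickup_times)
by_weekday = counts_by_weekday(counts)
print(by_weekday)
```

Note that this sketch only records hours with at least one pickup; hours with zero pickups would need to be filled in separately before modeling.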
Here is the resulting model:
The model was selected from several variations using cross-validation, where the hold-out set is scored by the mean log-likelihood under the model. The final model assumes that the conditional distribution of `y = hourly pickup count` given `x = day of week (0 is Monday)` is a negative binomial distribution.
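
As an illustration of the scoring idea, the sketch below fits one negative binomial per day of week and computes the mean log-likelihood on a hold-out set. It is not the notebook's code: it uses a single train/hold-out split rather than full cross-validation, a simple method-of-moments fit instead of a `statsmodels` model, and made-up counts. All names are hypothetical.

```python
import math
from statistics import mean, pvariance


def nb_logpmf(y, mu, r):
    """Log pmf of a negative binomial with mean mu and variance mu + mu**2 / r."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def fit_moments(counts):
    """Method-of-moments fit; requires the variance to exceed the mean."""
    mu = mean(counts)
    var = pvariance(counts)
    if var <= mu:
        raise ValueError("no over-dispersion in these counts")
    return mu, mu * mu / (var - mu)


def mean_holdout_loglik(train, holdout):
    """Fit (mu, r) per weekday on train, then average the log-likelihood on holdout."""
    params = {day: fit_moments(ys) for day, ys in train.items()}
    scores = [nb_logpmf(y, *params[day]) for day, ys in holdout.items() for y in ys]
    return mean(scores)


# Made-up hourly counts keyed by day of week (0 is Monday).
train = {0: [120, 80, 200, 150], 5: [300, 180, 420, 260]}
holdout = {0: [110, 170], 5: [350, 210]}
print(mean_holdout_loglik(train, holdout))
```

A higher (less negative) mean log-likelihood on the hold-out set indicates a better fit, so competing model variations can be compared on this one number. Cross-validation would repeat the split over several folds and average the scores.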
The negative binomial distribution is a common model for Poisson-like data whose variance is too large for a Poisson model, i.e. over-dispersion. The idea is that it is a Poisson model with a latent (gamma) distribution for the mean.
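
The Poisson-gamma mixture can be written out explicitly. The parameterization below (mean $\mu$ and dispersion $r$) is one common choice and is an assumption here; the notebook's model may be parameterized differently.

```latex
\begin{aligned}
\lambda &\sim \mathrm{Gamma}\!\left(\text{shape } r,\ \text{rate } r/\mu\right),
\qquad y \mid \lambda \sim \mathrm{Poisson}(\lambda), \\[4pt]
P(y) &= \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!}\,
        \frac{(r/\mu)^{r}}{\Gamma(r)}\,\lambda^{r-1} e^{-r\lambda/\mu}\, d\lambda
      = \frac{\Gamma(y+r)}{\Gamma(r)\, y!}
        \left(\frac{r}{r+\mu}\right)^{r}
        \left(\frac{\mu}{r+\mu}\right)^{y}, \\[4pt]
\mathbb{E}[y] &= \mu, \qquad
\operatorname{Var}(y) = \mu + \frac{\mu^{2}}{r}.
\end{aligned}
```

The extra term $\mu^{2}/r$ in the variance is the over-dispersion; as $r \to \infty$ it vanishes and the distribution approaches a Poisson with mean $\mu$.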
Group the pickup data by date and hour. Look at trends in the number of pickups for a given hour of the day, and at how these differ when the days are split into weekends, weekdays, etc. For example, we have the following hourly counts for Monday through Thursday (Friday nights should behave differently).