Skip to content

This repo is to analyze the Chicago Taxi data and predict future tax revenue per taxi

Notifications You must be signed in to change notification settings

yihengli/chitaxi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Chi-Taxi

This repo is to analyze the Chicago Taxi data and predict future tax revenue per taxi

How to use this code

This code repository is maintained as a python package. One can simply install it by using

pip install .

However, as it's still under development, it's recommended to use following command to install it

pip install -e .

In the editable mode, python won't try to move the entire folder into site-pacakge. A soft link will be created and any ad-hoc changes in the code can be direclty reflected in your work environments.

Jupyter / Ipython

For jupyter and ipython, you can open the reload option, then any adhoc changes will be dynamically loaded without reimport.

%load_ext autoreload
%autoreload 2

CLI Tools

When initializing the project, chitaxi can bind itself with a local folder, any input and output files can be directly looked up in the given workplace.

This can be done manually by revising the config.yaml file, or use CLI command

chitaxi --config-data PATH_TO_WORKPLACE
Options:
  --config-reset
  --config-data TEXT
  --clean-taxi TEXT
  --help              Show this message and exit.

How to load data

The original data is huge, around 6 - 10GB per month and can be downloaded from here: https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data

Simply put all downlaoded CSV files into one clean folder, use CLI can clean up the file and aggregate them into a HDF5 format dataset and put the h5 file into workspace.

Note:

  1. The dataset is large, expected final output is above 20GB
  2. If the RAM is limited, you have to split each CSV as small as possible. This can be done by using date filters on chicago website
chitaxi --clean-taxi PATH_TO_CSV_FOLDER

Alternatively, one can directly download the pre-cleaned h5 format here:

Under your workspace:

.
├── chitaxi.h5

Then you slice the data into any python kernel:

from chitaxi.datasets.loader import get_data_taxi

# Get data within a range
df = get_data_taxi(start='20140101', end='20140201')

# Get data for a specific year
df = get_data_taxi(year=2015)

Make sure you have the correct hdf format dataset in your CONFIG-DATA folder.

How to run model

The feature and label data can be directly downloaded from here

About

This repo is to analyze the Chicago Taxi data and predict future tax revenue per taxi

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages