Use OSM data for geocoding in all boroughs (#179)
* output direct to intersections.csv

* get_intersection_center

* rename intersection -> grid

* generate all-intersections file

* geocode broadway & 59

* extract avenue parsing code

* exact geocoding above 125

* checkpoint; this abstraction is breaking

* update test logs

* pare back logging, track parse vs. grid

* factor out a single grid.geocode_intersection function

* Sutton Place

* Riverside Drive

* update test logs

* clean up logging

* debug logging

* interpolate between streets

* track interpolations

* loosen up street matching

* parse ordinals; +26

* logging

* generate all intersections

* 49532 NYC intersections

* de-dupe on name; 46420 intersections

* move Grid into a class

* normalize Fifth Avenue -> 5th Avenue

* exact intersection geocoding for OSM (all boroughs)

* normalize second

* expand_abbrevs

* pare back normalization a bit

* require full match for ordinal rewrite

* fix St. Nicholas bug

* try stripping dirs

* bug fix

* try double-strip; probably overkill

* copy-paste mode and filters for geogpt batch

* prompt variation asking to avoid ave/ave intersection

* ask for an array response

* refactor coders

* update some tests

* fix special cases coder; able to repro current results

* rv irrelevant change

* pare back to direction stripping

* TODO

* default to images.ndjson

* Be more careful about matching "Park Ave" not just "Park"; handle Riverside Park

* fix odd 144 bug

* Exclude Central Park South/West/East/North

* pare back logging, update tests

* update data, stats

* stats, sizes for static site

* attempt to restore generate_intersections.csv

* write both

* I am confused

* never match St/Dr at start of street name

* so many intersection files

* generate all three

* St at start is always Saint, not Street

* update test stats

* pare back logging; geocode.py runs in ~5s

* update site stats

* ruff check

* no status bar

* one more place

* spell it right

* consistent rounding

* update data for rounding

* keep the old name
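
Several of the bullets above describe street-name normalization rules ("normalize Fifth Avenue -> 5th Avenue", "require full match for ordinal rewrite", "St at start is always Saint, not Street", "never match St/Dr at start of street name"). The sketch below is one way those rules could fit together. It is not the code from this commit: `normalize_street`, `ORDINAL_WORDS`, and `SUFFIX_ABBREVS` are hypothetical names, and both lookup tables are illustrative assumptions.

```python
# Hypothetical tables for illustration; the project's real tables may differ.
ORDINAL_WORDS = {
    "first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th",
    "fifth": "5th", "sixth": "6th", "seventh": "7th", "eighth": "8th",
    "ninth": "9th", "tenth": "10th",
}
SUFFIX_ABBREVS = {"st": "Street", "ave": "Avenue", "dr": "Drive", "pl": "Place"}


def normalize_street(name: str) -> str:
    """Normalize a street name word by word (assumed rules, see lead-in)."""
    words = name.replace(".", "").split()
    out = []
    for i, word in enumerate(words):
        key = word.lower()
        if i == 0 and key == "st":
            # A leading "St" is read as Saint, never Street.
            out.append("St.")
        elif i > 0 and key in SUFFIX_ABBREVS:
            # Abbreviations are expanded only when they are not the first word,
            # so a leading "Dr" is left untouched.
            out.append(SUFFIX_ABBREVS[key])
        elif key in ORDINAL_WORDS:
            # Only whole-word matches are rewritten ("Fifth" -> "5th").
            out.append(ORDINAL_WORDS[key])
        else:
            out.append(word)
    return " ".join(out)
```

Under these assumptions, `normalize_street("Fifth Avenue")` gives `"5th Avenue"` and `normalize_street("St. Nicholas Ave")` gives `"St. Nicholas Avenue"`. Matching whole words only keeps the ordinal rewrite from touching names that merely contain an ordinal as a substring.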
danvk authored Nov 24, 2024
1 parent c78d46f commit 5b502eb
Showing 19 changed files with 51,582 additions and 294 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/e2etest.yml
@@ -13,16 +13,16 @@ jobs:
tar -xzf geocache.tgz
- name: Run geocoder
run: |
-PYTHONPATH=. poetry run oldnyc/geocode/geocode.py --ids_filter test/random200-ids.txt --images_ndjson data/images.ndjson --output_format id-location.txt --geocode > >(tee test/random200-geocoded.txt) 2> >(tee test/random200.logs.txt >&2)
+PYTHONPATH=. poetry run oldnyc/geocode/geocode.py --ids_filter test/random200-ids.txt --output_format id-location.txt --geocode --no-progress-bar > >(tee test/random200-geocoded.txt) 2> >(tee test/random200.logs.txt >&2)
# See https://stackoverflow.com/a/692407/388951 for the stdout/stderr redirection
- name: Generate intersections
run: |
export PYTHONPATH=.
-poetry run python oldnyc/geocode/osm/generate_intersections.py > data/intersections.csv
+poetry run python oldnyc/geocode/osm/generate_intersections.py
- name: Generate truth data
run: |
export PYTHONPATH=.
-poetry run oldnyc/geocode/geocode.py --images_ndjson data/images.ndjson --output_format geojson --ids_filter data/geocode/random500-ids.txt --geocode > /tmp/images.geojson
+poetry run oldnyc/geocode/geocode.py --output_format geojson --ids_filter data/geocode/random500-ids.txt --geocode > /tmp/images.geojson
poetry run oldnyc/geocode/truth/make_localturk_csv.py data/geocode/random500-ids.txt /tmp/images.geojson data/geocode/random500.csv
# We don't actually care about the diff on this file, just that make_localturk_csv.py doesn't error out.
git checkout data/geocode/random500.csv
@@ -32,7 +32,7 @@ jobs:
- name: Check performance on truth data
run: |
export PYTHONPATH=.
-poetry run oldnyc/geocode/geocode.py --ids_filter data/geocode/truth-ids.txt --images_ndjson data/images.ndjson --output_format geojson --geocode > /tmp/actual.geojson
+poetry run oldnyc/geocode/geocode.py --ids_filter data/geocode/truth-ids.txt --output_format geojson --geocode --no-progress-bar > /tmp/actual.geojson
poetry run oldnyc/geocode/calculate_metrics.py --stats_only --truth_data data/geocode/truth.geojson --computed_data /tmp/actual.geojson > test/geocode-performance.txt
- name: Check for diffs
run: |
@@ -84,7 +84,7 @@ jobs:
run: |
export PYTHONPATH=.
tar -xzf geocache.tgz
-poetry run oldnyc/geocode/geocode.py --images_ndjson data/images.ndjson --lat_lon_map data/lat-lon-map.txt --output_format lat-lon-to-ids.json --geocode > data/lat-lon-to-ids.json 2> >(tee >( sed -n '/Finalizing/,$p' > test/geocoding-stats.txt) >&2)
+poetry run oldnyc/geocode/geocode.py --lat_lon_map data/lat-lon-map.txt --output_format lat-lon-to-ids.json --geocode --no-progress-bar > data/lat-lon-to-ids.json 2> >(tee >( sed -n '/Finalizing/,$p' > test/geocoding-stats.txt) >&2)
- name: Generate static site
run: |
export PYTHONPATH=.
2 changes: 2 additions & 0 deletions data/README.md
@@ -23,6 +23,8 @@ TODO:

## Repro instructions

+If there are no instructions for a file here, check `e2etest.yml`.
+
### osm-roads.json

Run `data/nyc-named-roads.overpass-query.txt` through the Overpass API. This will produce a big JSON file that needs to be filtered. You can do this with: