diff --git a/spatial/README_spatial_samples.txt b/spatial/README_spatial_samples.txt index 5d09d10..261d7d3 100644 --- a/spatial/README_spatial_samples.txt +++ b/spatial/README_spatial_samples.txt @@ -10,19 +10,24 @@ README file for Db2 Spatial Analytics Samples File: samples/spatial/README_spatial_samples.txt -The Db2 Spatial Analytics samples consist of one demo program. - - One sample is based on banking (branches, customers, employees). - This banking demo is written in SQL scripts run by the command-line - processor (CLP). +The Db2 Spatial Analytics samples consist of one demo program +and one jupyter notebook. + 1. The demo program (bank) sample is based on banking (branches, customers, employees). + This banking demo is written in SQL scripts run by the command-line + processor (CLP). + 2. The jupyter notebook (location) is based on using spatial data to find + a new location for company MYCO that is expanding. + This file briefly introduces the demo and indicates where to look for further information. +Note: as of Db2 V11.5.6 these samples are not part of a Db2 installation. = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = The Banking Demo is implemented in SQL scripts that are run with the Db2 command line processor. You can use the demo and scripts as a tutorial. The scripts and README file "saBankDemoREADME.txt" are located in the -"bank" subdirectory (sqllib/extenders/samples/spatial/bank). +"bank" subdirectory (samples/spatial/bank). The following excerpt from that file gives an introduction to the demo: ***************************************************************************** Banking Customer Analysis Sample @@ -59,3 +64,19 @@ After the Banking Demo runs, the complete record of its actions can be found in the file "sa_bank.log" which is in the "tmp", subdirectory under the home directory of the user who ran the demo. += = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = +The Location demo is implemented as Jupyter notebook along with a supporting +SQL script to load the data. +You can use the demo as tutorial into using Jupyter notebooks. +The scripts and README file "README.txt" are located in the +"location" subdirectory (samples/spatial/location). + +The demo is a Spatial Analytics Jupyter notebook version of the Spatial Extender demo found here: +https://www.ibm.com/blogs/cloud-archive/2015/08/location-location-location/ + +Files: +samples/location/location_demo.ipynb - the Jupyter notebook +samples/location/load_data.sql - support script to create the data tables run from the notebook +samples/location/README.txt - a more detailed README +samples/data/geo_county.zip - GEO_COUNTY table dataset +samples/data/geo_customer.zip - GEO_CUSTOMER table dataset diff --git a/spatial/data/geo_county.zip b/spatial/data/geo_county.zip new file mode 100644 index 0000000..c530ca3 Binary files /dev/null and b/spatial/data/geo_county.zip differ diff --git a/spatial/data/geo_customer.zip b/spatial/data/geo_customer.zip new file mode 100644 index 0000000..b5daa47 Binary files /dev/null and b/spatial/data/geo_customer.zip differ diff --git a/spatial/location/README.txt b/spatial/location/README.txt new file mode 100644 index 0000000..5d95c97 --- /dev/null +++ b/spatial/location/README.txt @@ -0,0 +1,69 @@ +README file for Db2 Spatial Analytics Location Sample + +* +* +* (C) COPYRIGHT INTERNATIONAL BUSINESS MACHINES CORPORATION 2021. +* + ALL RIGHTS RESERVED. +* + + +File: samples/spatial/location/README.txt + +The Db2 Spatial Analytics sample consists of a Jupyter notebook with +supporting SQL script init_env.sql. +This file briefly introduces each demo and indicates where to look for +further information. + + += = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = +The demo is implemented in Jupyter notebook creating and using a locally +created database. It uses the Db2 command line provess (CLP) +and the IBM Python driver to interact with the database and local instance. +You can use the demo and scripts as a tutorial to work with a Jupyter notebook, +perform queries and display data on a map. +The data used is located in spatial/samples/data and consists of two tables: +- a customer table containing customer (fake) information. +- a county table containing all US counties with census information. + +The script load_data.sql assumes that the CSV files with the data are co-located +with the notebook and scripts. Thus, prior to running the script in the notebook +extract the data +spatial/data/geo_customer.zip +spatial/data/geo_county.zip +followed by either copying the data to the directory of the script and notebook +or change the SQL script to point to the appropriate path for the files. + +The following excerpt from that file gives an introduction to the demo: +***************************************************************************** +This demo illustrates adding a spatial dimension to an existing information system. +The existing system did not contain any explicit location (spatial) data. +However, the existing system did contain implicit location data in the +form of addresses. By spatially enabling the existing database, +the user expands the business analysis capabilities of the system. + +This demo is a a jupyter notebook version of +https://www.ibm.com/blogs/cloud-archive/2015/08/location-location-location/ + +In this scenario, a small company (MYCO) has two offices, but business has been growing and there are +now customers across the country. Many of the customers have expressed a preference to meet company +representatives in person. The company owners want to explore where to open a new office. + +Some of the questions in MYCO company owners want to answer are: + +We already have some ideas where to open a new office. +- How can we find out which of these potential locations can serve the most customers? +- How can we reach the customers with the highest business volume? +- Are there other locations that should be considered? + +Spatial analysis functions can help find the answers. + +On Db2 Warehouse on Cloud the geospatial data used to bring this example to life can be found in the SAMPLE schema. +It contains data about customers in the GEO_CUSTOMER table and county data in the GEO_COUNTY table +in the Spatial Extender format and need conversion into the Spatial Analytics format first. +However, this notebook also works with Spatial Extender. Only the DB2GSE schema is necessary to be used in +queries for any spatial functions. +You can use the Tables menu to view the structure and browse the content of these tables. + +For more information on Spatial Analytics visit the documentation: +https://www.ibm.com/docs/en/db2/11.5?topic=data-db2-spatial-analytics \ No newline at end of file diff --git a/spatial/location/load_data.sql b/spatial/location/load_data.sql new file mode 100644 index 0000000..8ecb667 --- /dev/null +++ b/spatial/location/load_data.sql @@ -0,0 +1,62 @@ +--Prep the database for the location demo + +-- connect to bludb; + +DROP TABLE GEO_TEMP; +DROP TABLE GEO_COUNTY; +DROP TABLE GEO_CUSTOMER; + +-------------------------------------- +-- Create the SRS needed for the demo. +-------------------------------------- + +CALL ST_DROP_SRS('SAMPLE_GCS_WGS_1984'); +CALL ST_CREATE_SRS('SAMPLE_GCS_WGS_1984', 1005, -400, -400, 1.111948722222222E9,-100000,10000,-100000,10000,'GCS_WGS_1984',NULL, NULL,'GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]]', 'location demo srs'); + +------------------------ +-- Load GEO_COUNTY table +------------------------ + +-- Load raw data. +CREATE TABLE GEO_TEMP (OBJECTID INTEGER NOT NULL PRIMARY KEY, WKT CLOB, STATEFP VARCHAR(2), COUNTYFP VARCHAR(3), COUNTYNS VARCHAR(8), NAME VARCHAR(100), GEOID VARCHAR(5), NAMELSAD VARCHAR(100), LSAD VARCHAR(2), CLASSFP VARCHAR(2), MTFCC VARCHAR(5), CSAFP VARCHAR(3), CBSAFP VARCHAR(5), METDIVFP VARCHAR(5), FUNCSTAT VARCHAR(1), ALAND DECIMAL(14,0), AWATER DECIMAL(14,0), INTPTLAT VARCHAR(11), INTPTLON VARCHAR(12)) ORGANIZE BY ROW; + +-- Adjust the source path and the message path as necessary. +LOAD FROM ../data/county.del OF DEL LOBS FROM ./ MODIFIED BY COLDEL| MESSAGES /tmp/county_load.log INSERT INTO GEO_TEMP(OBJECTID, WKT, STATEFP, COUNTYFP, COUNTYNS, NAME, GEOID, NAMELSAD, LSAD, CLASSFP, MTFCC, CSAFP, CBSAFP, METDIVFP, FUNCSTAT, ALAND, AWATER, INTPTLAT, INTPTLON); + +-- Create and load county table. +CREATE TABLE GEO_COUNTY (OBJECTID INTEGER NOT NULL PRIMARY KEY, Shape SYSIBM.ST_MultiPolygon INLINE LENGTH 32300, STATEFP VARCHAR(2), COUNTYFP VARCHAR(3), COUNTYNS VARCHAR(8), NAME VARCHAR(100), GEOID VARCHAR(5), NAMELSAD VARCHAR(100), LSAD VARCHAR(2), CLASSFP VARCHAR(2), MTFCC VARCHAR(5), CSAFP VARCHAR(3), CBSAFP VARCHAR(5), METDIVFP VARCHAR(5), FUNCSTAT VARCHAR(1), ALAND DECIMAL(14,0), AWATER DECIMAL(14,0), INTPTLAT VARCHAR(11), INTPTLON VARCHAR(12), xmin double generated as (st_minx(shape)), xmax double generated as (st_maxx(shape)), ymin double generated as (st_miny(shape)), ymax double generated as (st_maxy(shape)) ) ORGANIZE BY COLUMN NOT LOGGED INITIALLY; + +INSERT INTO GEO_COUNTY (OBJECTID, SHAPE, STATEFP, COUNTYFP, COUNTYNS, NAME, GEOID, NAMELSAD, LSAD, CLASSFP, MTFCC, CSAFP, CBSAFP, METDIVFP, FUNCSTAT, ALAND, AWATER, INTPTLAT, INTPTLON) ( SELECT OBJECTID, ST_MPolyFromText(WKT, 1005), STATEFP, COUNTYFP, COUNTYNS, NAME, GEOID, NAMELSAD, LSAD, CLASSFP, MTFCC, CSAFP, CBSAFP, METDIVFP, FUNCSTAT, ALAND, AWATER, INTPTLAT, INTPTLON FROM GEO_TEMP); + +-- Create a regular index on the boxfilter columns. +CREATE INDEX GEO_COUNTY_BF_IDX ON GEO_COUNTY (xmin, ymin, xmax, ymax); + +COMMIT; +DROP TABLE GEO_TEMP; + +-------------------------- +-- Load GEO_CUSTOMER table +-------------------------- +-- Load raw data. +CREATE TABLE GEO_TEMP (OBJECTID INTEGER NOT NULL PRIMARY KEY, WKT VARCHAR(256), NAME VARCHAR(254), INSURANCE_VALUE INTEGER) ORGANIZE BY ROW; + +-- Adjust the source path and message path as necessary. +LOAD FROM ../data/customer.del OF DEL MODIFIED BY COLDEL| MESSAGES /tmp/customer_load.log INSERT INTO GEO_TEMP(OBJECTID, WKT, NAME, INSURANCE_VALUE); + +CREATE TABLE GEO_CUSTOMER (OBJECTID INTEGER NOT NULL PRIMARY KEY, SHAPE SYSIBM.ST_POINT, NAME VARCHAR(254), INSURANCE_VALUE INTEGER, xmin double generated as (st_minx(shape)), xmax double generated as (st_maxx(shape)), ymin double generated as (st_miny(shape)), ymax double generated as (st_maxy(shape))) ORGANIZE BY ROW NOT LOGGED INITIALLY; +INSERT INTO GEO_CUSTOMER (OBJECTID, SHAPE, NAME, INSURANCE_VALUE) ( SELECT OBJECTID ,ST_POINTFROMTEXT(WKT, 1005), NAME ,INSURANCE_VALUE FROM GEO_TEMP ); + +-- Create a regular index on the boxfilter columns. +CREATE INDEX GEO_CUSTOMER_BF_IDX ON GEO_CUSTOMER (xmin, ymin, xmax, ymax); + +-- Create a regular index on the INSURANCE_VALUE column. +CREATE INDEX GEO_CUSTOMER_insurance_value_idx ON GEO_CUSTOMER(INSURANCE_VALUE); + +COMMIT; +DROP TABLE GEO_TEMP; + + + + + + diff --git a/spatial/location/location_demo.ipynb b/spatial/location/location_demo.ipynb new file mode 100644 index 0000000..449310f --- /dev/null +++ b/spatial/location/location_demo.ipynb @@ -0,0 +1,828 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " # Location Analysis Sample\n", + "\n", + "This demo illustrates adding a spatial dimension to an existing\n", + "information system. The existing system did not contain any explicit\n", + "location (spatial) data. However, the existing system did contain implicit\n", + "location data in the form of addresses. By spatially enabling the existing\n", + "database, the user expands the business analysis capabilities of the system.\n", + "\n", + "\n", + "A jupyter notebook version of\n", + "https://www.ibm.com/blogs/cloud-archive/2015/08/location-location-location/\n", + " \n", + " \n", + "In this scenario, a small company (MYCO) has two offices, but business has been\n", + "growing and there are now customers across the country. Many of the customers have\n", + "expressed a preference to meet company representatives in person. The company owners\n", + "want to explore where to open a new office.\n", + "\n", + "Some of the questions in MYCO company owners want to answer are:\n", + "\n", + " We already have some ideas where to open a new office. How can we find out which of these potential locations can serve the most customers?\n", + " How can we reach the customers with the highest business volume?\n", + " Are there other locations that should be considered?\n", + "\n", + "Spatial analysis functions can help find the answers\n", + "\n", + "On Db2 Warehouse on Cloud the geospatial data used to bring this example to life can be found in the SAMPLE schema. It contains data about customers in the GEO_CUSTOMER table and county data in the GEO_COUNTIES table. You can use the Tables menu to view the structure and browse the content of these tables." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install Python Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + " !pip install folium ibm_db pandas geopandas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup - Create a database " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Adjust the settings as necessary. For a local database the log file changes are necessary because the defaults\n", + "# are not sufficient to work with the data and queries.\n", + "\n", + "# Switch to warehouse configuration - column organized default and some other settings are different\n", + "#!db2set DB2_WORKLOAD=ANALYTICS\n", + "\n", + "#!db2 -v \"create database BLUDB using codeset UTF-8 territory US collate using identity pagesize 32 k\"\n", + "\n", + "# Make sure log files are big enough if on prem or local env out of the box\n", + "#!db2 update db cfg using logfilsiz 4000\n", + "#!db2 update db cfg using logprimary 10\n", + "#!db2 update db cfg using logsecondary 50" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup - Enable Spatial Analytics " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!db2 -v \"connect to BLUDB\"\n", + "!db2 -v \"CALL SYSPROC.SYSINSTALLOBJECTS('GEO', 'C', NULL, NULL)\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create and Insert data " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook assumes the data files for the tables are unzipped.\n", + "The zipped files can be found in the \"spatial/data\" directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## If the insert into GEO_CUSTOMER fails with\n", + "## SQL0901N The SQL statement or command failed because of a database system \n", + "## error. (Reason \"Max internal handles\".) SQLSTATE=58004\n", + "## Then set registry variable to bypass LOB handle issue on INSERT:\n", + "!db2set DB2_TCG_DEFAULT_OPTIONS=\"set may_release_lob_handle on\"\n", + "## After changing the registry variable the instance needs to be recycled.\n", + "!db2stop force\n", + "!db2start\n", + "\n", + "# Now connect and create tables and load data.\n", + "# The load_data.sql script assumes the data files are unzipped in \"../data/\"\n", + "# as this Jupyter notebook and script.\n", + "# The files are zipped and located in SAMPLE_PATH/spatial/data.\n", + "!db2 -v \"connect to BLUDB\"\n", + "!db2 -tvf load_data.sql" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Connect IBM_DB (Python) client with database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !db2start\n", + "\n", + "# Note: a connection can be local via CLP or via the Python. This notebook uses both intermingled.\n", + "import ibm_db\n", + "from ibm_db import connect\n", + "\n", + "# Local connection\n", + "connection = connect('DSN=BLUDB;', '', '')\n", + "\n", + "# For a remote connection using TCP - instance must be configured properly to listen to TCP connections\n", + "# via DB2COMM and DBM CFG SVCENAME/SSL_SVCENAME\n", + "\n", + "#connection = connect('DATABASE=BANKDEMO;'\n", + "# 'HOSTNAME=127.0.0.1;'\n", + "# 'PORT=50000;'\n", + "# 'PROTOCOL=SOCKETS;'\n", + "# 'UID=db2inst1;'\n", + "# 'PWD=password;', '', '')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import Essential Libraries " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import folium\n", + "from folium import plugins\n", + "from shapely import wkt\n", + "import geopandas as gpd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Begin of scenario" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The current offices are in Pittsburgh and Atlanta. The locations are inserted with the datatype ST_POINT and, as in the following SQL statements, the spatial reference system with the id 1005 is used, which refers to the most popular coordinate system WGS1984:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "create_table_offices=\"CREATE TABLE IF NOT EXISTS offices ( \\\n", + " id integer not null primary key, \\\n", + " name varchar(30) not null, \\\n", + " location st_point not null, \\\n", + " contact varchar(128), \\\n", + " status integer not null default 0 \\\n", + ")\"\n", + "ibm_db.exec_immediate(connection,create_table_offices)\n", + "\n", + "# Two office locations\n", + "insert_table_offices=\"delete from offices; \\\n", + " insert into offices \\\n", + " values (1, 'Pittsburgh', st_point(-79.5835, 40.2623, 1005), 'Dan Smith',0), \\\n", + " (2, 'Atlanta', st_point(-84.2324, 33.4518, 1005), 'Jane Miller',0)\"\n", + "ibm_db.exec_immediate(connection,insert_table_offices)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The customer base has been growing in the west and southwest, so the MYCO owners consider opening an office in Austin, Dallas, Las Vegas, or San Jose. The question now is: Which of these locations is the best suited location as far as customers in its vicinity are concerned.\n", + "\n", + "To answer this question, the potential locations are added to the OFFICE table with status ‘1’ to indicate ‘planned’ status:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Add possible office locations\n", + "insert_table_offices=\" insert into offices \\\n", + " values (11, 'Las Vegas', st_point(-115.1739, 36.1215, 1005), 'tbd', 1), \\\n", + " (12, 'San Jose', st_point(-121.924 , 37.3591, 1005), 'tbd', 1), \\\n", + " (13, 'Austin', st_point(-97.76, 30.3208, 1005), 'tbd', 1), \\\n", + " (14, 'Dallas', st_point(-96.974, 32.6406, 1005), 'tbd', 1)\"\n", + "ibm_db.exec_immediate(connection,insert_table_offices)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Queries and visualization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is mainly the customers with high business volume that are interested in a local contact. If we define 200 miles as the maximum distance for this scenario, then the first step is to find out how many customers with a given business volume are located closer than 200 miles to existing and planned offices.\n", + "\n", + "Spatial Analytics supports two options to find distances, one is using the st_distance function and the other is a combination of st_buffer and st_intersects.\n", + "A quick performance check shows that using the first option\n", + "\n", + " st_intersects(st_buffer(off.location,200,'STATUTE MILE'), cust.shape) = 1\n", + "\n", + "results in a runtime of about 7mins.\n", + "\n", + "Using the second option \n", + "\n", + " st_distance(off.location, cust.shape, 'STATUTE MILE') < 200\n", + "\n", + "reduces the time to about 13s on the same data, same system, same environment.\n", + "\n", + "The second option is recommended when using a unit, here the ‘STATUTE MILE’, so we use the second option:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Could also create this query using a view. That is define new view that performs the select and then query data from the view.\n", + "query=\"select o1.name as office_name, o2.count_customers as customers, st_y(o1.location) as office_lat, st_x(o1.location) as office_lon \\\n", + " from offices o1, \\\n", + " (select off.id as office_id, off.name as office_name, count(*) as count_customers \\\n", + " from offices off, geo_customer cust \\\n", + " where cust.insurance_value > 200000 and \\\n", + " st_distance(off.location, cust.shape, 'STATUTE MILE') < 200 \\\n", + " group by off.id, off.name \\\n", + " order by count_customers) as o2 \\\n", + " where o1.id = o2.office_id order by customers desc\"\n", + "\n", + "# Takes about 13s on SMP in VM.\n", + "stmt = ibm_db.exec_immediate(connection, query)\n", + "result = ibm_db.fetch_both(stmt)\n", + "OFFICE_NAME = []\n", + "NUM_CUSTOMERS = []\n", + "OFFICE_LAT = []\n", + "OFFICE_LON = []\n", + "\n", + "# Read result into arrays.\n", + "while( result ):\n", + " OFFICE_NAME.append(result[0])\n", + " NUM_CUSTOMERS.append(result[1])\n", + " OFFICE_LAT.append(result[2])\n", + " OFFICE_LON.append(result[3])\n", + " result = ibm_db.fetch_both(stmt)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numRows = len(OFFICE_NAME)\n", + "txt = \"Office num results = {}\"\n", + "print(txt.format(numRows))\n", + "\n", + "# Build a hash table from arrays.\n", + "location_office_data = {}\n", + "location_office_data['OFFICE_NAME'] = OFFICE_NAME\n", + "location_office_data['NUM_CUSTOMERS'] = NUM_CUSTOMERS\n", + "location_office_data['OFFICE_LAT'] = OFFICE_LAT\n", + "location_office_data['OFFICE_LON'] = OFFICE_LON\n", + "\n", + "\n", + "# Build data frame from hash table\n", + "df_location_office = pd.DataFrame(data=location_office_data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print the contents of the data frame\n", + "df_location_office[:numRows]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using folium to display results " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# folium library intialize base map\n", + "def generateBaseMap(default_location=[39.5, -98.35], default_zoom_start=4):\n", + " base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start, tiles=\"stamentoner\")\n", + " return base_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Clearly an office in Austin or Dallas will reach more customers than one in Las Vegas or San Jose, but still considerably less than in the already existing offices in Atlanta and Pittsburgh.\n", + "\n", + "In the following picture we have visualized this result on a map drawn." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "location_base_map = generateBaseMap()\n", + "for index, row in df_location_office.iterrows():\n", + " folium.CircleMarker([row['OFFICE_LAT'], row['OFFICE_LON']],\n", + " radius=(row['NUM_CUSTOMERS']/700), # in pixel\n", + " popup=row['OFFICE_NAME'],\n", + " fill_color=\"#3db7e4\", # divvy color\n", + " ).add_to(location_base_map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "location_base_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From this initial exploration or analysis, a location in Dallas looks most promising. However, is this really the best place? We should probably open up the list of potential future office locations and find out whether there are more suitable locations. For instance, we may want to look at counties with the largest number of high business-volume customers. Since one office might serve multiple counties, the customers in neighboring counties should also be considered.\n", + "\n", + "For comparison, let’s first get the count for the existing and planned offices. To optimize the query runtime we use a two stage approach: in the first pass the st_envIntersects function reduces the candidates. The second pass finally checks which county contains the location:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# First WITH clause offloc - find id of the counties where the offices are located\n", + "# Second WITH clause custcount - count customers in these counties\n", + "# Third WITH clause offnei - find id of the neighboring counties where the offices are\n", + "# Fourth WITH clause neighborcount - count customers in neighboring counties\n", + "query=\" \\\n", + "with offloc(officeid, countyid) as \\\n", + "(select off.id, gc.objectid \\\n", + " from offices off, geo_county gc \\\n", + " where gc.xmin <= st_x(off.location) and \\\n", + " st_x(off.location) <= gc.xmax and \\\n", + " gc.ymin <= st_y(off.location) and \\\n", + " st_y(off.location) <= gc.ymax and \\\n", + " st_envIntersects(gc.shape, off.location) = 1 and \\\n", + " st_contains(gc.shape, off.location) = 1 \\\n", + "), \\\n", + "custcount (officeid, count_customers) as \\\n", + "( \\\n", + "select officeid, count(*) \\\n", + "from offloc ol, geo_county gc, geo_customer cust \\\n", + "where ol.countyid = gc.objectid and \\\n", + " cust.insurance_value > 200000 and \\\n", + " gc.xmin <= cust.xmax and \\\n", + " cust.xmin <= gc.xmax and \\\n", + " gc.ymin <= cust.ymax and \\\n", + " cust.ymin <= gc.ymax and \\\n", + " st_envIntersects(gc.shape, cust.shape) = 1 and \\\n", + " st_contains(gc.shape, cust.shape) = 1 \\\n", + "group by officeid \\\n", + "), \\\n", + "offnei(officeid, countyid) as \\\n", + "(select officeid, gnc.objectid \\\n", + " from offloc ol, geo_county gc, geo_county gnc \\\n", + " where ol.countyid = gc.objectid and \\\n", + " gc.objectid <> gnc.objectid and \\\n", + " st_touches(gc.shape, gnc.shape) = 1 \\\n", + "), \\\n", + "neighborcount (officeid, count_neighbors) as \\\n", + "( \\\n", + "select officeid, count(*) \\\n", + "from offnei on, geo_county gc, geo_customer cust \\\n", + "where on.countyid = gc.objectid and \\\n", + " cust.insurance_value > 200000 and \\\n", + " gc.xmin <= cust.xmax and \\\n", + " cust.xmin <= gc.xmax and \\\n", + " gc.ymin <= cust.ymax and \\\n", + " cust.ymin <= gc.ymax and \\\n", + " st_envIntersects(gc.shape, cust.shape) = 1 and \\\n", + " st_contains(gc.shape, cust.shape) = 1 \\\n", + "group by officeid \\\n", + ") \\\n", + "select cc.officeid, off.name, off.status, cc.count_customers, nc.count_neighbors \\\n", + "from custcount cc, neighborcount nc, offices off \\\n", + "where cc.officeid = nc.officeid and off.id = cc.officeid\"\n", + "\n", + "\n", + "stmt = ibm_db.exec_immediate(connection,query)\n", + "\n", + "result = ibm_db.fetch_both(stmt)\n", + "OFFICE_ID = []\n", + "OFFICE_NAME = []\n", + "OFFICE_STATUS = []\n", + "NUM_CUSTOMERS = []\n", + "NUM_NEIGHBORS =[]\n", + "while( result ):\n", + " OFFICE_ID.append(result[0])\n", + " OFFICE_NAME.append(result[1])\n", + " OFFICE_STATUS.append(result[2])\n", + " NUM_CUSTOMERS.append(result[3])\n", + " NUM_NEIGHBORS.append(result[4])\n", + " result = ibm_db.fetch_both(stmt)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numRows = len(OFFICE_ID)\n", + "txt = \"Office num results = {}\"\n", + "print(txt.format(numRows))\n", + "\n", + "# Create data frame\n", + "location_data_2 = {}\n", + "location_data_2['OFFICE_ID'] = OFFICE_ID\n", + "location_data_2['OFFICE_NAME'] = OFFICE_NAME\n", + "location_data_2['NUM_CUSTOMERS'] = NUM_CUSTOMERS\n", + "location_data_2['NUM_NEIGHBORS'] = NUM_NEIGHBORS\n", + "df_location_data_2 = pd.DataFrame(data=location_data_2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_location_data_2[:numRows]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are now ready to explore other locations and query for the 10 counties with the highest customer count adding the customers in neighboring counties (be patient, this might take a minute to complete):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Find top 10 counties with customers\n", + "\n", + "# First WITH clause candidates - get the counties with most customers and their customer count\n", + "# Second WITH clause neighbors - get the neighboring counties of top 10 counties\n", + "# Third WITH clause neighborcount - count customers in neighboring counties\n", + "query=\" \\\n", + "with \\\n", + "candidates(objectid, name, count_customers) as \\\n", + "( \\\n", + "select county.objectid, county.name, count(*) as count_customers \\\n", + "from geo_customer cust, geo_county county \\\n", + "where insurance_value > 200000 and \\\n", + " county.xmin <= cust.xmax and \\\n", + " cust.xmin <= county.xmax and \\\n", + " county.ymin <= cust.ymax and \\\n", + " cust.ymin <= county.ymax and \\\n", + " st_contains(county.shape, cust.shape) = 1 \\\n", + "group by county.name, county.objectid \\\n", + "order by count_customers desc \\\n", + "fetch first 10 rows only \\\n", + "), \\\n", + "neighbors(candid, neighbor_id) as \\\n", + "( \\\n", + "select cand.objectid, gc2.objectid \\\n", + "from candidates cand, geo_county gc1, geo_county gc2 \\\n", + "where cand.objectid = gc1.objectid and \\\n", + " gc1.objectid <> gc2.objectid and \\\n", + " st_touches(gc1.shape, gc2.shape) = 1 \\\n", + "), \\\n", + "neighborcount(candid, count_neighbors) as \\\n", + "( \\\n", + "select candid, count(*) \\\n", + "from neighbors n, geo_county co, geo_customer cust \\\n", + "where n.neighbor_id = co.objectid and \\\n", + " insurance_value > 200000 and \\\n", + " co.xmin <= cust.xmax and \\\n", + " cust.xmin <= co.xmax and \\\n", + " co.ymin <= cust.ymax and \\\n", + " cust.ymin <= co.ymax and \\\n", + " st_contains(co.shape, cust.shape) = 1 \\\n", + "group by n.candid \\\n", + ") \\\n", + "select c.name, c.objectid, c.count_customers, nc.count_neighbors, st_astext(gc.shape) as wkt \\\n", + "from candidates c, neighborcount nc, geo_county gc \\\n", + "where c.objectid = nc.candid and c.objectid = gc.objectid \\\n", + "order by c.count_customers + nc.count_neighbors desc\"\n", + "\n", + "# Takes about 2min30s on SMP on VM.\n", + "stmt = ibm_db.exec_immediate(connection,query)\n", + "\n", + "result = ibm_db.fetch_both(stmt)\n", + "COUNTY_NAME = []\n", + "COUNTY_ID = []\n", + "NUM_CUSTOMERS = []\n", + "NUM_NEIGHBORS = []\n", + "COUNTY_WKT = []\n", + "while( result ):\n", + " COUNTY_NAME.append(result[0])\n", + " COUNTY_ID.append(result[1])\n", + " NUM_CUSTOMERS.append(result[2])\n", + " NUM_NEIGHBORS.append(result[3])\n", + " COUNTY_WKT.append(result[4])\n", + " result = ibm_db.fetch_both(stmt)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numRows = len(COUNTY_NAME)\n", + "txt = \"Office num results = {}\"\n", + "print(txt.format(numRows))\n", + "\n", + "# Create data frame.\n", + "location_county_data = {}\n", + "location_county_data['COUNTY_NAME'] = COUNTY_NAME\n", + "location_county_data['COUNTY_ID'] = COUNTY_ID\n", + "location_county_data['NUM_CUSTOMERS'] = NUM_CUSTOMERS\n", + "location_county_data['NUM_NEIGHBORS'] = NUM_NEIGHBORS\n", + "location_county_data['COUNTY_WKT'] = COUNTY_WKT\n", + "df_location_county_data = pd.DataFrame(data=location_county_data)\n", + "# Convert WKT into geometry type.\n", + "df_location_county_data['COUNTY_WKT'] = df_location_county_data['COUNTY_WKT'].apply(wkt.loads)\n", + "# Create a GeoDataFrame.\n", + "df_gpd_location_county = gpd.GeoDataFrame(df_location_county_data, geometry='COUNTY_WKT')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the data frame data.\n", + "df_location_county_data[:numRows]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interesting. According to this analysis, a location in county Carroll would be best.\n", + "\n", + "See the following image that displays the results of the query that color-codes the number of customers in the county and its neighboring counties (red means many customers, blue means fewer customers). You can also see the neighboring counties that contribute to the total number of customers.\n", + "\n", + "Build a map showing 10 counties with most customers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "county_base_map = folium.Map(location=[39.5, -98.35], zoom_start=4, tiles='CartoDB positron')\n", + "\n", + "# Build the county ploygons of the 10 counties.\n", + "for index, row in df_gpd_location_county.iterrows():\n", + " # Extract each geometry as GeoSeries and convert to GeoJSON.\n", + " sim_geo = gpd.GeoSeries(row['COUNTY_WKT'])\n", + " geo_j = sim_geo.to_json()\n", + " combined_number = row['NUM_CUSTOMERS'] + row['NUM_NEIGHBORS']\n", + " txt = \"combined_number = {}\\n\".format(combined_number)\n", + " print(txt)\n", + " \n", + " # Crudely using different color codes\n", + " if (combined_number > 1800):\n", + " geo_j = folium.GeoJson(data=geo_j,\n", + " style_function=lambda x: {'fillColor': '#FE0002'})\n", + " elif (combined_number <= 1800 and combined_number > 1470):\n", + " geo_j = folium.GeoJson(data=geo_j,\n", + " style_function=lambda x: {'fillColor': '#A1015D'})\n", + " else:\n", + " geo_j = folium.GeoJson(data=geo_j,\n", + " style_function=lambda x: {'fillColor': '#0302FC'})\n", + " \n", + " txt = \"{0}: \\n Potential Customers = {1}\".format(row['COUNTY_NAME'], str(row['NUM_CUSTOMERS'] + row['NUM_NEIGHBORS']))\n", + " folium.Popup(txt).add_to(geo_j)\n", + " folium.Tooltip(row['COUNTY_NAME']).add_to(geo_j)\n", + " geo_j.add_to(county_base_map)\n", + "\n", + "# Show proposed locations.\n", + "for index, row in df_location_office.iterrows():\n", + " folium.CircleMarker([row['OFFICE_LAT'], row['OFFICE_LON']],\n", + " radius=(row['NUM_CUSTOMERS']/700), # in pixel\n", + " popup=row['OFFICE_NAME'],\n", + " fill_color=\"#3db7e4\", # divvy color\n", + " ).add_to(county_base_map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the map with the added geometries.\n", + "county_base_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But let’s also check customers in a specified distance like we did earlier, using the center of the individual county." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Look in a 200 mile radius for the top 10 counties and get the centroid of the counties\n", + "# uses box filter with 3 degree expansion. 1 degree is about 70 statute miles in length.\n", + "\n", + "query=\" \\\n", + "select a.county_id, a.county_name, a.count_customers, st_y(b.centroid) as lat, st_x(b.centroid) as lon \\\n", + "from \\\n", + " (select cc.objectid as county_id, cc.name as county_name, count(*) as count_customers \\\n", + " from geo_county cc, geo_customer cust \\\n", + " where \\\n", + " cust.insurance_value > 200000 and \\\n", + " cc.xmin <= cust.xmax + 3.0 and \\\n", + " cust.xmin <= cc.xmax + 3.0 and \\\n", + " cc.ymin <= cust.ymax + 3.0 and \\\n", + " cust.ymin <= cc.ymax + 3.0 and \\\n", + " st_distance(cc.shape, cust.shape, 'STATUTE MILE') < 200 \\\n", + " group by cc.objectid, cc.name \\\n", + " order by count_customers desc) as a, \\\n", + " (select gc.objectid as oid, st_centroid(gc.shape) as centroid from geo_county gc) as b \\\n", + " where a.county_id = b.oid and a.county_id in (821, 646, 542, 698, 183, 1964, 327, 1075, 579, 418)\"\n", + "\n", + "# Takes about 1min15s on SMP VM.\n", + "stmt = ibm_db.exec_immediate(connection,query)\n", + "\n", + "result = ibm_db.fetch_both(stmt)\n", + "COUNTY_ID_2 = []\n", + "COUNTY_NAME_2 = []\n", + "NUM_CUSTOMERS_2 = []\n", + "COUNTY_CENTROID_LAT = []\n", + "COUNTY_CENTROID_LON = []\n", + "while( result ):\n", + " COUNTY_ID_2.append(result[0])\n", + " COUNTY_NAME_2.append(result[1])\n", + " NUM_CUSTOMERS_2.append(result[2])\n", + " COUNTY_CENTROID_LAT.append(result[3])\n", + " COUNTY_CENTROID_LON.append(result[4])\n", + " result = ibm_db.fetch_both(stmt)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create data frame.\n", + "location_county_data_2 = {}\n", + "location_county_data_2['COUNTY_ID'] = COUNTY_ID_2\n", + "location_county_data_2['COUNTY_NAME'] = COUNTY_NAME_2\n", + "location_county_data_2['NUM_CUSTOMERS'] = NUM_CUSTOMERS_2\n", + "location_county_data_2['COUNTY_CENTROID_LAT'] = COUNTY_CENTROID_LAT\n", + "location_county_data_2['COUNTY_CENTROID_LON'] = COUNTY_CENTROID_LON\n", + "df_location_county_data_2 = pd.DataFrame(data=location_county_data_2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the data frame data.\n", + "df_location_county_data_2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "county_base_map_2 = folium.Map(location=[39.5, -98.35], zoom_start=4, tiles='CartoDB positron')\n", + "# Show proposed locations.\n", + "for index, row in df_location_county_data_2.iterrows():\n", + " folium.CircleMarker([row['COUNTY_CENTROID_LAT'], row['COUNTY_CENTROID_LON']],\n", + " radius=(row['NUM_CUSTOMERS']/700), # in pixel\n", + " tooltip=row['COUNTY_NAME'],\n", + " fill_color=\"#3db7e4\", # divvy color\n", + " ).add_to(county_base_map_2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the map with the added geometries.\n", + "county_base_map_2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "According to this exploration, with both Lincoln and Douglas county located in Missouri, it seems we should look more closely at places in Missouri for our next planned office location." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# End of scenario" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}