The best way to utilize the Databricks environment is to use Spark and the advantages that come along with it. Here are a few examples of how geospatial workflows can be optimized and what we have successfully implemented thus far.
- The first way to efficiently implement geospatial workflows in Databricks is through libraries that extend Spark into geospatial work.
- Many of these libraries are Scala-based.
- Another way of efficiently using the Databricks environment is by applying user defined functions (UDFs) to Spark DataFrames and Delta tables.
- This was successfully implemented; however, most of these benefits only become apparent on large datasets (see the UDF sketch after this list).
- Grid systems such as S2 and GeoHex divide spatial data into grid cells to better perform parallel computation.
- not yet implemented
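As an illustration of the UDF approach above, here is a minimal sketch that applies a simple geospatial UDF to a Spark DataFrame. The haversine distance calculation, the sample stations, and the reference point are all assumptions chosen only for the example, not part of the actual pipeline.

```python
import math

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Example station coordinates (illustrative values only).
df = spark.createDataFrame(
    [("Ottawa", 45.42, -75.70), ("Toronto", 43.65, -79.38)],
    ["station", "lat", "lon"],
)

# Arbitrary reference point for the example (Montreal).
REF_LAT, REF_LON = 45.5017, -73.5673


@udf(returnType=DoubleType())
def haversine_km(lat, lon):
    # Great-circle distance in km from the reference point, computed per row.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(REF_LAT), math.radians(lat)
    dphi = math.radians(lat - REF_LAT)
    dlmb = math.radians(lon - REF_LON)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# The UDF runs in parallel across the cluster, one task per partition.
df.withColumn("dist_to_ref_km", haversine_km("lat", "lon")).show()
```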
ECCC operates the Meteorological Service of Canada (MSC), which provides open data. This information can be found here.
- This use case was an example provided by the MSC open data portal here. The pipeline uses Web Map Services (WMS) and temporal queries to display calculated results in multiple formats.
- We were able to directly implement the script in the Databricks environment. To fully utilize Databricks, it is recommended that the data be placed in Spark DataFrames rather than pandas DataFrames so that Spark can run parallel jobs (a conversion sketch follows this list).
- The way the original script was set up allowed for no improvement in performance even after moving the data into distributed Spark DataFrames.
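A minimal sketch of the recommended change, assuming the WMS results initially land in a pandas DataFrame: the placeholder data below stands in for whatever the original script builds, and the only point is the conversion to a distributed Spark DataFrame.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for data that the original script would build with pandas
# from the WMS temporal queries.
pdf = pd.DataFrame(
    {
        "timestamp": ["2021-01-01T00:00", "2021-01-01T01:00", "2021-01-01T02:00"],
        "value": [1.2, 3.4, 2.1],
    }
)

# Convert to a Spark DataFrame so subsequent operations can be executed
# as parallel jobs instead of on the driver alone.
sdf = spark.createDataFrame(pdf)
sdf.printSchema()
sdf.show()
```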
- This use case was an example provided by the MSC open data portal here. The pipeline uses the GeoMet-OGC-API, which serves vector data in GeoJSON format following Open Geospatial Consortium standards.
- This script was implemented in the Databricks environment.
- The performance of the Spark DataFrames was much slower than that of the original code, with both run in the same environment.
- This is because the original script gave each station a separate dataframe, meaning that further splitting up the already small dataframes is unnecessary.
- It was also difficult to fully test the performance of this script since only two parts of it fully utilize the dataset.
- See the Optimization Ideas below for how we could improve performance.
- We ran into problems resolving dependencies for the mapping library cartopy, specifically the GEOS library. While this may be fixable in the environment, we used the folium library instead to map the resulting data.
- The folium library builds on the leaflet.js library and can be easily integrated with Python web development tools such as Flask and Django (an example can be found here); see the mapping sketch after this list.
- This means that web mapping examples provided by ECCC can also be implemented in Databricks.
- Because the original script split the data by station, with a separate dataframe for each station, Spark underperformed compared to the original script, which was built for a single node.
- This will be a common issue when moving scripts into the Spark environment and something we will have to think about to use Databricks effectively.
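As a sketch of the folium approach mentioned above (with assumed station coordinates rather than real GeoMet output), a map can be built and either saved to HTML or rendered inline in a Databricks notebook:

```python
import folium

# Placeholder station coordinates; the real values come from the GeoMet data.
stations = [
    ("Ottawa", 45.42, -75.70),
    ("Toronto", 43.65, -79.38),
    ("Montreal", 45.50, -73.57),
]

# Centre the map roughly over the stations and add a marker for each one.
m = folium.Map(location=[45.0, -76.0], zoom_start=5)
for name, lat, lon in stations:
    folium.Marker(location=[lat, lon], popup=name).add_to(m)

# Save to HTML; in a Databricks notebook the map can also be rendered
# inline, e.g. with displayHTML(m._repr_html_()).
m.save("stations_map.html")
```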
- This script is an example of how the Databricks environment can be used efficiently in a way that maximizes its benefits.
- We used the GeoMet API and Delta tables to implement Spark jobs that demonstrate the strengths of Databricks and Spark (see the Delta table sketch after this list).
- When comparing pandas DataFrames and Spark Delta tables, the Spark approach of splitting, operating, and joining made runtimes longer for small files of fewer than 500,000 records.
- However, this should be tested on larger datasets with UDFs to measure runtime differences.
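A minimal sketch of the Delta table pattern described above, assuming a Databricks cluster where the Delta format is available; the table name, columns, and sample records are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for records retrieved from the GeoMet-OGC-API.
records = [
    ("station_a", "2021-01-01", 1.2),
    ("station_b", "2021-01-01", 3.4),
]
df = spark.createDataFrame(records, ["station", "date", "value"])

# Persist the results as a Delta table so later Spark jobs and SQL
# queries can read them without calling the API again.
df.write.format("delta").mode("overwrite").saveAsTable("geomet_observations")

# Query the Delta table with Spark SQL.
spark.sql(
    "SELECT station, AVG(value) AS avg_value "
    "FROM geomet_observations GROUP BY station"
).show()
```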
- Optimize scripts using Spark DataFrames and Delta tables
- Better test performance by using larger datasets
- Find a way to implement cartopy for mapping of the resulting data
- The required step here is to install the GEOS library (libgeos) on the cluster or in the notebook environment
- Implement all stations as a single dataframe
- Store items retrieved from the API in a Delta Lake table
- Decrease the number of shuffle partitions (see the sketch after this list)
- This improved SQL query runtimes by around 65%
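A sketch combining two of the ideas above, namely merging the per-station dataframes into a single dataframe and lowering spark.sql.shuffle.partitions for small data; the station dataframes and the partition count shown are placeholder assumptions:

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Fewer shuffle partitions suit the small per-station datasets better
# than the default of 200; the exact value should be tuned.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Placeholders for the per-station dataframes the original script creates.
station_dfs = [
    spark.createDataFrame([("station_a", 1.2)], ["station", "value"]),
    spark.createDataFrame([("station_b", 3.4)], ["station", "value"]),
]

# Combine all stations into a single dataframe so Spark can parallelize
# one large job instead of many tiny ones.
all_stations = reduce(DataFrame.unionByName, station_dfs)
all_stations.groupBy("station").avg("value").show()
```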
Using the Canadian Space Agency's repo, scripts were moved into a Databricks environment to demonstrate RADARSAT-1 satellite imagery. More details of this project can be found here.