Skip to content

Adding spatial data queries to Phoenix on HBase

Asterios edited this page Sep 15, 2016 · 6 revisions

Some databases such as Oracle have had spatial data modules for decades, but newer databases are often missing functionality we're used to having. Phoenix has brought back a lot of useful SQL functionality to HBase but one thing was missing which was spatial data queries. Just to be able to do that we had to double write to ElasticSearch when Hbase/Phoenix would have given us everything else we needed. So today I added that piece of functionality manually to our Phoenix installation. The trick is that indexing on latitude and longitude will not yield performance for distance radius searches, so you need geohashes, which are like tiles at different zoom levels. They are nested and have a short string representation with the larges tiles being the most significant digits of the string. We have a 100-200 million row semi-static table with about 18 columns to index on 10 nodes (using salt buckets and immutable rows).

I found the easy to use jgeohash library from alpharogroup.de and used it to populate a new geohash column computed from the latitude and longitude columns of the existing data:

GeoHashUtils.encode(lat.toDouble, lon.toDouble)

Using Spark, it had only taken an hour with 8 executors to load the data from HDFS flat files to Phoenix/HBase, and with a new secondary index on the new geohash column (plus INCLUDEd covered columns) and one other I added the time increased to about 1.5 hrs.

The next step was to construct the query to do a radius search using the new index. This convenient function gives you the adjacent tiles to do the filtered geographic radius search on:

    val areas: util.List[String] = GeoHashUtils.getAllAdjacentAreasList(g.substring(0, precision));

select
     id, latitude, longitude, ...
     from mytable where
     (GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?
     or GEOHASH like ?)
     and other_field = ?
     and (other_field2 = ? or other_field2 = ?)
     limit 100

The 'precision' is from the table in the Elastic search documentation the explains how many characters of the geohash you need to cover a particular size circle.

Setting the parameters of the prepared statement using LIKE since Phoenix didn't like using IN with LEFT(string, n):

  for
      (i <- (0 until areas.size())) {
        ps.setString(i + 1, areas.get(i)+"%")
      }

The geohash matches give you matches within a box, and to pare down to the radius circle, since I was using LIMIT, I didn't want to post-filter on the client to get the exact results since many rows could be lost as the box can be much larger than the circle. So I had to create a User Defined Function in Phoenix to add the additional filtering in the where clause (note the geohash index is still doing the bulk of the heavy lifting, paring down 100,000,000 to around 100-1000 rows):

    public class DistanceCalc extends org.apache.phoenix.expression.function.ScalarFunction {

    // See http://phoenix-hbase.blogspot.in/2013/04/how-to-add-your-own-built-in-function.html

    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {

        List<Double> list = new ArrayList<Double>();
        for (int i = 0; i < 4; i++) {

            Expression arg = getChildren().get(i);
            if (!arg.evaluate(tuple, ptr)) {
                return false;
            }
            if (!arg.getDataType().equals(PDouble.INSTANCE)) return false;
            Double adouble = (Double)PDouble.INSTANCE.toObject(ptr, PDouble.INSTANCE);
            if (adouble == null) {
                return true;
            }
            list.add(adouble);
        }
        if (list.size() != 4) {
            return false;
        }

        Double result = de.alpharogroup.jgeohash.distance.DistanceCalculator.distance(
                list.get(0),
                list.get(1),
                list.get(2),
                list.get(3),
                MeasuringUnit.MILE);

        ptr.set(PDouble.INSTANCE.toBytes(result));
        return true;
    }

    public String getName() {
        return "MY_DISTANCE";
    }

    public PDataType getDataType() {
        return PDouble.INSTANCE;
    }

    }

pom.xml:

        <dependency>
            <groupId>org.apache.phoenix</groupId>
            <artifactId>phoenix-core</artifactId>
            <version>4.4.0-HBase-1.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>de.alpharogroup</groupId>
            <artifactId>jgeohash</artifactId>
            <version>1.4.6</version>
        </dependency>

CREATE FUNCTION MY_DISTANCE(double,double,double,double) 

returns double as 'com.myco.DistanceCalc' using jar 'hdfs://10.0.0.1/hbase-udf/udf-1.0-SNAPSHOT-jar-with-dependencies.jar'

so the query can include the additional line:

 and MY_DISTANCE(cast(latitude as double), cast(longitude as double), ?, ?) < ?

Where the additional parameters are the lat/lon of the search point and the radius in miles.

This all returned results in approximately the one second range using the Phoenix JDBC driver and setting the connection property: phoenix.functions.allowUserDefinedFunctions=true and the corresponding property in hbase-site.xml on the HBase server.

Additional link: https://phoenix.apache.org/udf.html#How_to_write_custom_UDF

Thanks to @threedliteguy for providing this great example.

Note

If you use the new jgeohash version 2.x.x you have to use the new class GeoHashExtensions instead of GeoHashUtils.

Clone this wiki locally