[SEDONA-689] Geostats SQL #1736

Merged
merged 2 commits into from
Jan 2, 2025
123 changes: 123 additions & 0 deletions docs/api/sql/Function.md
@@ -708,6 +708,29 @@ Output:
32618
```

## ST_BinaryDistanceBandColumn

Introduction: Returns a `weights` column containing every record in a dataframe within a specified `threshold` distance.

The `weights` column is an array of structs containing the `attributes` from each neighbor and that neighbor's weight. Since this is a binary distance band function, weights of neighbors within the threshold will always be
`1.0`.

Format: `ST_BinaryDistanceBandColumn(geometry: Geometry, threshold: Double, includeZeroDistanceNeighbors: Boolean, includeSelf: Boolean, useSpheroid: Boolean, attributes: Struct)`

Since: `v1.7.1`

SQL Example

```sql
ST_BinaryDistanceBandColumn(geometry, 1.0, true, true, false, struct(id, geometry))
```

Output:

```
[{{15, POINT (3 1.9)}, 1.0}, {{16, POINT (3 2)}, 1.0}, {{17, POINT (3 2.1)}, 1.0}, {{18, POINT (3 2.2)}, 1.0}]
```
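
The band construction can be illustrated outside Spark. What follows is a minimal pure-Python sketch, not Sedona's implementation: the helper name `binary_distance_band`, the `(id, x, y)` tuple layout, planar Euclidean distance, and the inclusive `<=` threshold comparison are all assumptions made for illustration.

```python
import math


def binary_distance_band(points, threshold,
                         include_zero_distance_neighbors=True,
                         include_self=False):
    """Return {id: [(neighbor_id, 1.0), ...]} for (id, x, y) tuples.

    Hypothetical helper: planar distance and an inclusive threshold
    are assumptions, not confirmed Sedona behavior.
    """
    bands = {}
    for i, (id_a, xa, ya) in enumerate(points):
        neighbors = []
        for j, (id_b, xb, yb) in enumerate(points):
            if i == j:
                # The record itself is only listed when include_self is set.
                if include_self:
                    neighbors.append((id_b, 1.0))
                continue
            dist = math.hypot(xa - xb, ya - yb)
            if dist > threshold:
                continue
            if dist == 0.0 and not include_zero_distance_neighbors:
                continue
            # Binary band: every neighbor inside the threshold weighs 1.0.
            neighbors.append((id_b, 1.0))
        bands[id_a] = neighbors
    return bands


# The four sample points from the output above.
points = [(15, 3.0, 1.9), (16, 3.0, 2.0), (17, 3.0, 2.1), (18, 3.0, 2.2)]
bands = binary_distance_band(points, 1.0, include_self=True)
```

With `include_self=True`, each of the four sample points lists all four records with weight `1.0`, which mirrors the shape of the output above.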

## ST_Boundary

Introduction: Returns the closure of the combinatorial boundary of this Geometry.
@@ -1107,6 +1130,31 @@ true
!!!Warning
    For geometries that span more than 180 degrees in longitude without actually crossing the Date Line, this function may still return true, indicating a crossing.

## ST_DBSCAN

Introduction: Performs a DBSCAN clustering across the entire dataframe.

Returns a struct containing the cluster ID and a boolean indicating if the record is a core point in the cluster.

- `epsilon` is the maximum distance between two points for them to be considered as part of the same cluster.
- `minPoints` is the minimum number of neighbors a single record must have to form a cluster.

Format: `ST_DBSCAN(geom: Geometry, epsilon: Double, minPoints: Integer)`

Since: `v1.7.1`

SQL Example

```sql
SELECT ST_DBSCAN(geom, 1.0, 2)
```

Output:

```
{true, 85899345920}
```
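
To make the roles of `epsilon` and `minPoints` concrete, here is a small pure-Python sketch of the DBSCAN idea. It is not Sedona's distributed implementation: the function name `dbscan_sketch`, planar distance, counting a point as its own neighbor, and the small sequential cluster IDs are assumptions for illustration (the cluster IDs Sedona returns are arbitrary long values, as in the output above).

```python
import math


def dbscan_sketch(points, epsilon, min_points):
    """Return a list of (is_core, cluster_id) pairs, one per input point.

    Hypothetical helper. A point counts itself among its neighbors
    (an assumption); points that join no cluster get cluster_id -1.
    """
    n = len(points)
    # Indices of all points within epsilon of each point (self included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]
    cluster = [-1] * n
    next_id = 0
    for i in range(n):
        if not is_core[i] or cluster[i] != -1:
            continue
        # Grow a new cluster outward from this unassigned core point.
        cluster[i] = next_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if cluster[q] == -1:
                    cluster[q] = next_id
                    # Only core points keep expanding the cluster.
                    if is_core[q]:
                        stack.append(q)
        next_id += 1
    return list(zip(is_core, cluster))


sample = [(0, 0), (0, 0.5), (0, 1.0), (5, 5), (5, 5.5), (20, 20)]
labels = dbscan_sketch(sample, epsilon=1.0, min_points=2)
```

In this sample the first three points form one cluster, the next two form a second, and the isolated point at `(20, 20)` is left unclustered.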

## ST_Degrees

Introduction: Converts an angle in radians to degrees.
@@ -1874,6 +1922,31 @@ Output:
ST_LINESTRING
```

## ST_GLocal

Introduction: Runs Getis and Ord's G Local (Gi or Gi*) statistic on the variable `x`, given the `weights` column and the `star` flag.

Getis and Ord's Gi and Gi* statistics are used to identify data points with locally high values (hot spots) and low
values (cold spots) in a spatial dataset. Set `star` to `true` to include the focal observation in the calculation (Gi*).

The `ST_WeightedDistanceBandColumn` and `ST_BinaryDistanceBandColumn` functions can be used to generate the `weights` column.

Format: `ST_GLocal(x: Double, weights: Array<Struct>, star: Boolean)`

Since: `v1.7.1`

SQL Example

```sql
ST_GLocal(myVariable, ST_BinaryDistanceBandColumn(geometry, 1.0, true, true, false, struct(myVariable, geometry)), true)
```

Output:

```
{0.5238095238095238, 0.4444444444444444, 0.001049802637104223, 2.4494897427831814, 0.00715293921771476}
```
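
The five output values are consistent with the statistic, its expectation, its variance, a z-score, and a p-value. As a hedged illustration, the sketch below computes these for a single focal record using the textbook Gi formulas (focal observation excluded). The function name `g_local_sketch`, the `(neighbor_index, weight)` input layout, and the one-sided p-value are assumptions, not a description of Sedona's internals.

```python
import math


def g_local_sketch(values, focal, neighbor_weights):
    """Return (G, E[G], Var[G], Z, P) for the record at index `focal`.

    Hypothetical helper following the textbook Gi definition: the focal
    value is excluded from every sum. `neighbor_weights` is a list of
    (neighbor_index, weight) pairs for the focal record.
    """
    n = len(values)
    others = [v for idx, v in enumerate(values) if idx != focal]
    sum_x = sum(others)
    sum_x2 = sum(v * v for v in others)
    w_total = sum(w for _, w in neighbor_weights)

    # G: weighted sum over neighbors divided by the sum over all other records.
    g = sum(w * values[j] for j, w in neighbor_weights) / sum_x
    expected = w_total / (n - 1)

    mean = sum_x / (n - 1)
    spread = sum_x2 / (n - 1) - mean ** 2
    variance = (w_total * (n - 1 - w_total) * spread) / (
        (n - 1) ** 2 * (n - 2) * mean ** 2
    )

    z = (g - expected) / math.sqrt(variance)
    # One-sided tail probability of |z| under a standard normal (assumption).
    p = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return g, expected, variance, z, p


# Record 0 has two neighbors (records 1 and 2), each with binary weight 1.0.
stats = g_local_sketch([1.0, 2.0, 3.0, 4.0], 0, [(1, 1.0), (2, 1.0)])
```

A z-score well above zero suggests a hot spot around the focal record, and one well below zero suggests a cold spot.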

## ST_H3CellDistance

Introduction: Returns the result of the H3 function [gridDistance(cell1, cell2)](https://h3geo.org/docs/api/traversal#griddistance).
@@ -2657,6 +2730,34 @@ Output:
LINESTRING (69.28469348539744 94.28469348539744, 100 125, 111.70035626068274 140.21046313888758)
```

## ST_LocalOutlierFactor

Introduction: Computes the Local Outlier Factor (LOF) for each point in the input dataset.

Local Outlier Factor is an algorithm for determining the degree to which a single record is an inlier or outlier. It is
based on how close a record is to its `k` nearest neighbors vs how close those neighbors are to their `k` nearest
neighbors. Values substantially less than `1` imply that the record is an inlier, while values greater than `1` imply that
the record is an outlier.

!!!Note
    ST_LocalOutlierFactor has a useSphere parameter rather than a useSpheroid parameter. This function thus uses a spherical model of the earth rather than an ellipsoidal model when calculating distance.

Format: `ST_LocalOutlierFactor(geometry: Geometry, k: Int, useSphere: Boolean)`

Since: `v1.7.1`

SQL Example

```sql
SELECT ST_LocalOutlierFactor(geometry, 5, true)
```

Output:

```
1.0009256283408587
```
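
The comparison described above (a record's distance to its `k` nearest neighbors versus those neighbors' own distances) can be sketched in pure Python. This is an illustrative version of the standard LOF definition, not Sedona's implementation: the name `lof_sketch`, planar distance, and index-based tie-breaking among equidistant neighbors are assumptions, and it does not handle duplicate points.

```python
import math


def lof_sketch(points, k):
    """Return the Local Outlier Factor of every point.

    Hypothetical helper: uses exactly k neighbors per point, breaking
    distance ties by index. Assumes no duplicate points.
    """
    n = len(points)
    knn = []     # indices of the k nearest neighbors of each point
    k_dist = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        knn.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])

    def reach_dist(p, o):
        # Reachability distance of p from o: never less than o's k-distance.
        return max(k_dist[o], math.dist(points[p], points[o]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, o) for o in knn[i]) for i in range(n)]

    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[o] for o in knn[i]) / (k * lrd[i]) for i in range(n)]


# Five evenly spaced points on a line plus one distant point.
sample = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
scores = lof_sketch(sample, k=2)
```

In this sample the distant point at `(10, 0)` scores well above `1`, while the point in the middle of the evenly spaced run scores below `1`.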

## ST_LocateAlong

Introduction: This function computes Point or MultiPoint geometries representing locations along a measured input geometry (LineString or MultiLineString) corresponding to the provided measure value(s). Polygonal geometry inputs are not supported. The output points lie directly on the input line at the specified measure positions.
@@ -4416,6 +4517,28 @@ Output:
GEOMETRYCOLLECTION(POLYGON((-1 2,2 -1,-1 -1,-1 2)),POLYGON((-1 2,2 2,2 -1,-1 2)))
```

## ST_WeightedDistanceBandColumn

Introduction: Returns a `weights` column containing every record in a dataframe within a specified `threshold` distance.

The `weights` column is an array of structs containing the `attributes` from each neighbor and that neighbor's weight. Since this is a distance-weighted band, each neighbor's weight is `distance^alpha`.

Format: `ST_WeightedDistanceBandColumn(geometry: Geometry, threshold: Double, alpha: Double, includeZeroDistanceNeighbors: Boolean, includeSelf: Boolean, selfWeight: Double, useSpheroid: Boolean, attributes: Struct)`

Since: `v1.7.1`

SQL Example

```sql
ST_WeightedDistanceBandColumn(geometry, 1.0, -1.0, true, true, 1.0, false, struct(id, geometry))
```

Output:

```
[{{15, POINT (3 1.9)}, 1.0}, {{16, POINT (3 2)}, 9.999999999999991}, {{17, POINT (3 2.1)}, 4.999999999999996}, {{18, POINT (3 2.2)}, 3.3333333333333304}]
```
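
The weights in the output above follow from `distance^alpha` with `alpha = -1.0`: the record at `POINT (3 1.9)` is the focal record (weight `selfWeight = 1.0`), and its neighbors at distances of about 0.1, 0.2, and 0.3 receive weights of about 10, 5, and 3.33. The pure-Python sketch below reproduces that arithmetic. It is not Sedona's implementation: the helper name `weighted_distance_band`, the `(id, x, y)` layout, planar distance, and the inclusive threshold are assumptions for illustration.

```python
import math


def weighted_distance_band(points, threshold, alpha=-1.0,
                           include_zero_distance_neighbors=True,
                           include_self=False, self_weight=1.0):
    """Return {id: [(neighbor_id, weight), ...]} for (id, x, y) tuples.

    Hypothetical helper: each neighbor's weight is distance ** alpha.
    """
    bands = {}
    for i, (id_a, xa, ya) in enumerate(points):
        neighbors = []
        for j, (id_b, xb, yb) in enumerate(points):
            if i == j:
                # The record itself gets self_weight, if it is listed at all.
                if include_self:
                    neighbors.append((id_b, self_weight))
                continue
            dist = math.hypot(xa - xb, ya - yb)
            if dist > threshold:
                continue
            if dist == 0.0:
                if not include_zero_distance_neighbors:
                    continue
                # Python raises on 0.0 ** negative, so use infinity explicitly.
                weight = math.inf if alpha < 0 else 0.0 ** alpha
            else:
                weight = dist ** alpha
            neighbors.append((id_b, weight))
        bands[id_a] = neighbors
    return bands


points = [(15, 3.0, 1.9), (16, 3.0, 2.0), (17, 3.0, 2.1), (18, 3.0, 2.2)]
bands = weighted_distance_band(points, 1.0, alpha=-1.0,
                               include_self=True, self_weight=1.0)
```

Here `bands[15]` holds weights close to `1.0`, `10`, `5`, and `3.33`. The small deviations from round numbers in the documented output come from floating-point subtraction of the coordinates.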

## ST_X

Introduction: Returns the X coordinate of the given Point, or null otherwise.
183 changes: 183 additions & 0 deletions python/sedona/sql/st_functions.py
@@ -20,6 +20,7 @@
from typing import Optional, Union

from pyspark.sql import Column
from pyspark.sql.functions import lit

from sedona.sql.dataframe_api import (
ColumnOrName,
@@ -2462,6 +2463,188 @@ def ST_InterpolatePoint(geom1: ColumnOrName, geom2: ColumnOrName) -> Column:
return _call_st_function("ST_InterpolatePoint", args)


@validate_argument_types
def ST_DBSCAN(
    geometry: ColumnOrName,
    epsilon: Union[ColumnOrName, float],
    min_pts: Union[ColumnOrName, int],
    use_spheroid: Optional[Union[ColumnOrName, bool]] = False,
) -> Column:
    """Perform DBSCAN clustering on the given geometry column.

    @param geometry: Geometry column or name
    :type geometry: ColumnOrName
    @param epsilon: the distance between two points to be considered neighbors
    :type epsilon: ColumnOrName
    @param min_pts: the number of neighbors a point should have to form a cluster
    :type min_pts: ColumnOrName
    @param use_spheroid: whether to use spheroid for distance calculation
    :type use_spheroid: ColumnOrName
    @return: A struct indicating the cluster to which the point belongs and whether it is a core point
    """

    if isinstance(epsilon, float):
        epsilon = lit(epsilon)

    if isinstance(min_pts, int):
        min_pts = lit(min_pts)

    if isinstance(use_spheroid, bool):
        use_spheroid = lit(use_spheroid)

    return _call_st_function("ST_DBSCAN", (geometry, epsilon, min_pts, use_spheroid))


@validate_argument_types
def ST_LocalOutlierFactor(
    geometry: ColumnOrName,
    k: Union[ColumnOrName, int],
    use_spheroid: Optional[Union[ColumnOrName, bool]] = False,
) -> Column:
    """Calculate the local outlier factor on the given geometry column.

    @param geometry: Geometry column or name
    :type geometry: ColumnOrName
    @param k: the number of neighbors to use for LOF calculation
    :type k: ColumnOrName
    @param use_spheroid: whether to use spheroid for distance calculation
    :type use_spheroid: ColumnOrName
    @return: A Double indicating the local outlier factor of the point
    """

    if isinstance(k, int):
        k = lit(k)

    if isinstance(use_spheroid, bool):
        use_spheroid = lit(use_spheroid)

    return _call_st_function("ST_LocalOutlierFactor", (geometry, k, use_spheroid))


@validate_argument_types
def ST_GLocal(
    x: ColumnOrName,
    weights: ColumnOrName,
    star: Optional[Union[ColumnOrName, bool]] = False,
) -> Column:
    """Calculate Getis Ord Gi(*) statistics on the given column.

    @param x: The variable we want to compute Gi statistics for
    :type x: ColumnOrName
    @param weights: the weights array containing the neighbors, their weights, and their values of x
    :type weights: ColumnOrName
    @param star: whether to use the focal observation in the calculations
    :type star: ColumnOrName
    @return: A struct containing the Gi statistics including a p value
    """

    if isinstance(star, bool):
        star = lit(star)

    return _call_st_function("ST_GLocal", (x, weights, star))


@validate_argument_types
def ST_BinaryDistanceBandColumn(
    geometry: ColumnOrName,
    threshold: ColumnOrName,
    include_zero_distance_neighbors: Union[ColumnOrName, bool] = True,
    include_self: Union[ColumnOrName, bool] = False,
    use_spheroid: Union[ColumnOrName, bool] = False,
    attributes: ColumnOrName = None,
) -> Column:
    """Creates a weights column containing the other records within the threshold and their weight.

    Weights will always be 1.0.

    @param geometry: name of the geometry column
    @param threshold: Distance threshold for considering neighbors
    @param include_zero_distance_neighbors: whether to include neighbors that are 0 distance.
    @param include_self: whether to include self in the list of neighbors
    @param use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
    @param attributes: the attributes to save in the neighbor column.
    """
    if isinstance(include_zero_distance_neighbors, bool):
        include_zero_distance_neighbors = lit(include_zero_distance_neighbors)

    if isinstance(include_self, bool):
        include_self = lit(include_self)

    if isinstance(use_spheroid, bool):
        use_spheroid = lit(use_spheroid)

    return _call_st_function(
        "ST_BinaryDistanceBandColumn",
        (
            geometry,
            threshold,
            include_zero_distance_neighbors,
            include_self,
            use_spheroid,
            attributes,
        ),
    )


@validate_argument_types
def ST_WeightedDistanceBandColumn(
    geometry: ColumnOrName,
    threshold: ColumnOrName,
    alpha: Union[ColumnOrName, float],
    include_zero_distance_neighbors: Union[ColumnOrName, bool] = True,
    include_self: Union[ColumnOrName, bool] = False,
    self_weight: Union[ColumnOrName, float] = 1.0,
    use_spheroid: Union[ColumnOrName, bool] = False,
    attributes: ColumnOrName = None,
) -> Column:
    """Creates a weights column containing the other records within the threshold and their weight.

    Weights will be distance^alpha.

    @param geometry: name of the geometry column
    @param threshold: Distance threshold for considering neighbors
    @param alpha: alpha to use for inverse distance weights. Computation is dist^alpha. Required; there is no default
    @param include_zero_distance_neighbors: whether to include neighbors that are 0 distance. If 0 distance neighbors are
        included, values are infinity as per the floating point spec (divide by 0)
    @param include_self: whether to include self in the list of neighbors
    @param self_weight: the value to use for the self weight. Default is 1.0
    @param use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
    @param attributes: the attributes to save in the neighbor column.
    """
    if isinstance(alpha, float):
        alpha = lit(alpha)

    if isinstance(include_zero_distance_neighbors, bool):
        include_zero_distance_neighbors = lit(include_zero_distance_neighbors)

    if isinstance(include_self, bool):
        include_self = lit(include_self)

    if isinstance(self_weight, float):
        self_weight = lit(self_weight)

    if isinstance(use_spheroid, bool):
        use_spheroid = lit(use_spheroid)

    return _call_st_function(
        "ST_WeightedDistanceBandColumn",
        (
            geometry,
            threshold,
            alpha,
            include_zero_distance_neighbors,
            include_self,
            self_weight,
            use_spheroid,
            attributes,
        ),
    )


# Automatically populate __all__
__all__ = [
name
6 changes: 6 additions & 0 deletions python/sedona/stats/clustering/dbscan.py
@@ -34,6 +34,8 @@ def dbscan(
geometry: Optional[str] = None,
include_outliers: bool = True,
use_spheroid=False,
is_core_column_name="isCore",
cluster_column_name="cluster",
):
"""Annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.

@@ -49,6 +51,8 @@ def dbscan(
include_outliers: whether to return outlier points. If True, outliers are returned with a cluster value of -1.
Default is True
use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
is_core_column_name: what the name of the column indicating if this is a core point should be. Default is "isCore"
cluster_column_name: what the name of the column indicating the cluster id should be. Default is "cluster"

Returns:
A PySpark DataFrame containing the cluster label for each row
@@ -62,6 +66,8 @@ def dbscan(
geometry,
include_outliers,
use_spheroid,
is_core_column_name,
cluster_column_name,
)

return DataFrame(result_df, sedona)