[DOCS] add note about setting checkpoint dir for DBSCAN #1744

Closed
4 changes: 4 additions & 0 deletions docs/tutorial/sql.md
@@ -858,6 +858,10 @@ The algorithm is available as a Scala and Python function called on a spatial da

The first parameter is the dataframe, the next two are the epsilon and min_points parameters of the DBSCAN algorithm.

!!!Note
The sparkContext's checkpoint directory must be set to use DBSCAN. Sedona's DBSCAN implementation uses Graphframes
which requires a checkpoint directory to be set. This can be done by calling `sparkContext.setCheckpointDir("path/to/checkpoint")`.
Member

Can you provide a reference link for the checkpoint directory? In addition, since our docs use `sedona` (which is a SparkSession), please add a bit more code to illustrate how to get the SparkContext from it (e.g., sedona.sc...)
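
For illustration, a minimal sketch of reaching the SparkContext through the session object. It assumes `sedona` is a SparkSession, as in the rest of the tutorial, and relies only on the `sparkContext` attribute of the session and on `SparkContext.setCheckpointDir`. The helper name `set_checkpoint_dir` is hypothetical and is not part of Sedona or Spark.

```python
def set_checkpoint_dir(sedona, checkpoint_dir):
    """Set the checkpoint directory on the SparkContext behind a SparkSession.

    `sedona` is assumed to be a SparkSession (hypothetical variable name,
    matching the tutorial). `checkpoint_dir` is any path Spark can write to.
    """
    # A SparkSession exposes its underlying SparkContext as `.sparkContext`.
    spark_context = sedona.sparkContext
    # setCheckpointDir is the call the note in the diff refers to.
    spark_context.setCheckpointDir(checkpoint_dir)
    return checkpoint_dir
```

With a live session this would be called as `set_checkpoint_dir(sedona, "path/to/checkpoint")` before running DBSCAN.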

Contributor Author

I can revise this to use sedona.sparkContext. I didn't think the Spark docs were very helpful, to be honest: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setCheckpointDir.html

Member

@james-willis then we need to explain what a checkpoint directory is in our own docs. We should give examples of how to set this directory (locally, on S3, HDFS, ...). Distributed DBSCAN is highly anticipated by the community, so we should make it easy to get started.
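
A rough sketch of what such examples could look like, one location per storage type. Every path, bucket name, and host below is a placeholder, and `checkpoint_dir_for` is a hypothetical helper, not a Sedona or Spark function. The `s3a://` form assumes the Hadoop S3A connector is configured for the cluster.

```python
# Hypothetical example locations; replace the bucket, host, port, and paths
# with values that exist in your environment.
EXAMPLE_CHECKPOINT_DIRS = {
    "local": "/tmp/sedona-checkpoints",                # single-machine runs
    "hdfs": "hdfs://namenode:8020/sedona/checkpoints",  # placeholder namenode
    "s3": "s3a://my-bucket/sedona/checkpoints",         # placeholder bucket
}


def checkpoint_dir_for(storage):
    """Return the example checkpoint location for a storage type."""
    try:
        return EXAMPLE_CHECKPOINT_DIRS[storage]
    except KeyError:
        known = ", ".join(sorted(EXAMPLE_CHECKPOINT_DIRS))
        raise ValueError(f"unknown storage {storage!r}; expected one of: {known}")


# Usage with a live session (assumes `sedona` is a SparkSession):
#   sedona.sparkContext.setCheckpointDir(checkpoint_dir_for("s3"))
```

A local path is generally only suitable when Spark runs on a single machine; on a cluster the directory should be on storage that every executor can reach.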

Contributor Author

Set time with Matthew for tomorrow to pair on this.


=== "Scala"

```scala