[DOCS] add note about setting checkpoint dir for DBSCAN #1744

Open: wants to merge 1 commit into base: master


james-willis
Contributor

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • No:
    • this is a documentation update. The PR name follows the format [DOCS] my subject

What changes were proposed in this PR?

Added a note to the DBSCAN docs about setting the checkpoint dir

How was this patch tested?

pre-commit

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@james-willis james-willis requested a review from jiayuasu as a code owner January 7, 2025 03:33
@github-actions github-actions bot added the docs label Jan 7, 2025
@james-willis james-willis force-pushed the dbscan-checkpoint-note branch from 09c22bf to 088c35b Compare January 7, 2025 03:34
@@ -858,6 +858,10 @@ The algorithm is available as a Scala and Python function called on a spatial da

The first parameter is the dataframe, the next two are the epsilon and min_points parameters of the DBSCAN algorithm.

!!!Note
The sparkContext's checkpoint directory must be set to use DBSCAN. Sedona's DBSCAN implementation uses Graphframes
which requires a checkpoint directory to be set. This can be done by calling `sparkContext.setCheckpointDir("path/to/checkpoint")`.
Member

Can you provide a reference link about the checkPointDir? In addition, given that we have been using sedona (which is sparkSession), please provide a bit more code to illustrate how to get SparkContext (e.g., sedona.sc...)

Contributor Author

I can revise it to use sedona.sparkContext. I didn't think the Spark docs were very helpful tbh: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setCheckpointDir.html

Member

@james-willis then we need to explain what a checkPointDir is via our doc. We should give examples about how to set this dir (locally, on S3, HDFS, ...). Distributed DBSCAN is highly anticipated by the community so we should make it easy to get started.
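
As a rough sketch of the kind of example requested here, the snippet below sets the checkpoint directory through the session's `sparkContext`. It assumes `sedona` is an existing SparkSession, as elsewhere in the Sedona docs. The helper name `configure_checkpoint_dir`, the `CHECKPOINT_DIRS` mapping, the bucket name, and all paths are hypothetical placeholders and are not part of Sedona or Spark; only `SparkContext.setCheckpointDir` is a real Spark call.

```python
# Hypothetical example locations; replace them with directories that exist
# and are writable in your own environment.
CHECKPOINT_DIRS = {
    "local": "file:///tmp/sedona-checkpoints",     # local mode only
    "hdfs": "hdfs:///user/sedona/checkpoints",     # HDFS on a cluster
    "s3": "s3a://my-bucket/sedona/checkpoints",    # S3 via the Hadoop S3A connector
}


def configure_checkpoint_dir(sedona, path):
    """Set the checkpoint directory on the SparkContext behind a session.

    `sedona` is assumed to be a SparkSession (or anything exposing a
    `sparkContext` attribute). Returns the path that was set.
    """
    # SparkSession.sparkContext gives access to the underlying SparkContext,
    # and setCheckpointDir tells Spark where to write checkpoint data.
    sedona.sparkContext.setCheckpointDir(path)
    return path
```

With a real session this might be called as `configure_checkpoint_dir(sedona, CHECKPOINT_DIRS["hdfs"])` before running DBSCAN. The Spark documentation states that the directory must be an HDFS path when running on a cluster, so the `file://` variant is only suitable for local mode; the `s3a://` variant additionally assumes the Hadoop S3A connector is configured.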

Contributor Author

Set time with Matthew for tomorrow to pair on this.

@jiayuasu jiayuasu changed the title add note about setting checkpoint dir for DBSCAN [DOCS] add note about setting checkpoint dir for DBSCAN Jan 7, 2025
@james-willis
Contributor Author

We need to add a note to the tutorial page as well.
