
Spark-Enabled Cost-Distance #1999

Merged Mar 31, 2017 (28 commits)

Conversation

jamesmcclain
Member

@jamesmcclain jamesmcclain commented Feb 2, 2017

  • Simplified cost-distance implementation
  • Implementation of distributed cost-distance algorithm
  • Unit Tests
  • Comments

Fixes #1980

(Four screenshots attached: two from Feb 6, 2017 and two from Feb 14, 2017.)

@jamesmcclain jamesmcclain force-pushed the feature/cost-distance branch from 5f51e4b to afd4dd0 Compare February 2, 2017 14:53
@lossyrob
Member

lossyrob commented Feb 2, 2017

party like it's PR 1999

@jamesmcclain
Member Author

🎉 🎉 🎉 🎉 🎉 🎉 🎉

*
* 2. https://github.com/ngageoint/mrgeo/blob/0c6ed4a7e66bb0923ec5c570b102862aee9e885e/mrgeo-mapalgebra/mrgeo-mapalgebra-costdistance/src/main/scala/org/mrgeo/mapalgebra/CostDistanceMapOp.scala
*/
object MrGeoCostDistance {
Member

I think the reference in the doc string pointing back to MrGeo is sufficient; the type names should probably just be CostDistance.

(col, row, friction, 0.0)
})
accumulator.add((key, costs))
})
Member

I see the loop version does a count to force the execution and cache the RDD. Is it possible to avoid this foreach with a similar method, where this accumulator is moved to the map, that RDD persisted, and the RDD counted to force execution? Perhaps not the best optimization, but your opinion on this would probably help me understand the logic better.

Member Author

The foreach is strictly for the side-effects (to set the accumulator).

Member

Gotcha, but it causes a double iteration of the RDD. The accumulator could be set in the map, and then the map executed with caching, so that the RDD transformation is cached and iterated over once, and the accumulator value is set.

Member Author

Oh right

@jamesmcclain jamesmcclain changed the title [WiP] Spark-Enabled Cost-Distance Spark-Enabled Cost-Distance Feb 9, 2017
@jamesmcclain
Member Author

All comments addressed.

@lossyrob lossyrob left a comment

Looking good; the last thing is the performance question about the single-threaded cost-distance changes.

val cost2 = calcCost(c1, r1, dir(c, r), cost)
if (cost2.isDefined) {
curMinCost = math.min(curMinCost, source + cost1.get + cost2.get)
def compute(
Member

Have you benchmarked the single-threaded case for this change? It would be good to get numbers to prove this method works faster/just as fast.

Member Author
@jamesmcclain jamesmcclain Feb 21, 2017

I have not run any benchmarks, but (informally) I did not detect any noticeable speed difference.

That having been said, it is possible (even likely) that there is a difference, because the changes that I made were very self-consciously deoptimizations.

That was necessary because the previous algorithm avoided putting elements into the priority queue whenever it could. That is laudable in the single-threaded case, but in order to maintain coherence in the case where points need to be transferred from some adjacent tile to the present one, I think that it makes sense to have one and only one place for points to enter (to ensure that points coming in from adjacent tiles and points that were already in the tile are treated exactly the same).

After getting the basics working, I then took another pass and re-added some mild optimizations. Generally speaking, I am pretty comfortable with where the single-threaded version is.
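The "one and only one entry point" design described above can be sketched as a Dijkstra-style loop over a small friction grid: every candidate cost, whether seeded locally or arriving from a neighbor, is pushed through the same priority queue, and entries that are stale by the time they are popped are discarded. This is a minimal illustration, not the GeoTrellis implementation; the 4-neighbor set and the averaged-friction cost model are assumptions for the sketch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class GridCostSketch {
    // Computes per-cell travel cost on a friction grid. Every update enters
    // through the queue (the single entry point); entries superseded by a
    // cheaper path before being popped are skipped.
    static double[][] costDistance(double[][] friction, int[][] sources) {
        int rows = friction.length, cols = friction[0].length;
        double[][] cost = new double[rows][cols];
        for (double[] r : cost) Arrays.fill(r, Double.POSITIVE_INFINITY);

        // Entry layout: {row, col, accumulatedCost}, ordered by cost.
        PriorityQueue<double[]> q =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e[2]));
        for (int[] s : sources) q.add(new double[]{s[0], s[1], 0.0});

        int[][] dirs = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};
        while (!q.isEmpty()) {
            double[] e = q.poll();
            int r = (int) e[0], c = (int) e[1];
            if (e[2] >= cost[r][c]) continue; // stale entry: a cheaper path won
            cost[r][c] = e[2];
            for (int[] d : dirs) {
                int nr = r + d[0], nc = c + d[1];
                if (0 <= nr && nr < rows && 0 <= nc && nc < cols) {
                    // Simple cost model: average friction of the two cells.
                    double step = (friction[r][c] + friction[nr][nc]) / 2.0;
                    if (e[2] + step < cost[nr][nc])
                        q.add(new double[]{nr, nc, e[2] + step});
                }
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        double[][] friction = {{1, 1, 1}, {1, 9, 1}, {1, 1, 1}};
        double[][] cost = costDistance(friction, new int[][]{{0, 0}});
        System.out.println(cost[0][2]); // 2.0
    }
}
```

Points arriving from adjacent tiles would simply be additional `q.add` calls with their carried costs, which is why the single entry point keeps them coherent with local points.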

*/
def apply[K: (? => SpatialKey), V: (? => Tile)](
friction: RDD[(K, V)] with Metadata[TileLayerMetadata[K]],
points: Seq[Point],
Member

Discussion point, not a change request: What would it take to create an option to pass in an RDD of points? I'm curious if there are use cases for this, and if it would be possible with the current algorithm.

Member Author
@jamesmcclain jamesmcclain Feb 21, 2017

I think that there would be some mechanical changes needed in order to support an RDD of points; probably the friction layer and the points would need to be joined (either logically or literally) as the first step.

As a practical matter, I do not think that supporting RDDs of points would add any real capability. If someone really did have so many points that they do not fit into memory on one machine, and some appreciable percentage of those points have unique projections into the raster, then it is likely that the raster layer is so large (in terms of pixels) that this algorithm would run out of driver memory before even reaching the second iteration. (Driver memory requirements are a function of the total layer size; this is unavoidable because propagation across tile boundaries must be coordinated by the driver.)

On the other hand, if someone has a source RDD whose points can be clustered (or otherwise reduced) into a fairly small list, then that would be a viable option (but that would be unrelated to this work).

@pomadchin pomadchin left a comment

I'm wondering about spacing in atomic expressions like 1+1; should it be fixed to 1 + 1? (code style question)

*/
def generateEmptyQueue(cols: Int, rows: Int): Q = {
new PriorityQueue(
(cols*16 + rows*16), new java.util.Comparator[Cost] {
Member

Unnecessary parentheses.
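For context, the construct under review is a `java.util.PriorityQueue` built with an initial capacity and a comparator. A minimal sketch of the same idea, assuming a cost entry encoded as `{col, row, friction, accumulatedCost}` (the encoding is an assumption for illustration, not the PR's `Cost` type):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class CostQueueSketch {
    // Initial capacity proportional to the tile perimeter, as in the PR;
    // the comparator orders entries by accumulated cost (index 3), so the
    // head of the queue is always the cheapest pending entry.
    static PriorityQueue<double[]> generateEmptyQueue(int cols, int rows) {
        return new PriorityQueue<>(cols * 16 + rows * 16,
                Comparator.comparingDouble(entry -> entry[3]));
    }

    public static void main(String[] args) {
        PriorityQueue<double[]> q = generateEmptyQueue(256, 256);
        q.add(new double[]{10, 12, 1.5, 7.0});
        q.add(new double[]{3, 4, 0.5, 2.0});
        q.add(new double[]{8, 9, 2.0, 5.0});
        System.out.println(q.poll()[3]); // 2.0 (smallest accumulated cost)
    }
}
```

The capacity argument is only a sizing hint; `PriorityQueue` grows as needed, so the exact constant mainly affects initial allocation.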

* "Propagating radial waves of travel cost in a grid."
* International Journal of Geographical Information Science 24.9 (2010): 1391-1413.
*
* @param friction Friction tile; pixels are interpreted as "second per meter"
Member

@param frictionTile

val costTile = generateEmptyCostTile(cols, rows)
val q: Q = generateEmptyQueue(cols, rows)

points.foreach({ case (col, row) =>
Member

cfor here would be a bit faster (according to benchmarks on aspect-tif.tif):

cfor(0)(_ < points.length, _ + 1) { i =>
  val (col, row) = points(i)
  q.add((col, row, frictionTile.getDouble(col, row), 0.0))
}

require(frictionTile.dimensions == costTile.dimensions)

def inTile(col: Int, row: Int): Boolean =
((0 <= col && col < cols) && (0 <= row && row < rows))
Member

Unnecessary parentheses here too, and in methods below.

* @param col The column of the given location
* @param row The row of the given location
* @param friction1 The instantaneous cost (friction) at the neighboring location
* @param cost The length of the best-known path from a source to the neighboring location
Member

@param neighborCost


costs.count
previous.unpersist()
} while (accumulator.value.size > 0)
Member

accumulator.value.nonEmpty

val resolution = computeResolution(friction)
logger.debug(s"Computed resolution: $resolution meters/pixel")

val bounds = friction.metadata.bounds.asInstanceOf[KeyBounds[K]]
Member
@pomadchin pomadchin Mar 22, 2017

Eh, dirty cast here; you can pattern match on Bounds[K] and throw an exception (for example) on EmptyBounds.

logger.debug(s"Computed resolution: $resolution meters/pixel")

val bounds = friction.metadata.bounds.asInstanceOf[KeyBounds[K]]
val minKey = implicitly[SpatialKey](bounds.minKey)
Member
@pomadchin pomadchin Mar 22, 2017

Instead of the direct implicitly call, it is possible to write something like:

val KeyBounds(minKey: SpatialKey, maxKey: SpatialKey) = bounds
val (minKeyCol, minKeyRow) = minKey

Everywhere below can be changed similarly, as there are lots of direct implicitly calls and tuple._1/._2 usages. I'll omit further comments about the same thing.


// Construct return value and return it
val metadata = TileLayerMetadata(DoubleCellType, md.layout, md.extent, md.crs, md.bounds)
val rdd = costs.map({ case (k, _, cost) => (k, cost.asInstanceOf[Tile]) })
Member

It's possible to just upcast DoubleArrayTile to Tile:

(k, cost: Tile)

val mt = md.mapTransform
val kv = friction.first
val key = implicitly[SpatialKey](kv._1)
val tile = implicitly[Tile](kv._2)
Member

To avoid the direct implicitly call, it's possible to provide an explicit type:

val (key: SpatialKey, tile: Tile) = kv

@pomadchin pomadchin assigned jamesmcclain and unassigned pomadchin Mar 22, 2017
@pomadchin
Member

Benchmarks. Let me know if that's not enough and I need to test more, but I think it is already obvious that it's significantly slower.

@jamesmcclain
Member Author

No, I'll just pull the original one out of version control and use that for the single-tile case.

@lossyrob lossyrob dismissed their stale review March 24, 2017 20:10

Changes addressed in grisha's review

@jamesmcclain jamesmcclain force-pushed the feature/cost-distance branch 4 times, most recently from 2e2307c to 91777a6 Compare March 27, 2017 13:19
James McClain added 5 commits March 27, 2017 09:23
Signed-off-by: James McClain <[email protected]>
Increase size by one binary order of magnitude.  Still proportional to
the square root of the area of the typical expected tile.
James McClain added 18 commits March 27, 2017 09:23
Force cost values up as quickly as possible to allow paths to be pruned.
This code implements the first iteration of the n-iteration distributed
cost-distance algorithm.
05 Feb 19:44:24 INFO [cdistance.CostDistance$] - MILLIS: 155272
05 Feb 19:48:29 INFO [cdistance.CostDistance$] - MILLIS: 153500
Naming this accumulator adds a lot of clutter to the Spark UI.
Previously, only points were accepted.
@jamesmcclain jamesmcclain force-pushed the feature/cost-distance branch from 91777a6 to b7b3580 Compare March 27, 2017 13:23
@jamesmcclain jamesmcclain force-pushed the feature/cost-distance branch from b7b3580 to 30fa62f Compare March 27, 2017 13:24
val costTile = generateEmptyCostTile(cols, rows)
val q: Q = generateEmptyQueue(cols, rows)

var i = 0; while (i < points.length) {
Member
@pomadchin pomadchin Mar 27, 2017

Why not just use cfor? It's a common dependency for the whole raster package. A bit less ugly ^^'.

val keys = mutable.ArrayBuffer.empty[SpatialKey]
val bounds = md.layout.mapTransform(g.envelope)

var row = bounds.rowMin; while (row <= bounds.rowMax) {
Member
@pomadchin pomadchin Mar 27, 2017

Here cfor can be used too

other
}
def add(pair: KeyCostPair): Unit = {
this.synchronized { list.append(pair) }
Member
@pomadchin pomadchin Mar 30, 2017

I totally forgot about it, but @echeipesh noticed these locks and reminded me about it. @jamesmcclain, have you considered using a java.util.concurrent collection here, instead of synchronized everywhere?

Member Author
@jamesmcclain jamesmcclain Mar 31, 2017

The most appropriate structure that I found was CopyOnWriteArrayList. The documentation for that class says the following:

This is ordinarily too costly, but may be more efficient than alternatives when traversal operations vastly outnumber mutations, and is useful when you cannot or don't want to synchronize traversals, yet need to preclude interference among concurrent threads. The "snapshot" style iterator method uses a reference to the state of the array at the point that the iterator was created. This array never changes during the lifetime of the iterator, so interference is impossible and the iterator is guaranteed not to throw ConcurrentModificationException. The iterator will not reflect additions, removals, or changes to the list since the iterator was created. Element-changing operations on iterators themselves (remove, set, and add) are not supported. These methods throw UnsupportedOperationException.

(I am pointing in particular to the first sentence.)

I do not think that a synchronized concurrent data structure is warranted here since there is only one place for new data to go (the end of the list).
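The tradeoff discussed above can be sketched in plain Java: a lock-guarded append-only list versus a CopyOnWriteArrayList under the same two-writer workload. This is an illustrative sketch, not code from the PR; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class AppendSketch {
    // Append-only buffer guarded by a lock, mirroring the synchronized add
    // in the PR: each append is amortized O(1).
    private final List<String> list = new ArrayList<>();
    synchronized void add(String pair) { list.add(pair); }
    synchronized int size() { return list.size(); }

    public static void main(String[] args) throws InterruptedException {
        AppendSketch acc = new AppendSketch();
        // CopyOnWriteArrayList copies its backing array on every add, making
        // N appends O(N^2) overall: good when reads vastly outnumber writes,
        // a poor fit for an append-heavy accumulator like this one.
        List<String> cow = new CopyOnWriteArrayList<>();

        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) { acc.add("pair"); cow.add("pair"); }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        a.join(); b.join();

        // Both are thread-safe; they differ in cost profile, not correctness.
        System.out.println(acc.size() + " " + cow.size()); // 2000 2000
    }
}
```

This supports the point above: with a single, append-only mutation site, a simple synchronized list is appropriate, and CopyOnWriteArrayList only pays off when traversals vastly outnumber mutations.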

@echeipesh echeipesh modified the milestones: 1.1, 1.2 Mar 31, 2017
@echeipesh echeipesh merged commit d3943c8 into locationtech:master Mar 31, 2017
@jamesmcclain jamesmcclain deleted the feature/cost-distance branch March 31, 2017 19:33