Unexpected Behavior When Reading GeoTiffs From S3 #2307

Closed
jamesmcclain opened this issue Aug 4, 2017 · 2 comments

jamesmcclain commented Aug 4, 2017

I encountered this in GeoPySpark, but the error was generated from within GeoTrellis. (Note that this is not a duplicate of #2308.)

When trying to read a GeoTiff from S3, an error (reproduced in full below) is raised: requirement failed: Either PARTITION_COUNT or PARTITION_SIZE option may be set.

Typing something like this into a GeoNotebook session

uri = 's3://datahub-rawdata-us-east-1/cdl/CDLS_2016_30m.tif'
raster_layer = gps.geotiff.get(uri=uri, layer_type=gps.LayerType.SPATIAL, max_tile_size=512, num_partitions=32)
raster_layer.count()

yields an error like this

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-11-d19aaf6b32a2> in <module>()
----> 1 raster_layer.count()

/home/hadoop/.local/lib/python3.4/site-packages/geopyspark/geotrellis/layer.py in count(self)
    197         """
    198 
--> 199         return self.srdd.rdd().count()
    200 
    201 

/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o34.count.
: java.lang.IllegalArgumentException: requirement failed: Either PARTITION_COUNT or PARTITION_SIZE option may be set
	at scala.Predef$.require(Predef.scala:224)
	at geotrellis.spark.io.s3.S3InputFormat.getSplits(S3InputFormat.scala:81)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:91)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.dependencies(RDD.scala:237)
	at org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:274)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:274)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:273)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1551)
	at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1525)
	at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1691)
	at org.apache.spark.rdd.DefaultPartitionCoalescer.currPrefLocs(CoalescedRDD.scala:178)
	at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:196)
	at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:195)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.getAllPrefLocs(CoalescedRDD.scala:195)
	at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.<init>(CoalescedRDD.scala:188)
	at org.apache.spark.rdd.DefaultPartitionCoalescer.coalesce(CoalescedRDD.scala:391)
	at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:91)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
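
For reference, the same code path should be reachable directly from Scala, without going through GeoPySpark. This is only a sketch, assuming the 1.x S3GeoTiffRDD API (the Options field names and the spatial signature are from memory and untested):

// Sketch only: equivalent reproduction directly against GeoTrellis, assuming
// the 1.x S3GeoTiffRDD API (field names and signature from memory, untested).
import geotrellis.spark.io.s3.S3GeoTiffRDD
import org.apache.spark.SparkContext

def reproduce(implicit sc: SparkContext): Long = {
  val options = S3GeoTiffRDD.Options(
    maxTileSize   = Some(512), // corresponds to max_tile_size above
    numPartitions = Some(32)   // corresponds to num_partitions above
  )
  // With both options set, count() should fail with the same
  // "Either PARTITION_COUNT or PARTITION_SIZE option may be set" requirement error.
  S3GeoTiffRDD.spatial("datahub-rawdata-us-east-1", "cdl/CDLS_2016_30m.tif", options).count()
}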

lossyrob commented Aug 6, 2017

Broken by this change: 18c3176#diff-faf1173845ec6fd8a8d1fa4e6c447002R69

When a numPartitions option is given, the defaulted partition bytes are set alongside it, so both values are non-null and this condition fails:

require(null == partitionCountConf || null == partitionSizeConf,
  "Either PARTITION_COUNT or PARTITION_SIZE option may be set")

Why did we default the partition bytes?
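
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual GeoTrellis code) of what that guard sees once both options end up in the job configuration; the values are illustrative only:

// Minimal, self-contained sketch of the guard in S3InputFormat.getSplits.
// The values below are illustrative: the count comes from num_partitions,
// the size from the newly-defaulted partitionBytes.
object RequireGuardSketch extends App {
  val partitionCountConf: String = "32"        // set because numPartitions was given
  val partitionSizeConf: String  = "134217728" // set by the partitionBytes default

  // With both values non-null, this throws
  // java.lang.IllegalArgumentException: requirement failed: Either PARTITION_COUNT
  // or PARTITION_SIZE option may be set -- exactly the error in the traceback above.
  require(null == partitionCountConf || null == partitionSizeConf,
    "Either PARTITION_COUNT or PARTITION_SIZE option may be set")
}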

lossyrob added this to the 1.2 milestone Aug 6, 2017

pomadchin commented Aug 6, 2017

To use better repartitioning strategies: see #2296 (comment).

I'll try to reproduce this issue. The relevant code is here:
https://github.com/locationtech/geotrellis/blob/master/s3/src/main/scala/geotrellis/spark/io/s3/S3GeoTiffRDD.scala#L117
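
One possible shape for a fix (a sketch with a hypothetical helper, not an actual patch): only apply the partitionBytes default when no explicit partition count was requested, so the two options can never both reach the Hadoop configuration.

object PartitioningSketch {
  // Prefer an explicit partition count; otherwise fall back to a byte-size
  // default, so PARTITION_COUNT and PARTITION_SIZE are never both configured.
  def resolvePartitioning(
    numPartitions: Option[Int],
    partitionBytes: Option[Long],
    defaultPartitionBytes: Long
  ): (Option[Int], Option[Long]) =
    numPartitions match {
      case Some(n) => (Some(n), None)
      case None    => (None, Some(partitionBytes.getOrElse(defaultPartitionBytes)))
    }
}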
