Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETL Refactor #1553

Merged
merged 22 commits into from
Aug 19, 2016
Merged

ETL Refactor #1553

merged 22 commits into from
Aug 19, 2016

Conversation

pomadchin
Copy link
Member

@pomadchin pomadchin commented Jun 17, 2016

Move ETL to use json configuration

  • classes to read to / from json
  • refactor etl input / output plugins to work with new inputs
  • update documentation
  • deprecate scallop
  • test on a real ingest (chatta-demo ingest / emr)
  • refactor landsat demo to work with this pr
  • refactor chatta demo to work with this pr
  • output into a separate file
  • improve json validation (our exceptions improvements and schema validation(?))
  • @echeipesh comments

@pomadchin
Copy link
Member Author

pomadchin commented Jun 17, 2016

//complete json example//

datasets.json:

[  
   {  
      "name":"test",
      "ingestType":{  
         "format":"geotiff",
         "inputCredentials":"inputCredentials name",
         "output":"hadoop",
         "outputCredentials":"outputCredentials name",
         "input":"hadoop"
      },
      "path":{  
         "input":"input",
         "output":"output"
      },
      "cache":"NONE",
      "ingestOptions":{  
         "breaks":"0:ffffe5ff;0.1:f7fcb9ff;0.2:d9f0a3ff;0.3:addd8eff;0.4:78c679ff;0.5:41ab5dff;0.6:238443ff;0.7:006837ff;1:004529ff",
         "reprojectMethod":"buffered",
         "cellSize":{  
            "width":256.0,
            "height":256.0
         },
         "encoding":"geotiff",
         "tileSize":256,
         "layoutExtent":{  
            "xmin":1.0,
            "ymin":2.0,
            "xmax":3.0,
            "ymax":4.0
         },
         "resolutionThreshold":0.1,
         "pyramid":true,
         "resampleMethod":"nearest-neighbor",
         "keyIndexMethod":{  
            "type":"zorder"
         },
         "layoutScheme":"tms",
         "cellType":"int8",
         "crs":"+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs "
      }
   }
]

credentials.json:

{
  "accumulo": [{
    "name": "name",
    "zookeepers": "zookeepers",
    "instance": "instance",
    "user": "user",
    "password": "password"
  }],
  "cassandra": [{
    "name": "name",
    "allowRemoteDCsForLocalConsistencyLevel": false,
    "localDc": "datacenter1",
    "usedHostsPerRemoteDc": 0,
    "hosts": "hosts",
    "replicationStrategy": "SimpleStrategy",
    "replicationFactor": 1,
    "user": "user",
    "password": "password"
  }],
  "s3": [],
  "hadoop": []
}

@lossyrob
Copy link
Member

Does this allow for additional backends to be supplied via code in client projects?

@pomadchin pomadchin changed the title [WIP] ETL Refactor ETL Refactor Jun 20, 2016
@pomadchin
Copy link
Member Author

should include #1494

@pomadchin pomadchin changed the title ETL Refactor [WIP] ETL Refactor Jun 22, 2016
@RickMohr
Copy link

RickMohr commented Jun 22, 2016

Will there be validation for the input JSON? If I mis-spell resampleMetod by mistake it would be fantastic to get an "Unknown property - resampleMetod" error rather than a silent baffling result.

(There are JSON schema validators for most languages, listed at http://json-schema.org/implementations.html. I don't see one for Scala, but for Java https://github.com/everit-org/json-schema looks active.)

@pomadchin
Copy link
Member Author

@RickMohr a good point, sure, more over our json validation should be improved too. thx for that notice

--layer nlcd-tms --crs EPSG:3857 --pyramid --layoutScheme {tms | floating}
### Ingest tiles from local fs or hdfs into s3 storage command

`datasets.json`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have input dataset and output config to be in separate files? This way you can have multiple sources and only one destination like --output azavea-datahub.json

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echeipesh a very good point

@pomadchin
Copy link
Member Author

@RickMohr have you got any comments? (i decided to use https://github.com/daveclayton/json-schema-validator, looks like it has larger community and supported right now)

@pomadchin pomadchin changed the title [WIP] ETL Refactor ETL Refactor Jul 8, 2016
import spray.json.DefaultJsonProtocol._
import spray.json._

class EtlConf(val credentials: Credentials, val datasets: List[Input], val output: Output) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its a cool idea to be able to have multiple inputs into an ETL job but EtlJob only takes one input, so it doesn't seem like having a list of them here is used?

@echeipesh
Copy link
Contributor

Ok, my biggest comment is on handling of credentials.json. In reality it serves two related purposes, to provide credentials and to provide some back-end specific configuration, that's great. We just need to make it a lookup table that we can call out to from the input.json Currently it calls out the type of the back-end as part of the JSON structure, which is kind of strange. Ideally this is a name -> config table like;

{
  "backend-profiles": [
    {
      "name": "accumulo-gis",
      "type": "Accumulo",
      "zookeepers": "zookeepers",
      "instance": "instance",
      "user": "user",
      "password": "password"
    },
    {
      "name": "cassandra-gis",
      "type": "Cassandra"
      "allowRemoteDCsForLocalConsistencyLevel": false,
      "localDc": "datacenter1",
      "usedHostsPerRemoteDc": 0,
      "hosts": "hosts",
      "replicationStrategy": "SimpleStrategy",
      "replicationFactor": 1,
      "user": "user",
      "password": "password"
    }
  ]
}

Now related problem is that an input needs to be able to refine those configurations by adding some kind of path/table/schema selection and configure an InputPlugin that will actually fetch me some bytes. It can look like:
input.json:

   {
      "name":"nlcd-tms",
      "format":"{geotiff | temporal-geotiff}",
      "backend":{
        "type":"file"
        "profile": "datahub-local",
        "path": "file:///Data/nlcd/tiles",
      },
      "NoData": 0,
      "cache":"NONE",
   }

Here we're going to ask for everything that affects reading the tiles: what format they're in, if we should cache them, if we should force a NoData while while reading them (because nobody cares to set it correctly when writing the files) and most importantly the backend to use. The backend js object will select the credentials/config profile and add some information that identifies a dataset in there.

I think you could get a little fancy by basically merging the profile with the backend block, which would allow you to shift the configuration from the profile into the input section, but lets say that's not justified here.

Then final comment is that the output is actually what determines the index and layout used, so those sections need to be moved there:

{
  "backend": {
      "type": "S3",
      "profile": "datahub-s3"
      "prefix": "/catalog"
  },
  "reprojectMethod":"buffered",
  "pyramid":true,
  "resampleMethod":"nearest-neighbor",
  "keyIndexMethod":{
    "type":"zorder"
  },
  "layoutScheme": { // just map the fields directly to zoomed and floating layout scheme objects
     "type": "zoomed",
     "tileSize": 256,
     "crs":"EPSG:3857"
  },
}

Hopefully by dealing with backend profiles this way you can even generate a case class that you can pass to the input module like AccumuloBackend that will have all the fields set. You could parse them by partially reading the JSON object to read the "type" and then using the correct format to read it fully and shove them into a Map[String, Backend] or something like that.

I like the idea of the CustomBackend you have and that fits pretty nicely into this flow as well.

"instance",
"user",
"password"
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding

"additionalProperties": false

(Explained e.g. at https://spacetelescope.github.io/understanding-json-schema/reference/object.html)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RickMohr probably description field would be good to add as well, thanks!

@RickMohr
Copy link

RickMohr commented Jul 8, 2016

Thanks for adding validation!

@lossyrob
Copy link
Member

@pomadchin can you update? I think we're ready for this to be merged, yeah?

@pomadchin
Copy link
Member Author

@lossyrob yes, sure

@pomadchin
Copy link
Member Author

merged master, though need to tests on any demo, how / if hbase modules work


sealed trait BackendProfile {
val name: String
def `type`: String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can actually be BackendType instead of a string

@echeipesh
Copy link
Contributor

Most of the comments in round two are around making a Backend type hierarchy to clean up all those match statements and upgrading Strings to instances of objects they name.

A really cool thing this allows is getting rid of the params all-together. Plugins could get just instances of Backend and BackendProfile and cast them to the type they must be. This would get rid of all the string parsing. The whole params: Maps[String, String] is an artifact parsing command line arguments that don't really have a deep structure that JSON does. This is a lot of code changes so probably it can be done as round 2 PR into ETL.


import geotrellis.spark.etl.config._

case class EtlJob(conf: EtlConf) extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once backends generate their own params, what is the purpose of this class? We could just pass EtlConf at that point, but does that lose meaning?

@pomadchin
Copy link
Member Author

includes rather important hbase deps fix ("org.codehaus.jackson" % "jackson-core-asl" % "1.8.3" intransitive())

@pomadchin
Copy link
Member Author

pomadchin commented Aug 12, 2016

without last dirty fix:

Exception in thread "main" java.lang.ClassCastException: geotrellis.spark.io.index.ZCurveKeyIndexMethod$ cannot be cast to geotrellis.spark.io.index.KeyIndexMethod
    at geotrellis.spark.etl.config.Output.getKeyIndexMethod(Output.scala:45)
    at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.writer(SpatialAccumuloOutput.scala:13)
    at geotrellis.spark.etl.OutputPlugin$class.apply(OutputPlugin.scala:19)
    at geotrellis.spark.etl.accumulo.SpatialAccumuloOutput.apply(SpatialAccumuloOutput.scala:11)
    at geotrellis.spark.etl.Etl.savePyramid$1(Etl.scala:178)
    at geotrellis.spark.etl.Etl.save(Etl.scala:192)
    at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:47)
    at geotrellis.spark.etl.Etl$$anonfun$ingest$1.apply(Etl.scala:39)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at geotrellis.spark.etl.Etl$.ingest(Etl.scala:39)
    at geotrellis.chatta.ChattaIngest$delayedInit$body.apply(ChattaIngest.scala:15)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at geotrellis.chatta.ChattaIngest$.main(ChattaIngest.scala:12)
    at geotrellis.chatta.ChattaIngest.main(ChattaIngest.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@echeipesh
Copy link
Contributor

💯 Thank you for pushing this for so long, looks really good.

@echeipesh echeipesh merged commit 2077839 into locationtech:master Aug 19, 2016
@lossyrob lossyrob added this to the 1.0 milestone Oct 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants