From 55b50a892cb6d6560a9816dc4192e80e2f24e600 Mon Sep 17 00:00:00 2001
From: Sriram Keerthi Madhava Kunjathur
Date: Fri, 15 Apr 2016 21:14:12 +0530
Subject: [PATCH] Updating README.md

---
 README.md | 79 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index 000bd9d..1eb1025 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,50 @@ Simple MongoRDD to read data from MongoDB into Spark
 ## Build Status
 ![MongoRDD Travis-CI Build Status](https://travis-ci.org/caffinc/MongoRDD.svg?branch=master)
 
+## Usage
+
+MongoRDD is on Bintray and Maven Central:
+
+    <dependency>
+        <groupId>com.caffinc.sparktools</groupId>
+        <artifactId>mongordd</artifactId>
+        <version>1.0.1</version>
+    </dependency>
+
+MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.
+
+Assume the following constants:
+
+    val sc = new SparkContext(conf)
+    val mongoClientUri = "mongodb://localhost:27017"
+    val database = "DBName"
+    val collection = "CollectionName"
+    val query = new Document(...)
+    val partitions = 4
+
+**Usage in Scala:**
+
+    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)
+
+**Usage in Java:**
+
+    new JavaRDD<>(
+        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
+        ClassManifestFactory$.MODULE$.fromClass(Document.class)
+    ).map(...)
+
+[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.
+
+## Tests
+
+There is just one extensive test, which launches an embedded MongoDB instance and writes dummy values into it and tests the MongoRDD on a local Spark instance. The test Works on my Machine™ and Travis-CI (Which is awesome!).
+
+It might not work on your machine for the following reasons:
+
+* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which requires several megabytes of download the first time it runs. This might be slow, and the test might timeout. Comment out the line which makes the `setUp()` fail on slow starts and try it out.
+* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
+* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email (admin@caffinc.com) if you think I can help :)
+
 ## Dependencies
 
 These are not absolute, but are current (probably) as of 3rd March, 2016. It should be trivial to upgrade or downgrade versions as required.
@@ -38,47 +82,12 @@ These are not absolute, but are current (probably) as of 3rd March, 2016. It sho
-
-## Usage
-
-MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.
-
-Usage in Scala:
-
-    val sc = new SparkContext(conf)
-    val mongoClientUri = "mongodb://localhost:27017"
-    val database = "DBName"
-    val collection = "CollectionName"
-    val query = new Document(...)
-    val partitions = 4
-    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)
-
-Usage in Java (Assume constants are the same):
-
-    new JavaRDD<>(
-        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
-        ClassManifestFactory$.MODULE$.fromClass(Document.class)
-    ).map(...)
-
-[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.
-
-## Tests
-
-There is just one extensive test, which launches an embedded MongoDB instance and writes dummy values into it and tests the MongoRDD on a local Spark instance. The test Works on my Machine™ and Travis-CI (Which is awesome!).
-
-It might not work on your machine for the following reasons:
-
-* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which requires several megabytes of download the first time it runs. This might be slow, and the test might timeout. Comment out the line which makes the `setUp()` fail on slow starts and try it out.
-* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
-* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email (admin@caffinc.com) if you think I can help :)
-
 ## Ideas
 
 There are a few things that can be done to extend this:
 
 * Provide other means of connecting to MongoDB
 * Make the RDD generic, and provide an interface to convert BSON documents to other formats before returning (This can be achieved with a simple call to `map()` so it wasn't done)
-* Make it available in Maven Central or Bintray
 * Add more tests
 
 If you can help with one or more of the above, or if you have suggestions of your own, send me an email or raise a PR and I will review it and add it.