Commit
Sriram Keerthi Madhava Kunjathur authored and committed on Apr 15, 2016
1 parent 67604c8, commit 55b50a8
Showing 1 changed file with 44 additions and 35 deletions.
@@ -4,6 +4,50 @@ Simple MongoRDD to read data from MongoDB into Spark
 ## Build Status
 ![MongoRDD Travis-CI Build Status](https://travis-ci.org/caffinc/MongoRDD.svg?branch=master)
 
+## Usage
+
+MongoRDD is on Bintray and Maven Central:
+
+    <dependency>
+        <groupId>com.caffinc.sparktools</groupId>
+        <artifactId>mongordd</artifactId>
+        <version>1.0.1</version>
+    </dependency>
+
+MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.
+
+Assume the following constants:
+
+    val sc = new SparkContext(conf)
+    val mongoClientUri = "mongodb://localhost:27017"
+    val database = "DBName"
+    val collection = "CollectionName"
+    val query = new Document(...)
+    val partitions = 4
+
+**Usage in Scala:**
+
+    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)
+
+**Usage in Java:**
+
+    new JavaRDD<>(
+        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
+        ClassManifestFactory$.MODULE$.fromClass(Document.class)
+    ).map(...)
+
+[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.
+
+## Tests
+
+There is just one extensive test, which launches an embedded MongoDB instance and writes dummy values into it and tests the MongoRDD on a local Spark instance. The test Works on my Machine™ and Travis-CI (Which is awesome!).
+
+It might not work on your machine for the following reasons:
+
+* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which requires several megabytes of download the first time it runs. This might be slow, and the test might timeout. Comment out the line which makes the `setUp()` fail on slow starts and try it out.
+* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
+* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email ([email protected]) if you think I can help :)
+
 ## Dependencies
 
 These are not absolute, but are current (probably) as of 3rd March, 2016. It should be trivial to upgrade or downgrade versions as required.
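Review note on the `mongoClientUri` constant in the hunk above: the value follows MongoDB's connection string format. As a rough illustration of what that string encodes, the sketch below splits the README's example value into scheme, host, and port using only `java.net.URI` from the Java standard library. It does not use the MongoDB driver's own parser, it only handles the single-host form shown in the README, and the class name `ConnectionStringParts` is hypothetical.

```java
import java.net.URI;

public class ConnectionStringParts {

    /**
     * Splits a single-host connection string into {scheme, host, port}.
     * Illustration only: this is not how the MongoDB driver parses the string.
     */
    static String[] parts(String connectionString) {
        URI uri = URI.create(connectionString);
        return new String[] {
            uri.getScheme(),
            uri.getHost(),
            String.valueOf(uri.getPort())
        };
    }

    public static void main(String[] args) {
        // The example value used for mongoClientUri in the README.
        String[] p = parts("mongodb://localhost:27017");
        System.out.println("scheme=" + p[0] + " host=" + p[1] + " port=" + p[2]);
    }
}
```

Anything beyond a single `host:port` pair (credentials, replica sets, options) is better left to the driver, which is what passing the string through to MongoRDD does.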
@@ -38,47 +82,12 @@ These are not absolute, but are current (probably) as of 3rd March, 2016. It sho
         </dependency>
     </dependencies>
 
-
-## Usage
-
-MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.
-
-Usage in Scala:
-
-    val sc = new SparkContext(conf)
-    val mongoClientUri = "mongodb://localhost:27017"
-    val database = "DBName"
-    val collection = "CollectionName"
-    val query = new Document(...)
-    val partitions = 4
-    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)
-
-Usage in Java (Assume constants are the same):
-
-    new JavaRDD<>(
-        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
-        ClassManifestFactory$.MODULE$.fromClass(Document.class)
-    ).map(...)
-
-[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.
-
-## Tests
-
-There is just one extensive test, which launches an embedded MongoDB instance and writes dummy values into it and tests the MongoRDD on a local Spark instance. The test Works on my Machine™ and Travis-CI (Which is awesome!).
-
-It might not work on your machine for the following reasons:
-
-* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which requires several megabytes of download the first time it runs. This might be slow, and the test might timeout. Comment out the line which makes the `setUp()` fail on slow starts and try it out.
-* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
-* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email ([email protected]) if you think I can help :)
-
 ## Ideas
 
 There are a few things that can be done to extend this:
 
 * Provide other means of connecting to MongoDB
 * Make the RDD generic, and provide an interface to convert BSON documents to other formats before returning (This can be achieved with a simple call to `map()` so it wasn't done)
-* Make it available in Maven Central or Bintray
 * Add more tests
 
 If you can help with one or more of the above, or if you have suggestions of your own, send me an email or raise a PR and I will review it and add it.
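Review note on the `partitions` constant used in the Usage examples: the README does not say how MongoRDD divides a collection across Spark partitions. One common approach for cursor-based sources is to give each partition its own skip/limit window over the matching documents. The sketch below shows only that arithmetic, as an assumption about what such a split could look like and not as MongoRDD's confirmed implementation. The class `PartitionRanges` and the method `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionRanges {

    /**
     * Divides `total` documents into `partitions` contiguous windows.
     * Each entry is {skip, limit}; earlier windows absorb the remainder
     * so that window sizes differ by at most one.
     */
    static List<long[]> split(long total, int partitions) {
        List<long[]> ranges = new ArrayList<>();
        long base = total / partitions;
        long extra = total % partitions;
        long skip = 0;
        for (int i = 0; i < partitions; i++) {
            long limit = base + (i < extra ? 1 : 0);
            ranges.add(new long[] { skip, limit });
            skip += limit;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 matching documents read with partitions = 4, as in the README constants.
        for (long[] r : split(10, 4)) {
            System.out.println("skip=" + r[0] + " limit=" + r[1]);
        }
    }
}
```

Under this scheme a larger `partitions` value means more, smaller reads running in parallel, at the cost of more cursors opened against MongoDB.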