Updating README.md
Sriram Keerthi Madhava Kunjathur authored and committed Apr 15, 2016
1 parent 67604c8 commit 55b50a8
Showing 1 changed file with 44 additions and 35 deletions.
79 changes: 44 additions & 35 deletions README.md
@@ -4,6 +4,50 @@ Simple MongoRDD to read data from MongoDB into Spark
## Build Status
![MongoRDD Travis-CI Build Status](https://travis-ci.org/caffinc/MongoRDD.svg?branch=master)

## Usage

MongoRDD is on Bintray and Maven Central:

    <dependency>
        <groupId>com.caffinc.sparktools</groupId>
        <artifactId>mongordd</artifactId>
        <version>1.0.1</version>
    </dependency>

MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.

Assume the following constants:

    val sc = new SparkContext(conf)
    val mongoClientUri = "mongodb://localhost:27017"
    val database = "DBName"
    val collection = "CollectionName"
    val query = new Document(...)
    val partitions = 4

**Usage in Scala:**

    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)

**Usage in Java:**

    new JavaRDD<>(
        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
        ClassManifestFactory$.MODULE$.fromClass(Document.class)
    ).map(...)

[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.

## Tests

There is just one extensive test, which launches an embedded MongoDB instance, writes dummy values into it, and tests MongoRDD on a local Spark instance. The test Works on my Machine™ and on Travis-CI (which is awesome!).

It might not work on your machine for the following reasons:

* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which downloads several megabytes the first time it runs. This might be slow, and the test might time out. Comment out the line that makes `setUp()` fail on slow starts and try again.
* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email ([email protected]) if you think I can help :)

## Dependencies

These are not absolute, but are current (probably) as of 3rd March, 2016. It should be trivial to upgrade or downgrade versions as required.
@@ -38,47 +82,12 @@ These are not absolute, but are current (probably) as of 3rd March, 2016. It sho
</dependency>
</dependencies>


## Usage

MongoRDD extends the Spark RDD class and provides a way to read from MongoDB directly into Spark.

Usage in Scala:

    val sc = new SparkContext(conf)
    val mongoClientUri = "mongodb://localhost:27017"
    val database = "DBName"
    val collection = "CollectionName"
    val query = new Document(...)
    val partitions = 4
    new MongoRDD(sc, mongoClientUri, database, collection, query, partitions).map(...)

Usage in Java (Assume constants are the same):

    new JavaRDD<>(
        new MongoRDD(sc, mongoClientUri, database, collection, query, partitions),
        ClassManifestFactory$.MODULE$.fromClass(Document.class)
    ).map(...)

[MongoClientURI](https://docs.mongodb.org/manual/reference/connection-string/ "Mongo Connection String") is one of the simplest ways to connect to a MongoDB instance. Feel free to extend this, and raise a Pull Request if you think it should be included in this repo.

## Tests

There is just one extensive test, which launches an embedded MongoDB instance, writes dummy values into it, and tests MongoRDD on a local Spark instance. The test Works on my Machine™ and on Travis-CI (which is awesome!).

It might not work on your machine for the following reasons:

* It uses an [Embedded MongoDB](https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo) instance, which downloads several megabytes the first time it runs. This might be slow, and the test might time out. Comment out the line that makes `setUp()` fail on slow starts and try again.
* You might have an older version of Spark in your dependencies which might have a bug while running on Windows. Are you able to run Spark for other stuff without issues?
* You're channeling evil spirits which don't like MongoDB. Pray to your God and hope for the best, or send me an email ([email protected]) if you think I can help :)

## Ideas

There are a few things that can be done to extend this:

* Provide other means of connecting to MongoDB
* Make the RDD generic, and provide an interface to convert BSON documents to other formats before returning (This can be achieved with a simple call to `map()` so it wasn't done)
* Make it available in Maven Central or Bintray
* Add more tests

If you can help with one or more of the above, or if you have suggestions of your own, send me an email or raise a PR and I will review it and add it.
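
A minimal sketch of the `map()` conversion mentioned in the list above, written in plain Java so it runs without Spark or MongoDB. It assumes each record behaves like a `Map<String, Object>` (which `org.bson.Document` implements). The `User` type, the `toUser` and `convertAll` helpers, and the `name` and `age` fields are hypothetical and not part of MongoRDD.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DocumentMappingSketch {

    /** Hypothetical target type for illustration; not part of MongoRDD. */
    static final class User {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Converts one document-like map into a User. */
    static User toUser(Map<String, Object> doc) {
        String name = (String) doc.get("name");
        // Numeric fields may arrive as Integer, Long or Double, so go through Number.
        int age = ((Number) doc.get("age")).intValue();
        return new User(name, age);
    }

    /** Stand-in for rdd.map(...): applies toUser to every document in a local list. */
    static List<User> convertAll(List<Map<String, Object>> docs) {
        return docs.stream()
                .map(DocumentMappingSketch::toUser)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Ada");
        doc.put("age", 36L); // stored as a Long on purpose

        List<User> users = convertAll(Arrays.asList(doc));
        System.out.println(users.get(0).name + " " + users.get(0).age); // prints "Ada 36"
    }
}
```

With a real MongoRDD, a function like `toUser` is what you would pass to `map(...)`. That wiring is left out here because it needs a running Spark context.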