Mortar is a platform-as-a-service for Hadoop. With Mortar, you can run jobs on Hadoop using Apache Pig and Python without any special training.
Here we've included some example scripts that explore using MongoDB with Hadoop. These scripts use a sample of tweets from a single day, loaded into a read-only, publicly available MongoDB instance. To start using them:
Clone this repository to your computer and register it as a project with Mortar:

```sh
git clone git@github.com:mortardata/mongo-pig-examples.git
cd mongo-pig-examples
mortar register mongo-pig-examples
```
Once you've set up the project, use the `mortar illustrate` command to show data flowing through a given script, and use `mortar run` to run the script on a Hadoop cluster.
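For example (the script name here is just a placeholder; substitute any of the Pig scripts in this project):

```sh
# Show a sample of data flowing through each step of the script
mortar illustrate <script_name>

# Run the script on a Hadoop cluster
mortar run <script_name>
```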
For lots more help and tutorials on running Mortar, check out the Mortar Help site.
This Pig script returns some basic information about a MongoDB collection. Its output fields are:
- Field name. Embedded fields have their parent's field name prepended to their name. Every field that appears in any document in the collection is listed.
- Unique value count. The number of unique values associated with the field.
- Example value. An example value for the field.
- Example value type. The data type of the example value.
- Value count. The number of times the example value appeared for this field in the collection.
Each field is listed up to five times, once for each of its five most common example values, as sketched below.
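As a rough sketch of the Pig behind the "five most common example values" output (this is not the script's actual implementation; the relation names, input path, and the assumption that documents have already been flattened into (field name, value) pairs are placeholders):

```pig
-- Sketch only: assumes documents have already been flattened into
-- (field_name, value) pairs by an earlier step that is not shown here.
fields       = LOAD 'field_value_pairs' AS (field_name:chararray, value:chararray);

-- Count how often each value appears for each field
by_value     = GROUP fields BY (field_name, value);
value_counts = FOREACH by_value GENERATE
                   group.field_name AS field_name,
                   group.value      AS example_value,
                   COUNT(fields)    AS value_count;

-- Keep the five most common example values for each field
by_field     = GROUP value_counts BY field_name;
top_values   = FOREACH by_field {
                   ordered = ORDER value_counts BY value_count DESC;
                   limited = LIMIT ordered 5;
                   GENERATE FLATTEN(limited);
               };

STORE top_values INTO 'collection_characterization';
```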
This Pig script returns a single text field containing the Pig schema of the loaded collection. That schema can be copied directly into the MongoLoader constructor to load the collection. See Using MongoDB with Mortar for an explanation of why you might want to load your collection using a schema.
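For instance, the generated schema string would be pasted into the MongoLoader constructor roughly like this (a sketch only; the jar paths, connection string, and schema shown are placeholders rather than values from this project):

```pig
-- Register the MongoDB connector jars (paths are placeholders)
REGISTER '/path/to/mongo-java-driver.jar';
REGISTER '/path/to/mongo-hadoop-core.jar';
REGISTER '/path/to/mongo-hadoop-pig.jar';

-- Paste the schema string produced by this script into the constructor
tweets = LOAD 'mongodb://<host>:27017/<database>.<collection>'
         USING com.mongodb.hadoop.pig.MongoLoader('id:chararray, text:chararray, user:map[]');
```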
This Pig script goes through a small sample of a single day's worth of tweets and counts the number of times coffee was tweeted, bucketed into two-hour time blocks in the tweeter's local time.
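A very rough sketch of the counting logic (not the actual script; it assumes the tweets have already been loaded with a text field and a local_hour field derived from the tweeter's local time, and all names here are placeholders):

```pig
-- Sketch only: assumes each tweet record already carries its text and the
-- hour of day (0-23) in the tweeter's local time.
tweets        = LOAD 'tweet_sample' AS (text:chararray, local_hour:int);

-- Keep only tweets that mention coffee (case-insensitive)
coffee_tweets = FILTER tweets BY LOWER(text) MATCHES '.*coffee.*';

-- Bucket into two-hour blocks: hours 0-1 -> bucket 0, 2-3 -> 1, ..., 22-23 -> 11
bucketed      = FOREACH coffee_tweets GENERATE local_hour / 2 AS hour_bucket;

-- Count coffee tweets in each two-hour block
by_bucket     = GROUP bucketed BY hour_bucket;
coffee_counts = FOREACH by_bucket GENERATE
                    group           AS hour_bucket,
                    COUNT(bucketed) AS num_coffee_tweets;

STORE coffee_counts INTO 'coffee_counts_by_time_block';
```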