SPARK-938 - Openstack Swift object storage support #1010
Closed
Changes from all commits (14)
b6c37ef  Openstack Swift support (gilv)
ce483d7  SPARK-938 - Openstack Swift object storage support (gilv)
eff538d  SPARK-938 - Openstack Swift object storage support (gilv)
9b625b5  Merge branch 'master' of https://github.com/gilv/spark (gilv)
2aba763  Fix to docs/openstack-integration.md (gilv)
c977658  Merge branch 'master' of https://github.com/gilv/spark (gilv)
39a9737  Spark integration with Openstack Swift (gilv)
eb22295  Merge pull request #1010 from gilv/master (rxin)
99f095d  Pending openstack changes. (rxin)
cca7192  Removed white spases from pom.xml (gilv)
47ce99d  Merge branch 'master' into openstack (rxin)
ac0679e  Fixed an unclosed tr. (rxin)
6994827  Merge pull request #1 from rxin/openstack (gilv)
9233fef  Fixed typos (gilv)
docs/openstack-integration.md
@@ -0,0 +1,269 @@
---
layout: global
title: OpenStack Integration
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Accessing OpenStack Swift from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
Swift security credentials, through <code>core-site.xml</code> or via
<code>SparkContext.hadoopConfiguration</code>.
The OpenStack Swift driver was merged into Hadoop 2.3.0
([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545));
users who wish to use earlier Hadoop versions will need to configure the Swift driver manually.
The current Swift driver requires Swift to use the Keystone authentication method. There are ongoing
efforts to support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).
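
For example, once credentials are configured, reading an object looks like this (a minimal
sketch; <code>container</code>, <code>PROVIDER</code>, and <code>data.txt</code> are placeholder names):

{% highlight scala %}
// Read an object from Swift; the Swift driver resolves the PROVIDER name
// against the fs.swift.service.PROVIDER.* configuration properties.
val rdd = sc.textFile("swift://container.PROVIDER/data.txt")
{% endhighlight %}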

# Configuring Swift

The Swift proxy server must include the <code>list_endpoints</code> middleware. More information is
available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

# Dependencies

Spark should be compiled with <code>hadoop-openstack-2.3.0.jar</code>, which is distributed with
Hadoop 2.3.0. For Maven builds, the <code>dependencyManagement</code> section of Spark's main
<code>pom.xml</code> should include:

{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>2.3.0</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}

In addition, both the <code>core</code> and <code>yarn</code> projects should add
<code>hadoop-openstack</code> to the <code>dependencies</code> section of their
<code>pom.xml</code>:

{% highlight xml %}
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
  </dependency>
  ...
</dependencies>
{% endhighlight %}

# Configuration Parameters

Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
There are two main categories of parameters that should be configured: the declaration of the
Swift driver, and the parameters required by Keystone.

Hadoop is configured to use the Swift file system via:

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td><code>fs.swift.impl</code></td>
  <td><code>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</code></td>
</tr>
</table>

Additional parameters are required by Keystone and should be provided to the Swift driver. These
parameters are used to authenticate with Keystone in order to access Swift. The following table
lists the mandatory Keystone parameters. <code>PROVIDER</code> can be any name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume <code>PROVIDER=SparkTest</code> and that Keystone contains the user <code>tester</code> with password <code>testing</code>
defined for the tenant <code>test</code>. Then <code>core-site.xml</code> should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}

Notice that
<code>fs.swift.service.PROVIDER.tenant</code>,
<code>fs.swift.service.PROVIDER.username</code>, and
<code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in
<code>core-site.xml</code> is not always a good approach.
We suggest keeping these parameters in <code>core-site.xml</code> only for testing purposes, when running Spark
via <code>spark-shell</code>.
For job submissions they should be provided via <code>SparkContext.hadoopConfiguration</code>.
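
For example, a minimal Scala sketch of providing the credentials programmatically before reading
from Swift (the provider name <code>SparkTest</code> and the values match the example above):

{% highlight scala %}
// Supply the sensitive Keystone parameters at runtime instead of core-site.xml.
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
{% endhighlight %}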

# Usage Examples

Assume Keystone's authentication URL is <code>http://127.0.0.1:5000/v2.0/tokens</code> and that Keystone contains the tenant <code>test</code> and the user <code>tester</code> with password <code>testing</code>. In our example we define <code>PROVIDER=SparkTest</code>. Assume that Swift contains a container <code>logs</code> with an object <code>data.log</code>. To access <code>data.log</code> from Spark, the <code>swift://</code> scheme should be used.

## Running Spark via spark-shell

Make sure that <code>core-site.xml</code> contains <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>,
and <code>fs.swift.service.SparkTest.password</code>. Run Spark via <code>spark-shell</code> and access Swift via the <code>swift://</code> scheme.

{% highlight scala %}
val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
sfdata.count()
{% endhighlight %}
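
The same scheme works for output as well; for instance (a sketch, assuming the <code>logs</code>
container is writable and the output path does not already exist):

{% highlight scala %}
// Save a derived RDD back to the same Swift container (hypothetical output object).
sfdata.filter(_.contains("ERROR")).saveAsTextFile("swift://logs.SparkTest/errors")
{% endhighlight %}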

## Sample Application

In this case <code>core-site.xml</code> need not contain <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>,
or <code>fs.swift.service.SparkTest.password</code>, since they are set programmatically. Example of Java usage:

{% highlight java %}
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "swift://logs.SparkTest/data.log";
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Provide the sensitive Keystone parameters programmatically, using the
    // provider name SparkTest defined for this example.
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");

    JavaRDD<String> logData = sc.textFile(logFile).cache();
    long num = logData.count();

    System.out.println("Total number of lines: " + num);
  }
}
{% endhighlight %}

The directory structure is:

{% highlight bash %}
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
{% endhighlight %}

The Maven <code>pom.xml</code> should contain:

{% highlight xml %}
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>
{% endhighlight %}

Compile and execute:

{% highlight bash %}
mvn package
$SPARK_HOME/bin/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
{% endhighlight %}
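
If you build with sbt rather than Maven, a roughly equivalent <code>build.sbt</code> would be
(a sketch under the same version assumptions; run <code>sbt package</code> instead of <code>mvn package</code>):

{% highlight scala %}
// build.sbt -- hypothetical sbt equivalent of the pom.xml above.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// Matches the spark-core_2.10 artifact used in the Maven build.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
{% endhighlight %}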

While it might make sense for this to eventually get into Spark, we need to look more carefully at the dependency that this brings. Since Spark runs differently from Hadoop (Spark is really just a user-level library), users can always include openstack support in their project dependencies (with the documentation you provide). For the time being, let's first update the documentation so it is obvious & clear & easy for users to add openstack support, and then we can discuss more about whether/when we should push the openstack dependency into Spark.

There is now a change here that messes up spacing - @gilv, can you remove this change?

Actually, I'll just fix this up on merge. This is Patrick, by the way, logged into the QA account by accident!