SPARK-938 - Openstack Swift object storage support #1010
Closed
Changes from 2 commits
14 commits
- b6c37ef Openstack Swift support (gilv)
- ce483d7 SPARK-938 - Openstack Swift object storage support (gilv)
- eff538d SPARK-938 - Openstack Swift object storage support (gilv)
- 9b625b5 Merge branch 'master' of https://github.com/gilv/spark (gilv)
- 2aba763 Fix to docs/openstack-integration.md (gilv)
- c977658 Merge branch 'master' of https://github.com/gilv/spark (gilv)
- 39a9737 Spark integration with Openstack Swift (gilv)
- eb22295 Merge pull request #1010 from gilv/master (rxin)
- 99f095d Pending openstack changes. (rxin)
- cca7192 Removed white spases from pom.xml (gilv)
- 47ce99d Merge branch 'master' into openstack (rxin)
- ac0679e Fixed an unclosed tr. (rxin)
- 6994827 Merge pull request #1 from rxin/openstack (gilv)
- 9233fef Fixed typos (gilv)
@@ -0,0 +1,83 @@
---
layout: global
title: Accessing Openstack Swift storage from Spark
---

# Accessing Openstack Swift storage from Spark

Spark's file interface allows it to process data in Openstack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift://<container>.<PROVIDER>/path`. You will also need to set your Swift security credentials through `SparkContext.hadoopConfiguration`.
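For example, these credentials can be set programmatically on the `SparkContext`'s Hadoop configuration before reading from Swift. The following is only a minimal sketch: the provider name `example` and the credential values are placeholders, and the full list of required keys is given in the configuration sections below.

    // Minimal sketch: "example" is a placeholder provider name and the credential
    // values are placeholders; see the full parameter list later in this page.
    val hconf = sc.hadoopConfiguration
    hconf.set("fs.swift.service.example.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
    hconf.set("fs.swift.service.example.tenant", "test")
    hconf.set("fs.swift.service.example.username", "tester")
    hconf.set("fs.swift.service.example.password", "testing")

    // Read an object from the container "logs" of the "example" provider.
    val data = sc.textFile("swift://logs.example/data.log")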
# Configuring Hadoop to use Openstack Swift

The Openstack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually.

<h2>Hadoop 2.3.0 and above</h2>
The current Hadoop driver requires Swift to use Keystone authentication; there is an additional effort to support temp auth for Hadoop ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)). To configure Hadoop to work with Swift, modify Hadoop's core-site.xml and set up the Swift filesystem:
    <configuration>
      <property>
        <name>fs.swift.impl</name>
        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
      </property>
    </configuration>
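If you prefer not to edit Hadoop's core-site.xml, a possible alternative (a sketch, not part of the original setup instructions) is to set the same property on the `SparkContext`'s Hadoop configuration; note that this only affects Spark jobs, not other Hadoop tools.

    // Sketch: register the Swift filesystem implementation programmatically
    // instead of in core-site.xml. This affects only the current SparkContext.
    sc.hadoopConfiguration.set(
      "fs.swift.impl",
      "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")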
<h2>Configuring Spark - stand alone cluster</h2>
You need to configure compute-classpath.sh and add the Hadoop classpath entries:
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/common/lib/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/hdfs/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/tools/lib/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/hdfs/lib/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/mapreduce/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/mapreduce/lib/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/yarn/*
    CLASSPATH=$CLASSPATH:<YOUR HADOOP PATH>/share/hadoop/yarn/lib/*
Additional parameters have to be provided to Hadoop from Spark. The Swift driver of Hadoop uses these parameters to perform the Keystone authentication needed to access Swift.
The list of mandatory parameters is: `fs.swift.service.<PROVIDER>.auth.url`, `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
`fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.public`.
Create core-site.xml and place it under Spark's conf directory. Configure core-site.xml with the general Keystone parameters, for example:
    <property>
      <name>fs.swift.service.<PROVIDER>.auth.url</name>
      <value>http://127.0.0.1:5000/v2.0/tokens</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
      <value>endpoints</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.http.port</name>
      <value>8080</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.region</name>
      <value>RegionOne</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.public</name>
      <value>true</value>
    </property>
We are left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username` and `fs.swift.service.<PROVIDER>.password`. The best way would be to provide these parameters to SparkContext at run time, which does not seem to be possible yet.
Another approach is to change the Hadoop Swift FS driver to provide them via system environment variables. For now we provide them via core-site.xml:
    <property>
      <name>fs.swift.service.<PROVIDER>.tenant</name>
      <value>test</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.username</name>
      <value>tester</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.password</name>
      <value>testing</value>
    </property>
<h3>Usage</h3>
Assume you have a Swift container `logs` with an object `data.log`. You can use the `swift://` scheme to access objects from Swift:

    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
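The resulting RDD behaves like any other. As a follow-up sketch (not from the original instructions), you could count lines and write a filtered result back to the same container; the output path shown is a hypothetical example and assumes the Swift driver is also configured for writes.

    // Count the lines of the Swift object.
    val numLines = sfdata.count()

    // Keep only lines containing "ERROR" and write them back to Swift.
    // "swift://logs.<PROVIDER>/errors" is a hypothetical output path.
    val errors = sfdata.filter(line => line.contains("ERROR"))
    errors.saveAsTextFile("swift://logs.<PROVIDER>/errors")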
Is the swift jar not included in hadoop-client? Is there a way to specify this through Maven dependencies rather than manually including the path?