diff --git a/core/pom.xml b/core/pom.xml
index 93dadafe57046..fe6b2daba0581 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,10 +35,6 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-openstack</artifactId>
-    </dependency>
     <dependency>
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>
diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md
index a1aac02f6275e..a3179fce59c13 100644
--- a/docs/openstack-integration.md
+++ b/docs/openstack-integration.md
@@ -1,110 +1,237 @@
-yout: global
-title: Accessing Openstack Swift storage from Spark
+layout: global
+title: Accessing OpenStack Swift from Spark
 ---
-# Accessing Openstack Swift storage from Spark
+# Accessing OpenStack Swift from Spark

 Spark's file interface allows it to process data in Openstack Swift using the same URI
 formats that are supported for Hadoop. You can specify a path in Swift as input through a
-URI of the form `swift://<container>.<PROVIDER>/path`. You will also need to set your
-Swift security credentials, through `SparkContext.hadoopConfiguration`.
-
-#Configuring Hadoop to use Openstack Swift
-Openstack Swift driver was merged in Hadoop verion 2.3.0
-([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users that wish to use
-previous Hadoop versions will need to configure Swift driver manually. Current Swift driver
+URI of the form `swift://<container>.<PROVIDER>/path`.
-
-	<property>
-		<name>fs.swift.impl</name>
-		<value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
-	</property>
-
+The current Swift driver requires Swift to use the Keystone authentication method; there are recent
+efforts to support also temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).

-#Configuring Swift
+# Configuring Swift
 Proxy server of Swift should include `list_endpoints` middleware. More information
-available [here] (https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py)
-
-#Configuring Spark
-To use Swift driver, Spark need to be compiled with `hadoop-openstack-2.3.0.jar`
-distributted with Hadoop 2.3.0. For the Maven builds, Spark's main pom.xml should include
-
-	<swift.version>2.3.0</swift.version>
+available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

+# Compilation of Spark
+Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with Hadoop 2.3.0.
+For Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include:
+
 	<dependency>
 	  <groupId>org.apache.hadoop</groupId>
 	  <artifactId>hadoop-openstack</artifactId>
-	  <version>${swift.version}</version>
+	  <version>2.3.0</version>
 	</dependency>
+
-in addition, pom.xml of the `core` and `yarn` projects should include
+In addition, both the `core` and `yarn` projects should add `hadoop-openstack` to the
+`dependencies` section of their `pom.xml`:
+
 	<dependency>
 	  <groupId>org.apache.hadoop</groupId>
 	  <artifactId>hadoop-openstack</artifactId>
 	</dependency>
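+
+For builds that use sbt, an equivalent dependency declaration is sketched below (a sketch only,
+assuming a standard `build.sbt`; the coordinates mirror the Maven snippet above):
+
+    // build.sbt: add the Swift filesystem driver shipped with Hadoop 2.3.0
+    libraryDependencies += "org.apache.hadoop" % "hadoop-openstack" % "2.3.0"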
+
+# Configuration of Spark
+Create `core-site.xml` and place it inside Spark's `conf` directory. There are two main categories
+of parameters to be configured: the declaration of the Swift driver, and the parameters required
+by Keystone.
+
+Configuring Hadoop to use the Swift file system is achieved via:
+
+<table class="table">
+<tr><th>Property Name</th><th>Value</th></tr>
+<tr>
+  <td><code>fs.swift.impl</code></td>
+  <td><code>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</code></td>
+</tr>
+</table>
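+
+The same declaration can also be made at run time instead of in `core-site.xml`; a minimal sketch
+for the Scala shell (assuming a running `SparkContext` named `sc`):
+
+    // declare the Swift file system implementation programmatically
+    sc.hadoopConfiguration.set("fs.swift.impl",
+      "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")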
+
+Additional parameters are required by Keystone and should be provided to the Swift driver. Those
+parameters are used to perform authentication in Keystone before accessing Swift. The following
+table lists the mandatory Keystone parameters; `PROVIDER` can be any name.
+
+<table class="table">
+<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
+<tr><td><code>fs.swift.service.PROVIDER.auth.url</code></td><td>Keystone Authentication URL</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td><td>Keystone endpoints prefix</td><td>Optional</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.tenant</code></td><td>Tenant</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.username</code></td><td>Username</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.password</code></td><td>Password</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.http.port</code></td><td>HTTP port</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.region</code></td><td>Keystone region</td><td>Mandatory</td></tr>
+<tr><td><code>fs.swift.service.PROVIDER.public</code></td><td>Indicates if all URLs are public</td><td>Mandatory</td></tr>
+</table>
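+
+As a quick sanity check, one can verify from the Scala shell that the mandatory parameters are all
+set; a minimal sketch (assuming `PROVIDER=SparkTest` and a running `SparkContext` named `sc`):
+
+    // report any mandatory Swift keys still missing from the Hadoop configuration
+    val provider = "SparkTest"
+    val mandatory = Seq("auth.url", "tenant", "username", "password", "http.port", "region", "public")
+    val missing = mandatory.filter(k => sc.hadoopConfiguration.get(s"fs.swift.service.$provider.$k") == null)
+    if (missing.nonEmpty) println("Missing Swift configuration keys: " + missing.mkString(", "))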
+
+For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password
+`testing`, defined for tenant `test`. Then `core-site.xml` should include:
+
+    <configuration>
+      <property>
+        <name>fs.swift.impl</name>
+        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.url</name>
+        <value>http://127.0.0.1:5000/v2.0/tokens</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+        <value>endpoints</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.http.port</name>
+        <value>8080</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.region</name>
+        <value>RegionOne</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.public</name>
+        <value>true</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.tenant</name>
+        <value>test</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.username</name>
+        <value>tester</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.password</name>
+        <value>testing</value>
+      </property>
+    </configuration>
+
-Additional parameters has to be provided to the Swift driver. Swift driver will use those
-parameters to perform authentication in Keystone prior accessing Swift. List of mandatory
-parameters is : `fs.swift.service.<PROVIDER>.auth.url`,
-`fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`,
-`fs.swift.service.<PROVIDER>.username`,
-`fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`,
-`fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.public`, where
-`PROVIDER` is any name. `fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone
-authentication URL.
-
-Create core-sites.xml with the mandatory parameters and place it under /spark/conf
-directory. For example:
-
-	<property>
-		<name>fs.swift.service.<PROVIDER>.auth.url</name>
-		<value>http://127.0.0.1:5000/v2.0/tokens</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
-		<value>endpoints</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.http.port</name>
-		<value>8080</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.region</name>
-		<value>RegionOne</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.public</name>
-		<value>true</value>
-	</property>
-
-We left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
-`fs.swift.service.<PROVIDER>.password`. The best way to provide those parameters to
-SparkContext in run time, which seems to be impossible yet.
-Another approach is to adapt Swift driver to obtain those values from system environment
-variables. For now we provide them via core-sites.xml.
-Assume a tenant `test` with user `tester` was defined in Keystone, then the core-sites.xml
-shoud include:
-
-	<property>
-		<name>fs.swift.service.<PROVIDER>.tenant</name>
-		<value>test</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.username</name>
-		<value>tester</value>
-	</property>
-	<property>
-		<name>fs.swift.service.<PROVIDER>.password</name>
-		<value>testing</value>
-	</property>
-
-# Usage
-Assume there exists Swift container `logs` with an object `data.log`. To access `data.log`
-from Spark the `swift://` scheme should be used. For example:
-
-	val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
+Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username` and
+`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
+`core-site.xml` is not always a good approach.
+We suggest keeping those parameters in `core-site.xml` only for testing purposes, when running
+Spark via `spark-shell`. For job submissions they should be provided via
+`sparkContext.hadoopConfiguration`, as sketched below.
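+
+One way to do this without hard-coding the credentials is to read them from environment variables
+at run time; a minimal sketch for `spark-shell` (assuming `PROVIDER=SparkTest`; the variable names
+`SWIFT_TENANT`, `SWIFT_USERNAME` and `SWIFT_PASSWORD` are hypothetical):
+
+    // pass the Keystone credentials to the Swift driver without storing them in core-site.xml
+    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", sys.env("SWIFT_TENANT"))
+    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", sys.env("SWIFT_USERNAME"))
+    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", sys.env("SWIFT_PASSWORD"))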
+
+# Usage examples
+Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone contains
+tenant `test` and user `tester` with password `testing`. In our example we define
+`PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`.
+To access `data.log` from Spark, the `swift://` scheme should be used.
+
+## Running Spark via spark-shell
+Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`,
+`fs.swift.service.SparkTest.username` and `fs.swift.service.SparkTest.password`. Run Spark via
+`spark-shell` and access Swift via the `swift://` scheme:
+
+	val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
+	sfdata.count()
+
+## Job submission via spark-submit
+In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`,
+`fs.swift.service.SparkTest.username` and `fs.swift.service.SparkTest.password`, since they are
+set programmatically. An example of Java usage:
+
+	/* SimpleApp.java */
+	import org.apache.spark.api.java.*;
+	import org.apache.spark.SparkConf;
+	import org.apache.spark.api.java.function.Function;
+
+	public class SimpleApp {
+	  public static void main(String[] args) {
+	    String logFile = "swift://logs.SparkTest/data.log";
+	    SparkConf conf = new SparkConf().setAppName("Simple Application");
+	    JavaSparkContext sc = new JavaSparkContext(conf);
+	    // provide the Keystone credentials at run time
+	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
+	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
+	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
+
+	    JavaRDD<String> logData = sc.textFile(logFile).cache();
+
+	    long num = logData.count();
+
+	    System.out.println("Total number of lines: " + num);
+	  }
+	}
+
+The directory structure is
+
+	find .
+	./src
+	./src/main
+	./src/main/java
+	./src/main/java/SimpleApp.java
+
+The Maven `pom.xml` is
+
+	<project>
+	  <groupId>edu.berkeley</groupId>
+	  <artifactId>simple-project</artifactId>
+	  <modelVersion>4.0.0</modelVersion>
+	  <name>Simple Project</name>
+	  <packaging>jar</packaging>
+	  <version>1.0</version>
+	  <repositories>
+	    <repository>
+	      <id>Akka repository</id>
+	      <url>http://repo.akka.io/releases</url>
+	    </repository>
+	  </repositories>
+	  <build>
+	    <plugins>
+	      <plugin>
+	        <groupId>org.apache.maven.plugins</groupId>
+	        <artifactId>maven-compiler-plugin</artifactId>
+	        <version>2.3</version>
+	        <configuration>
+	          <source>1.6</source>
+	          <target>1.6</target>
+	        </configuration>
+	      </plugin>
+	    </plugins>
+	  </build>
+	  <dependencies>
+	    <dependency>
+	      <groupId>org.apache.spark</groupId>
+	      <artifactId>spark-core_2.10</artifactId>
+	      <version>1.0.0</version>
+	    </dependency>
+	  </dependencies>
+	</project>
+
+Compile and execute
+
+	mvn package
+	$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar
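+
+An equivalent Scala application is sketched below (same hypothetical provider, container and
+credentials as the Java example; it is compiled and submitted with `spark-submit` in the same way):
+
+    /* SimpleApp.scala, a sketch mirroring the Java example above */
+    import org.apache.spark.{SparkConf, SparkContext}
+
+    object SimpleApp {
+      def main(args: Array[String]): Unit = {
+        val conf = new SparkConf().setAppName("Simple Application")
+        val sc = new SparkContext(conf)
+        // provide the Keystone credentials at run time, as in the Java example
+        sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
+        sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
+        sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
+
+        val logData = sc.textFile("swift://logs.SparkTest/data.log").cache()
+        println("Total number of lines: " + logData.count())
+      }
+    }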
diff --git a/pom.xml b/pom.xml
index 79cf5fdc23d01..92cf6bab1edf8 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,8 +132,7 @@
     <codahale.metrics.version>3.0.0</codahale.metrics.version>
     <avro.version>1.7.6</avro.version>
     <jets3t.version>0.7.1</jets3t.version>
-    <swift.version>2.3.0</swift.version>
-
+
     <PermGen>64m</PermGen>
     <MaxPermGen>512m</MaxPermGen>
@@ -585,11 +584,6 @@
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-openstack</artifactId>
-        <version>${swift.version}</version>
-      </dependency>
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-yarn-api</artifactId>
@@ -1030,11 +1024,6 @@
         <artifactId>hadoop-client</artifactId>
         <scope>provided</scope>
       </dependency>
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-openstack</artifactId>
-        <scope>provided</scope>
-      </dependency>
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-yarn-api</artifactId>
diff --git a/yarn/pom.xml b/yarn/pom.xml
index e58d8312f1a86..6993c89525d8c 100644
--- a/yarn/pom.xml
+++ b/yarn/pom.xml
@@ -55,10 +55,6 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-openstack</artifactId>
-    </dependency>
     <dependency>
       <groupId>org.scalatest</groupId>
       <artifactId>scalatest_${scala.binary.version}</artifactId>