
SPARK-938 - Openstack Swift object storage support #1010

Closed
wants to merge 14 commits into from
6 changes: 5 additions & 1 deletion core/pom.xml
@@ -35,7 +35,11 @@
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</dependency>
<dependency>
Contributor comment:
While it might make sense for this to eventually get into Spark, we need to look more carefully at the dependency that this brings. Since Spark runs differently from Hadoop (Spark is really just a user-level library), users can always include OpenStack support in their project dependencies (with the documentation you provide). For the time being, let's first update the documentation so it is obvious, clear, and easy for users to add OpenStack support, and then we can discuss more about whether/when we should push the OpenStack dependency into Spark.

Comment:
There is now a change here that messes up spacing - @gilv can you remove this change?

Comment:
Actually I'll just fix this up on merge. This is patrick by the way, logged into the QA account by accident!

<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
</dependency>
<dependency>
<groupId>net.java.dev.jets3t</groupId>
<artifactId>jets3t</artifactId>
</dependency>
110 changes: 110 additions & 0 deletions docs/openstack-integration.md
@@ -0,0 +1,110 @@
---
layout: global
title: Accessing OpenStack Swift storage from Spark
---

# Accessing OpenStack Swift storage from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://<container>.<service_provider>/path`. You will also need to set your
Swift security credentials through `SparkContext.hadoopConfiguration`.
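The URI scheme above can be exercised with a small helper. This is only a sketch: the `swiftPath` function and the `myprovider` name are illustrative, not part of the driver's API, and the commented lines assume a live, already-configured `SparkContext`.

```scala
// Hypothetical helper that builds swift://<container>.<provider>/<path> URIs.
def swiftPath(container: String, provider: String, path: String): String =
  s"swift://$container.$provider/${path.stripPrefix("/")}"

// With a live SparkContext `sc` (sketch; assumes credentials are configured):
//   sc.hadoopConfiguration.set("fs.swift.service.myprovider.username", "tester")
//   val rdd = sc.textFile(swiftPath("logs", "myprovider", "data.log"))
println(swiftPath("logs", "myprovider", "/data.log"))  // → swift://logs.myprovider/data.log
```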

# Configuring Hadoop to use OpenStack Swift
The OpenStack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver
requires Swift to use the Keystone authentication method. There are also recent efforts to
support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).
To configure Hadoop to work with Swift, modify Hadoop's core-site.xml and
set up the Swift filesystem:

<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
</configuration>

Contributor comment: Is this needed? Can we just put this in core-site.xml under conf? (Basically removing the configuring Hadoop section)

# Configuring Swift
The Swift proxy server should include the `list_endpoints` middleware. More information is
available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

# Configuring Spark
To use the Swift driver, Spark needs to be compiled with `hadoop-openstack-2.3.0.jar`,
distributed with Hadoop 2.3.0. For Maven builds, Spark's main pom.xml should include:

<swift.version>2.3.0</swift.version>


<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
<version>${swift.version}</version>
</dependency>

In addition, the pom.xml files of the `core` and `yarn` projects should include:

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
</dependency>
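For sbt-based builds, a hedged equivalent of the Maven declaration above (the PR itself only touches the Maven poms; the version mirrors the `swift.version` property) would be:

```scala
// build.sbt fragment (illustrative sketch, not part of this PR)
libraryDependencies += "org.apache.hadoop" % "hadoop-openstack" % "2.3.0"
```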


Additional parameters must be provided to the Swift driver, which uses them to authenticate
with Keystone before accessing Swift. The mandatory parameters are (where `PROVIDER` is any
name you choose):

* `fs.swift.service.<PROVIDER>.auth.url`
* `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`
* `fs.swift.service.<PROVIDER>.tenant`
* `fs.swift.service.<PROVIDER>.username`
* `fs.swift.service.<PROVIDER>.password`
* `fs.swift.service.<PROVIDER>.http.port`
* `fs.swift.service.<PROVIDER>.public`

`fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone authentication URL.

Contributor comment: Might make sense to make a table or a bullet list for this, instead of just comma separate lists.
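A quick way to sanity-check a configuration against this list is sketched below; the function is hypothetical (not part of the driver), and the Hadoop `Configuration` is modeled as a plain `Map` for illustration.

```scala
// Hypothetical check: which mandatory fs.swift.service.<PROVIDER>.* keys are unset?
def missingSwiftKeys(conf: Map[String, String], provider: String): Seq[String] = {
  val required = Seq("auth.url", "auth.endpoint.prefix", "tenant",
    "username", "password", "http.port", "public")
  required.map(k => s"fs.swift.service.$provider.$k").filterNot(conf.contains)
}

// Only auth.url is set, so six of the seven mandatory keys are missing.
val conf = Map("fs.swift.service.myprovider.auth.url" -> "http://127.0.0.1:5000/v2.0/tokens")
println(missingSwiftKeys(conf, "myprovider").size)  // → 6
```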

Create core-site.xml with the mandatory parameters and place it under Spark's `conf`
directory. For example:

<property>
  <name>fs.swift.service.<PROVIDER>.auth.url</name>
  <value>http://127.0.0.1:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
  <value>endpoints</value>
</property>
<property>
  <name>fs.swift.service.<PROVIDER>.http.port</name>
  <value>8080</value>
</property>
<property>
  <name>fs.swift.service.<PROVIDER>.region</name>
  <value>RegionOne</value>
</property>
<property>
  <name>fs.swift.service.<PROVIDER>.public</name>
  <value>true</value>
</property>

We are left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
and `fs.swift.service.<PROVIDER>.password`. The best way would be to provide those parameters
to SparkContext at run time, which does not yet seem to be possible.
Another approach is to adapt the Swift driver to obtain those values from system environment
variables. For now we provide them via core-site.xml.

Contributor comment: This is still doable at runtime by setting the job conf, isn't it?

Assume a tenant `test` with user `tester` was defined in Keystone; then core-site.xml
should include:
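If runtime configuration is possible, as the reviewer's comment suggests, it might look like the sketch below. The key-building helper and the `myprovider` name are illustrative; the Spark calls are commented out because they assume a live `SparkContext`.

```scala
// Hypothetical helper listing the three credential keys for a provider.
def swiftCredentialKeys(provider: String): Seq[String] = Seq(
  s"fs.swift.service.$provider.tenant",
  s"fs.swift.service.$provider.username",
  s"fs.swift.service.$provider.password")

// With a live SparkContext `sc` (untested assumption that job-conf settings
// reach the Swift driver):
//   swiftCredentialKeys("myprovider").zip(Seq("test", "tester", "testing"))
//     .foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
println(swiftCredentialKeys("myprovider").head)  // → fs.swift.service.myprovider.tenant
```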

<property>
<name>fs.swift.service.<PROVIDER>.tenant</name>
<value>test</value>
</property>
<property>
<name>fs.swift.service.<PROVIDER>.username</name>
<value>tester</value>
</property>
<property>
<name>fs.swift.service.<PROVIDER>.password</name>
<value>testing</value>
</property>

# Usage
Assume there exists a Swift container `logs` with an object `data.log`. To access `data.log`
from Spark, use the `swift://` scheme. For example:

val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
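The resulting RDD supports the usual transformations. Purely for illustration, the filter below runs on a plain Scala collection standing in for `sfdata`; reading or writing the hypothetical `logs` container would require a configured `SparkContext`.

```scala
// Local stand-in for sfdata; with Spark this would come from
// sc.textFile("swift://logs.<PROVIDER>/data.log").
val lines = Seq("INFO start", "ERROR disk full", "INFO done")
val errors = lines.filter(_.contains("ERROR"))
// With Spark, results could be written back, e.g.:
//   errors.saveAsTextFile("swift://logs.<PROVIDER>/errors")
println(errors.mkString(","))  // → ERROR disk full
```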

11 changes: 11 additions & 0 deletions pom.xml
@@ -132,6 +132,7 @@
<codahale.metrics.version>3.0.0</codahale.metrics.version>
<avro.version>1.7.6</avro.version>
<jets3t.version>0.7.1</jets3t.version>
<swift.version>2.3.0</swift.version>

<PermGen>64m</PermGen>
<MaxPermGen>512m</MaxPermGen>
@@ -584,6 +585,11 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
<version>${swift.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-api</artifactId>
@@ -1024,6 +1030,11 @@
<artifactId>hadoop-client</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-api</artifactId>
4 changes: 4 additions & 0 deletions yarn/pom.xml
@@ -55,6 +55,10 @@
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>