
Quick start Spark

sapessi edited this page Feb 27, 2018 · 24 revisions

You can use the aws-serverless-java-container library to run a Spark application in AWS Lambda. Inside your Lambda handler, the library proxies incoming events to the Spark instance.

In the repository we have included a sample Spark application to get you started.

Maven archetype

You can quickly create a new serverless Spark application using our Maven archetype. First, make sure Maven is installed in your environment and available in your PATH. Next, create a new application from a terminal or your favorite IDE. The archetype groupId is com.amazonaws.serverless.archetypes and the artifactId is aws-serverless-spark-archetype:

mvn archetype:generate -DarchetypeGroupId=com.amazonaws.serverless.archetypes -DarchetypeArtifactId=aws-serverless-spark-archetype -DarchetypeVersion=1.0.1 -DgroupId=my.service -DartifactId=my-service -Dversion=1.0-SNAPSHOT

The archetype sets up a new Maven project. The pom.xml includes the dependencies you need to build a basic Spark API that can consume and produce JSON data. The generated code includes a StreamLambdaHandler class, the main entry point for AWS Lambda; a SparkResources class that defines a /ping resource; and a set of unit tests that exercise the application.

The project also includes a file called sam.yaml. This is a SAM template that you can use to quickly test your application locally or deploy it to AWS. Open the README.md file in the project root folder for instructions on how to use SAM Local to run your Serverless API or deploy it to AWS.

Manual setup / Converting existing projects

1. Import dependencies

The first step is to import the Spark implementation of the library:

<dependency>
    <groupId>com.amazonaws.serverless</groupId>
    <artifactId>aws-serverless-java-container-spark</artifactId>
    <version>[0.8,)</version>
</dependency>

This also pulls in the aws-serverless-java-container-core and aws-lambda-java-core libraries as transitive dependencies.

2. Create the Lambda handler

In your application package declare a new class that implements Lambda's RequestStreamHandler interface. If you have configured API Gateway with a proxy integration, you can use the built-in POJOs AwsProxyRequest and AwsProxyResponse.

The next step is to declare the container handler object. The library exposes a static utility method that configures a SparkLambdaContainerHandler object for AWS proxy events. Declare the handler object as a static class field so that Lambda reuses the same instance for subsequent requests.

The handleRequest method of the class can use the handler object we declared in the previous step to send requests to the Spark application.

On a cold start, the static initializer creates the handler object and then configures the Spark routes. It's important to configure the Spark routes only after the handler is initialized.

public class StreamLambdaHandler implements RequestStreamHandler {
    private static SparkLambdaContainerHandler<AwsProxyRequest, AwsProxyResponse> handler;
    static {
        try {
            handler = SparkLambdaContainerHandler.getAwsProxyHandler();
            SparkResources.defineResources();
            Spark.awaitInitialization();
        } catch (ContainerInitializationException e) {
            // If initialization fails, re-throw the exception to force another cold start
            e.printStackTrace();
            throw new RuntimeException("Could not initialize Spark container", e);
        }
    }

    @Override
    public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context)
            throws IOException {
        handler.proxyStream(inputStream, outputStream, context);

        // just in case it wasn't closed by the mapper
        outputStream.close();
    }
}
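
To illustrate why the handler is held in a static field, here is a minimal sketch that uses only the JDK. StaticInitDemoHandler, KNOWN_PATH, initCount and invoke are hypothetical names made up for this example; the class is a stand-in for StreamLambdaHandler and does not use the AWS or Spark libraries. It assumes a trimmed-down proxy event that carries only the httpMethod and path fields. The static block runs once per JVM, so repeated invocations share the same setup, and the handler is driven with in-memory streams the way a local test might do it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for StreamLambdaHandler: no AWS or Spark classes involved.
class StaticInitDemoHandler {
    // Counts how many times the setup ran in this JVM.
    static int initCount = 0;
    // Stand-in for the routes that the real handler registers with Spark.
    private static final String KNOWN_PATH;

    static {
        // Runs once, when the class is first loaded (the cold start).
        initCount++;
        KNOWN_PATH = "/ping";
    }

    // Same stream-in, stream-out shape as handleRequest, minus the Lambda Context.
    void handleRequest(InputStream in, OutputStream out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        String event = new String(buffer.toByteArray(), StandardCharsets.UTF_8);

        // Naive matching, for illustration only: a real handler parses the JSON event.
        int status = event.contains("\"path\":\"" + KNOWN_PATH + "\"") ? 200 : 404;
        out.write(("{\"statusCode\":" + status + "}").getBytes(StandardCharsets.UTF_8));
        out.close();
    }

    // Drives the handler with in-memory streams and returns the raw response.
    static String invoke(String eventJson) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new StaticInitDemoHandler().handleRequest(
                    new ByteArrayInputStream(eventJson.getBytes(StandardCharsets.UTF_8)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(invoke("{\"httpMethod\":\"GET\",\"path\":\"/ping\"}"));
        System.out.println(invoke("{\"httpMethod\":\"GET\",\"path\":\"/missing\"}"));
        // The static block ran once even though two handler instances were created.
        System.out.println("initializations: " + initCount);
    }
}
```

In the real handler the same reasoning applies to SparkLambdaContainerHandler: creating it in the static block means the cost is paid on the cold start, not on every request.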

In our sample application, the Spark routes are defined in a static method of the separate SparkResources class.

public static void defineResources() {
    before((request, response) -> response.type("application/json"));

    post("/pets", (req, res) -> {
        Pet newPet = LambdaContainerHandler.getObjectMapper().readValue(req.body(), Pet.class);
        if (newPet.getName() == null || newPet.getBreed() == null) {
            // Spark has no JAX-RS Response builder: set the status on res and return the body
            res.status(400);
            return new Error("Invalid name or breed");
        }

        Pet dbPet = newPet;
        dbPet.setId(UUID.randomUUID().toString());

        res.status(200);
        return dbPet;
    }, new JsonTransformer());
}
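
One possible way to keep the route logic easy to unit test is to move the validation and ID assignment into a plain method that the Spark lambda calls. The sketch below uses only the JDK. PetSketch, RouteResult and PetRoutesSketch are hypothetical names invented for this example, not classes from the sample application, and the route itself would still be responsible for calling res.status(...) and returning the body.

```java
import java.util.UUID;

// Hypothetical stand-in for the sample's Pet model.
class PetSketch {
    String id;
    String name;
    String breed;

    PetSketch(String name, String breed) {
        this.name = name;
        this.breed = breed;
    }
}

// Carries the HTTP status and the object a route would hand to its transformer.
class RouteResult {
    final int status;
    final Object body;

    RouteResult(int status, Object body) {
        this.status = status;
        this.body = body;
    }
}

class PetRoutesSketch {
    // Mirrors the POST /pets logic: reject incomplete pets, otherwise assign an ID.
    static RouteResult createPet(PetSketch newPet) {
        if (newPet.name == null || newPet.breed == null) {
            return new RouteResult(400, "Invalid name or breed");
        }
        newPet.id = UUID.randomUUID().toString();
        return new RouteResult(200, newPet);
    }

    public static void main(String[] args) {
        RouteResult rejected = createPet(new PetSketch("Rex", null));
        RouteResult created = createPet(new PetSketch("Rex", "Beagle"));
        System.out.println(rejected.status + " " + rejected.body);
        System.out.println(created.status + " id=" + ((PetSketch) created.body).id);
    }
}
```

With this split, the Spark lambda shrinks to parsing the body, calling createPet, copying the status onto res, and returning the body for the transformer to serialize.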

3. Packaging the application

By default, Spark includes an embedded Jetty web server with WebSocket support. Because the aws-serverless-java-container library takes the place of the web server, the Jetty WebSocket files are not needed in the deployment jar. You can configure the Maven Shade plugin to exclude them from the deployment package.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <configuration>
                <createDependencyReducedPom>false</createDependencyReducedPom>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>org.eclipse.jetty.websocket:*</exclude>
                            </excludes>
                        </artifactSet>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

4. Publish your Lambda function

You can follow the instructions in AWS Lambda's documentation on how to package your function for deployment.
