---
copyright:
  years: 2019
lastupdated: "2019-03-07"
---

{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:codeblock: .codeblock}
{:screen: .screen}
{:tip: .tip}
{:pre: .pre}
# Build a data lake using object storage
{: #smart-data-lake}
Definitions of the term *data lake* vary, but in the context of this tutorial, a data lake is an approach to storing data in its native format for organizational use. To that end, you will create a data lake for your organization using {{site.data.keyword.cos_short}}. By combining {{site.data.keyword.cos_short}} and SQL Query, data analysts can query data where it lies using SQL. You'll also leverage the SQL Query service in a Jupyter Notebook to conduct a simple analysis. When you're done, you'll allow non-technical users to discover their own insights using {{site.data.keyword.dynamdashbemb_notm}}.
{: shortdesc}
## Objectives

- Use {{site.data.keyword.cos_short}} to store raw data files
- Query data directly from {{site.data.keyword.cos_short}} using SQL Query
- Refine and analyze data in {{site.data.keyword.DSX_full}}
- Share data across your organization with {{site.data.keyword.dynamdashbemb_notm}}
## Services used

This tutorial uses the following services:

- {{site.data.keyword.cos_short}}
- SQL Query
- {{site.data.keyword.DSX}}
- {{site.data.keyword.dynamdashbemb_notm}}
## Architecture

1. Raw data is stored on {{site.data.keyword.cos_short}}.
2. Data is reduced, enhanced, or refined with SQL Query.
3. Data analysis occurs in {{site.data.keyword.DSX}}.
4. The line of business accesses a web application.
5. Refined data is pulled from {{site.data.keyword.cos_short}}.
6. Line-of-business charts are built using {{site.data.keyword.dynamdashbemb_notm}}.
## Before you begin

- Install Git
- Install {{site.data.keyword.Bluemix_notm}} CLI
- Install Aspera Connect
- Install Node.js and NPM
## Create services

In this section, you will create the services required to build your data lake.
This section uses the command line to create service instances. Alternatively, you may do the same from the service page in the catalog using the provided links.
{: tip}
- Log in to {{site.data.keyword.cloud_notm}} via the command line and target your Cloud Foundry account. See [CLI Getting Started](https://{DomainName}/docs/cli?topic=cloud-cli-ibmcloud-cli#overview).
  ```sh
  ibmcloud login
  ```
  {: pre}
  ```sh
  ibmcloud target --cf
  ```
  {: pre}
- Create an instance of {{site.data.keyword.cos_short}} with a Cloud Foundry alias. If you already have a service instance, run the `service-alias-create` command with the existing service name.
  ```sh
  ibmcloud resource service-instance-create data-lake-cos cloud-object-storage lite global
  ```
  {: pre}
  ```sh
  ibmcloud resource service-alias-create dashboard-nodejs-cos --instance-name data-lake-cos
  ```
  {: pre}
- Create an instance of SQL Query.
  ```sh
  ibmcloud resource service-instance-create data-lake-sql sql-query lite us-south
  ```
  {: pre}
- Create an instance of {{site.data.keyword.DSX}}.
  ```sh
  ibmcloud service create data-science-experience free-v1 data-lake-studio
  ```
  {: pre}
- Create an instance of {{site.data.keyword.dynamdashbemb_notm}} with a Cloud Foundry alias.
  ```sh
  ibmcloud resource service-instance-create data-lake-dde dynamic-dashboard-embedded lite us-south
  ```
  {: pre}
  ```sh
  ibmcloud resource service-alias-create dashboard-nodejs-dde --instance-name data-lake-dde
  ```
  {: pre}
- Change to a working directory and run the following commands to clone the dashboard application's GitHub repository and push the application to your Cloud Foundry organization. The application will automatically bind the required services from above using its manifest.yml file.
  ```sh
  git clone https://github.com/IBM-Cloud/nodejs-data-lake-dashboard.git
  ```
  {: pre}
  ```sh
  cd nodejs-data-lake-dashboard
  ```
  {: pre}
  ```sh
  npm install
  ```
  {: pre}
  ```sh
  npm run push
  ```
  {: pre}
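  The automatic binding works because the aliases created above match the service names listed in the repository's manifest.yml. The following is a hypothetical sketch of such a manifest; consult the actual manifest.yml in the cloned repository for the authoritative values.
  ```yaml
  # Sketch only: the app and service names mirror those used in this
  # tutorial; memory and other settings are assumptions.
  applications:
  - name: dashboard-nodejs
    memory: 256M
    services:
    - dashboard-nodejs-cos
    - dashboard-nodejs-dde
  ```
  {: codeblock}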
  After deployment, the application will be public and listening on a random hostname. You can go to the Resource List page and select the app under Cloud Foundry Apps to view the URL, or run `ibmcloud cf app dashboard-nodejs routes` to see its routes.
  {: tip}
- Confirm the application is active by accessing its public URL in your browser.
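If you prefer to script this check, here is a minimal sketch in Python. The URL is a placeholder; substitute the route reported by `ibmcloud cf app dashboard-nodejs routes`.

```python
import requests

# Hypothetical route - replace with your application's actual URL.
url = 'https://<your-app-host>.mybluemix.net'
resp = requests.get(url)
print(resp.status_code)  # 200 means the dashboard application is up
```
{: codeblock}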
## Uploading data

In this section, you will upload data to an {{site.data.keyword.cos_short}} bucket using built-in {{site.data.keyword.CHSTSshort}}. {{site.data.keyword.CHSTSshort}} protects data as it is uploaded to the bucket and can greatly reduce transfer time.
- Download the City of Los Angeles *Traffic Collision Data from 2010* CSV file. The file is 81 MB and may take a few minutes to download.
- In your browser, access the data-lake-cos service instance from the Resource List.
- Create a new bucket to store data.
  - Click the Create a bucket button.
  - Select Regional from the Resiliency drop-down.
  - Select us-south from the Location drop-down. {{site.data.keyword.CHSTSshort}} is only available for buckets created in the us-south location at this time. Alternatively, choose another location and use the Standard transfer type in the next section.
  - Provide a bucket Name and click Create. If you receive an AccessDenied error, try a more unique bucket name.
- Upload the CSV file to {{site.data.keyword.cos_short}}.
  - From your bucket, click the Add objects button.
  - Select the Aspera high-speed transfer radio button.
  - Click the Add files button. This will open the Aspera plugin, which will be in a separate window - possibly behind your browser window.
  - Browse to and select the previously downloaded CSV file.
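If you ever need to script the upload instead of using the console, the following is a minimal sketch using the ibm-cos-sdk Python package (`pip install ibm-cos-sdk`). The endpoint URL, bucket name, and credentials are placeholders to adapt to your account, and the SDK uses the standard S3 transfer path rather than Aspera.

```python
import ibm_boto3
from ibm_botocore.client import Config

# All values below are placeholders - substitute your own API key,
# service instance CRN, regional endpoint, and bucket name.
cos = ibm_boto3.client(
    's3',
    ibm_api_key_id='<API_KEY>',
    ibm_service_instance_id='<COS_INSTANCE_CRN>',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud'
)

# Upload the CSV downloaded earlier; upload_file transparently performs
# a multipart upload for large files.
cos.upload_file(
    Filename='Traffic_Collision_Data_from_2010_to_Present.csv',
    Bucket='<your-bucket-name>',
    Key='Traffic_Collision_Data_from_2010_to_Present.csv'
)
```
{: codeblock}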
## Refine data using SQL Query

In this section, you will convert the original, raw dataset into a targeted cohort based on time and age attributes. This is helpful to consumers of the data lake who have specific interests or would struggle with very large datasets.

You will use SQL Query to manipulate the data where it resides in {{site.data.keyword.cos_short}} using familiar SQL statements. SQL Query has built-in support for CSV, JSON, and Parquet - no additional computation services or extract-transform-load (ETL) jobs are necessary.
- Access the data-lake-sql SQL Query service instance from your Resource List.
- Select Open UI.
- Create a new dataset by executing SQL directly on the previously uploaded CSV file.
  - Enter the following SQL into the Type SQL here ... text area.
    ```sql
    SELECT
        `Dr Number` AS id,
        `Date Occurred` AS date,
        `Time Occurred` AS time,
        `Area Name` AS area,
        `Victim Age` AS age,
        `Victim Sex` AS sex,
        `Location` AS location
    FROM cos://us-south/<your-bucket-name>/Traffic_Collision_Data_from_2010_to_Present.csv
    WHERE
        `Time Occurred` >= 1700 AND `Time Occurred` <= 2000
        AND `Victim Age` >= 20 AND `Victim Age` <= 35
    ```
    {: codeblock}
  - Replace the URL in the `FROM` clause with your bucket's name.
  - The Target will auto-create a {{site.data.keyword.cos_short}} bucket to hold the result. Change the Target to `cos://us-south/<your-bucket-name>/results`.
  - Click the Run button. The results will appear below.
- On the Query Details tab, click the Launch icon next to the Result Location URL to view the intermediate dataset, which is now also stored on {{site.data.keyword.cos_short}}.
## Analyze data using a Jupyter Notebook

In this section, you will use the SQL Query client within a Jupyter Notebook. This reuses the data stored on {{site.data.keyword.cos_short}} within a data analysis tool. The combination also creates datasets that are automatically stored in {{site.data.keyword.cos_short}} and can then be used with {{site.data.keyword.dynamdashbemb_notm}}.
- Create a new Jupyter Notebook in {{site.data.keyword.DSX}}.
  - In a browser, open {{site.data.keyword.DSX}}.
  - Click the Create a Project tile followed by Data Science.
  - Click Create project and then provide a Project name.
  - Ensure Storage is set to data-lake-cos.
  - Click Create.
  - In the resulting project, click Add to project and then Notebook.
  - From the Blank tab, enter a Notebook name.
  - Leave the Language and Runtime defaults; click Create notebook.
- From the Notebook, install and import PixieDust and ibmcloudsql by adding the following commands to the In [ ]: input prompt and then Run.
  ```python
  !conda install pyarrow
  !conda install sqlparse
  !pip install --user ibmcloudsql
  import ibmcloudsql
  from pixiedust.display import *
  ```
  {: codeblock}
- Add a {{site.data.keyword.cos_short}} API key to the Notebook. This will allow SQL Query results to be stored in {{site.data.keyword.cos_short}}.
  - Add the following in the next In [ ]: prompt and then Run.
    ```python
    import getpass
    cloud_api_key = getpass.getpass('Enter your IBM Cloud API Key')
    ```
    {: codeblock}
  - From the terminal, create an API key.
    ```sh
    ibmcloud iam api-key-create data-lake-cos-key
    ```
    {: pre}
  - Copy the API Key to the clipboard.
  - Paste the API Key into the textbox in the Notebook and hit the `enter` key. You should also store the API Key in a secure, permanent location; the Notebook does not store the API key.
- Add the SQL Query instance's CRN (Cloud Resource Name) to the Notebook.
  - In the next In [ ]: prompt, assign the CRN to a variable in your Notebook.
    ```python
    sql_crn = '<SQL_QUERY_CRN>'
    ```
    {: codeblock}
  - From the terminal, copy the CRN from the ID property to your clipboard.
    ```sh
    ibmcloud resource service-instance data-lake-sql
    ```
    {: pre}
  - Paste the CRN between the single quotes and then Run.
- Add another variable to the Notebook to specify the {{site.data.keyword.cos_short}} bucket and Run.
  ```python
  sql_cos_endpoint = 'cos://us-south/<your-bucket-name>'
  ```
  {: codeblock}
- Execute the following commands in another In [ ]: prompt and Run to view the result set. A new `accidents/jobid=<id>/<part>.csv*` file containing the result of the `SELECT` is also added to your bucket.
  ```python
  sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint + '/accidents')

  data_source = sql_cos_endpoint + "/Traffic_Collision_Data_from_2010_to_Present.csv"

  query = """
  SELECT
      `Time Occurred` AS time,
      `Area Name` AS area,
      `Victim Age` AS age,
      `Victim Sex` AS sex,
      `Location` AS location
  FROM {}
  WHERE
      `Time Occurred` >= 1700 AND `Time Occurred` <= 2000
      AND `Victim Age` >= 20 AND `Victim Age` <= 35
  """.format(data_source)

  traffic_collisions = sqlClient.run_sql(query)
  traffic_collisions.head()
  ```
  {: codeblock}
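SQL Query can also persist a result in a different format. The following is a hedged sketch - assuming SQL Query's `INTO ... STORED AS PARQUET` clause and that `run_sql` accepts a statement carrying its own `INTO` clause - that writes the same cohort to your bucket as Parquet, a columnar format that is smaller and faster to re-query than CSV. The `accidents_parquet` prefix is a hypothetical name.

```python
# Sketch only: a client without a default target lets the statement's
# own INTO clause decide where and how the result is stored.
parquetClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn)
parquet_query = query + " INTO {}/accidents_parquet STORED AS PARQUET".format(sql_cos_endpoint)
traffic_parquet = parquetClient.run_sql(parquet_query)
traffic_parquet.head()
```
{: codeblock}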
## Visualize data using PixieDust

In this section, you will visualize the previous result set using PixieDust and Mapbox to better identify patterns or hot spots for traffic incidents.
- Create a common table expression to convert the `location` column to separate `latitude` and `longitude` columns. Run the following from the Notebook's prompt.
  ```python
  query = """
  WITH location AS (
      SELECT
          id,
          cast(split(coordinates, ',')[0] as float) as latitude,
          cast(split(coordinates, ',')[1] as float) as longitude
      FROM (SELECT
              `Dr Number` as id,
              regexp_replace(Location, '[()]', '') as coordinates
          FROM {0})
  )
  SELECT
      d.`Dr Number` as id,
      d.`Date Occurred` as date,
      d.`Time Occurred` AS time,
      d.`Area Name` AS area,
      d.`Victim Age` AS age,
      d.`Victim Sex` AS sex,
      l.latitude,
      l.longitude
  FROM {0} AS d
  JOIN location AS l ON l.id = d.`Dr Number`
  WHERE
      d.`Time Occurred` >= 1700 AND d.`Time Occurred` <= 2000
      AND d.`Victim Age` >= 20 AND d.`Victim Age` <= 35
      AND l.latitude != 0.0000
      AND l.longitude != 0.0000
  """.format(data_source)

  traffic_location = sqlClient.run_sql(query)
  traffic_location.head()
  ```
  {: codeblock}
- In the next In [ ]: prompt, Run the `display` command to view the result using PixieDust.
  ```python
  display(traffic_location)
  ```
  {: codeblock}
- Select the chart dropdown button; then select Map.
- Add `latitude` and `longitude` to Keys. Add `id` and `age` to Values. Click OK to view the map.
- Click the Save icon to save your Notebook to {{site.data.keyword.cos_short}}.
## Build a dashboard with {{site.data.keyword.dynamdashbemb_notm}}

Not every user of the data lake is a data scientist. You can allow non-technical users to gain insight from the data lake using {{site.data.keyword.dynamdashbemb_notm}}. Similar to SQL Query, {{site.data.keyword.dynamdashbemb_notm}} can read data directly from {{site.data.keyword.cos_short}} into pre-built dashboards. This section presents a solution that allows any user to access the data lake and build a custom dashboard.
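Behind the scenes, the dashboard application obtains a short-lived session code from the {{site.data.keyword.dynamdashbemb_notm}} REST API before rendering a dashboard. The following is a minimal sketch of that exchange - assuming your dashboard-nodejs-dde service credentials expose `api_endpoint_url`, `client_id`, and `client_secret`, and that the session endpoint matches the service documentation for your region.

```python
import requests

# Placeholder credentials - copy the real values from the service
# credentials of your dashboard-nodejs-dde alias.
api_endpoint = 'https://dde-us-south.analytics.ibm.com/daas/'  # assumed regional endpoint
client_id = '<CLIENT_ID>'
client_secret = '<CLIENT_SECRET>'

# Request a session code; the web domain must match the origin that
# embeds the dashboard.
resp = requests.post(
    api_endpoint + 'v1/session',
    auth=(client_id, client_secret),
    json={'expiresIn': 3600, 'webDomain': 'https://<your-app-host>'}
)
print(resp.json())  # expected to contain a sessionCode for the embedding client
```
{: codeblock}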
- Access the public URL of the dashboard application you pushed to {{site.data.keyword.Bluemix_notm}} previously.
- Select a template that matches your intended layout. (The following steps use the second layout in the first row.)
- Use the Add a source button that appears in the Selected sources side shelf, expand the bucket name accordion, and click one of the `accidents/jobid=...` table entries. Close the dialog using the X icon in the upper right.
- On the left, click the Visualizations icon and then click Summary.
- Select the `accidents/jobid=...` source, expand Table, and create a chart.
  - Drag and drop `id` onto the Value row.
  - Collapse the chart using the icon in the upper corner.
- Again from Visualizations, create a Tree map chart:
  - Drag and drop `area` onto the Area hierarchy row.
  - Drag and drop `id` onto the Size row.
  - Collapse the chart to view the result.
## Explore dashboard features

In this section, you'll take a few additional steps to explore the features of the dashboard application and {{site.data.keyword.dynamdashbemb_notm}}.
- Click the Mode button in the sample dashboard application's toolbar to change the mode to VIEW.
- Click any of the colored tiles in the lower chart or `area` values in the chart's legend. This applies a local filter to the tab, which causes the other chart(s) to show data specific to the filter.
- Click the Save button in the toolbar.
  - Enter your dashboard's name in the corresponding input field.
  - Select the Spec tab to view this dashboard's specification. A spec is the native file format for {{site.data.keyword.dynamdashbemb_notm}}. In it, you will find information about the charts you created as well as the {{site.data.keyword.cos_short}} data source used.
  - Save your dashboard to the browser's local storage using the dialog's Save button.
- Click the toolbar's New button to create a new dashboard. To open a saved dashboard, click the Open button. To delete a dashboard, use the Delete icon on the Open Dashboard dialog.
In production applications, encrypt information such as URLs, usernames, and passwords to prevent them from being seen by end users. See Encrypting data source information.
{: tip}
## Expand the tutorial

Congratulations, you have built a data lake using {{site.data.keyword.cos_short}}. Below are additional suggestions to enhance your data lake.
- Experiment with additional datasets using SQL Query
- Stream data from multiple sources into your data lake by completing Big data logs with streaming analytics and SQL
- Edit the dashboard application's code to store dashboard specifications to {{site.data.keyword.cloudant_short_notm}} or {{site.data.keyword.cos_short}}
- Create an {{site.data.keyword.appid_full_notm}} service instance to enable security in the dashboard application
## Remove services

Run the following commands to remove the services, application, and keys used in this tutorial.
```sh
ibmcloud resource service-binding-delete dashboard-nodejs-dde dashboard-nodejs
```
{: pre}
```sh
ibmcloud resource service-binding-delete dashboard-nodejs-cos dashboard-nodejs
```
{: pre}
```sh
ibmcloud resource service-alias-delete dashboard-nodejs-dde
```
{: pre}
```sh
ibmcloud resource service-alias-delete dashboard-nodejs-cos
```
{: pre}
```sh
ibmcloud iam api-key-delete data-lake-cos-key
```
{: pre}
```sh
ibmcloud resource service-instance-delete data-lake-dde
```
{: pre}
```sh
ibmcloud resource service-instance-delete data-lake-cos
```
{: pre}
```sh
ibmcloud resource service-instance-delete data-lake-sql
```
{: pre}
```sh
ibmcloud service delete data-lake-studio
```
{: pre}
```sh
ibmcloud app delete dashboard-nodejs
```
{: pre}