atqy · atqy · May 4, 2022 · May 4, 2022 · May 4, 2022 · May 5, 2022
diff --git a/end_to_end/music_recommendation/00_overview_arch_data.ipynb b/end_to_end/music_recommendation/00_overview_arch_data.ipynb
diff --git a/end_to_end/music_recommendation/01_data_exploration.ipynb b/end_to_end/music_recommendation/01_data_exploration.ipynb
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Music Recommender Data Exploration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "----\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "This notebook is part of a series that walks through the ML lifecycle and shows how to build a Music Recommender System using a combination of SageMaker services and features. This first notebook focuses on exploring the data. You can run it by itself or in sequence with the other notebooks listed below. See the [README.md](README.md) for more information about this use case and how it is implemented across the sequence of notebooks.\n",
+    "\n",
+    "1. [Music Recommender Data Exploration](01_data_exploration.ipynb) (current notebook)\n",
+    "1. [Music Recommender Data Preparation with SageMaker Feature Store and SageMaker Data Wrangler](02_export_feature_groups.ipynb)\n",
+    "1. [Train, Deploy, and Monitor the Music Recommender Model using SageMaker SDK](03_train_deploy_debugger_explain_monitor_registry.ipynb)\n",
+    "\n",
+    "----\n",
+    "\n",
+    "## Contents\n",
+    "1. [Prereqs: Get Data](#Prereqs:-Get-Data)\n",
+    "1. [Update the Data Source in the .flow File](#Update-the-Data-Source-in-the-.flow-File)\n",
+    "1. [Explore the Data](#Explore-the-Data)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import pprint\n",
+    "\n",
+    "sys.path.insert(1, \"./code\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# update pandas to avoid data type issues in older 1.0 version\n",
+    "!pip install pandas --upgrade --quiet\n",
+    "import pandas as pd\n",
+    "\n",
+    "print(pd.__version__)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create the data folder; -p avoids an error if it already exists\n",
+    "!mkdir -p data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "%matplotlib inline\n",
+    "\n",
+    "import json\n",
+    "import sagemaker\n",
+    "import boto3\n",
+    "import os\n",
+    "\n",
+    "# SageMaker session\n",
+    "sess = sagemaker.Session()\n",
+    "# get session bucket name\n",
+    "bucket = sess.default_bucket()\n",
+    "# bucket prefix or the subfolder for everything we produce\n",
+    "prefix = \"music-recommendation\"\n",
+    "# s3 client\n",
+    "s3_client = boto3.client(\"s3\")\n",
+    "\n",
+    "print(f\"this is your default SageMaker Studio bucket name: {bucket}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prereqs: Get Data \n",
+    "\n",
+    "----\n",
+    "\n",
+    "Here we download the music data used in this demo from a public S3 bucket and upload it to your default S3 bucket, which was created when you first set up your SageMaker Studio workspace."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from demo_helpers import get_data, get_model, update_data_sources"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# public S3 bucket that contains our music data\n",
+    "s3_bucket_music_data = \"s3://sagemaker-sample-files/datasets/tabular/synthetic-music\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "new_data_paths = get_data(\n",
+    "    s3_client,\n",
+    "    [f\"{s3_bucket_music_data}/tracks.csv\", f\"{s3_bucket_music_data}/ratings.csv\"],\n",
+    "    bucket,\n",
+    "    prefix,\n",
+    "    sample_data=0.70,\n",
+    ")\n",
+    "print(new_data_paths)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# these are the new file paths in your SageMaker Studio default S3 bucket\n",
+    "tracks_data_source = f\"s3://{bucket}/{prefix}/tracks.csv\"\n",
+    "ratings_data_source = f\"s3://{bucket}/{prefix}/ratings.csv\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Update the Data Source in the .flow File\n",
+    "\n",
+    "----\n",
+    "\n",
+    "The `01_music_dataprep.flow` file is a JSON file that tells Data Wrangler where to find your data sources and how to transform the data. We'll update the object that specifies the location of the input data on S3 and set it to your default S3 bucket. After this update, the `.flow` file points to your new S3 bucket as the data source used by SageMaker Data Wrangler.\n",
+    "\n",
+    "Make sure the `.flow` file is closed before running this next step, or the new S3 file locations won't be written to the file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "update_data_sources(\"01_music_dataprep.flow\", tracks_data_source, ratings_data_source)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explore the Data\n",
+    "\n",
+    "----"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tracks = pd.read_csv(\"./data/tracks.csv\")\n",
+    "ratings = pd.read_csv(\"./data/ratings.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tracks.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ratings.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"{:,} different songs/tracks\".format(tracks[\"trackId\"].nunique()))\n",
+    "print(\"{:,} users\".format(ratings[\"userId\"].nunique()))\n",
+    "print(\"{:,} user rating events\".format(ratings[\"ratingEventId\"].nunique()))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tracks.groupby(\"genre\")[\"genre\"].count().plot.bar(title=\"Tracks by Genre\");"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# count rating events per user, then plot the distribution of those counts\n",
+    "ratings.groupby(\"userId\")[\"ratingEventId\"].count().plot.hist(\n",
+    "    bins=50, title=\"Distribution of # of Ratings by User\"\n",
+    ");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a small sample of new data to ingest later."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tracks_new = tracks[:300]\n",
+    "ratings_new = ratings[:1000]\n",
+    "\n",
+    "# export dataframes to csv\n",
+    "tracks_new.to_csv(\"./data/tracks_new.csv\", index=False)\n",
+    "ratings_new.to_csv(\"./data/ratings_new.csv\", index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "s3_client.upload_file(\n",
+    "    Filename=\"./data/tracks_new.csv\", Bucket=bucket, Key=f\"{prefix}/data/tracks_new.csv\"\n",
+    ")\n",
+    "s3_client.upload_file(\n",
+    "    Filename=\"./data/ratings_new.csv\", Bucket=bucket, Key=f\"{prefix}/data/ratings_new.csv\"\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "conda_python3",
+   "language": "python",
+   "name": "conda_python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
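
The "Update the Data Source in the .flow File" step in `01_data_exploration.ipynb` relies on `update_data_sources` from `demo_helpers`, whose implementation is not part of this diff. As a rough illustration of what such a step might do, here is a minimal sketch that rewrites the S3 URIs of source nodes in a flow document. The function name `update_flow_sources` is hypothetical, and the JSON layout it assumes (`nodes`, `type == "SOURCE"`, `parameters.dataset_definition.name`, `s3ExecutionContext.s3Uri`) is an assumption about the `.flow` format rather than something confirmed here.

```python
import json


def update_flow_sources(flow: dict, new_uris: dict) -> dict:
    """Return a copy of `flow` with the S3 URIs of matching source nodes replaced.

    `new_uris` maps a dataset name (e.g. "tracks.csv") to its new S3 URI.
    The key names used below are assumptions about the .flow layout.
    """
    # Deep-copy through JSON so the caller's dictionary is left untouched.
    updated = json.loads(json.dumps(flow))
    for node in updated.get("nodes", []):
        if node.get("type") != "SOURCE":
            continue  # only source nodes carry an input location
        definition = node.get("parameters", {}).get("dataset_definition", {})
        name = definition.get("name")
        if name in new_uris:
            definition.setdefault("s3ExecutionContext", {})["s3Uri"] = new_uris[name]
    return updated


# Hypothetical, heavily simplified flow document for illustration only.
example_flow = {
    "nodes": [
        {
            "type": "SOURCE",
            "parameters": {
                "dataset_definition": {
                    "name": "tracks.csv",
                    "s3ExecutionContext": {"s3Uri": "s3://old-bucket/tracks.csv"},
                }
            },
        },
        {"type": "TRANSFORM", "parameters": {}},
    ]
}

result = update_flow_sources(
    example_flow,
    {"tracks.csv": "s3://my-bucket/music-recommendation/tracks.csv"},
)
print(json.dumps(result["nodes"][0]["parameters"]["dataset_definition"], indent=2))
```

Returning a modified copy rather than editing in place keeps the original document available for comparison. Writing the result back to disk would be a separate `json.dump` call, which is why the notebook asks for the `.flow` file to be closed in the editor first.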