Atqy/refactor music recommendation #10

Open
wants to merge 28 commits into
base: main
419 changes: 0 additions & 419 deletions end_to_end/music_recommendation/00_overview_arch_data.ipynb

This file was deleted.

301 changes: 301 additions & 0 deletions end_to_end/music_recommendation/01_data_exploration.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,301 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Music Recommender Data Exploration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"## Background\n",
"\n",
"This notebook is part of a series that walks through the ML lifecycle and shows how to build a Music Recommender System using a combination of SageMaker services and features. This notebook, the first in the series, focuses on exploring the data. You can run it by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case and this sequence of notebooks.\n",
"\n",
"1. [Music Recommender Data Exploration](01_data_exploration.ipynb) (current notebook)\n",
"1. [Music Recommender Data Preparation with SageMaker Feature Store and SageMaker Data Wrangler](02_export_feature_groups.ipynb)\n",
"1. [Train, Deploy, and Monitor the Music Recommender Model using SageMaker SDK](03_train_deploy_debugger_explain_monitor_registry.ipynb)\n",
"\n",
"----\n",
"\n",
"## Contents\n",
"1. [Prereqs: Get Data](#Prereqs:-Get-Data)\n",
"1. [Update the Data Source in the .flow File](#Update-the-Data-Source-in-the-.flow-File)\n",
"1. [Explore the Data](#Explore-the-Data)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import pprint\n",
"\n",
"sys.path.insert(1, \"./code\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update pandas to avoid data type issues in older 1.0 version\n",
"!pip install pandas --upgrade --quiet\n",
"import pandas as pd\n",
"\n",
"print(pd.__version__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create data folder\n",
"!mkdir -p data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline\n",
"\n",
"import json\n",
"import sagemaker\n",
"import boto3\n",
"import os\n",
"\n",
"# SageMaker session\n",
"sess = sagemaker.Session()\n",
"# get session bucket name\n",
"bucket = sess.default_bucket()\n",
"# bucket prefix or the subfolder for everything we produce\n",
"prefix = \"music-recommendation\"\n",
"# s3 client\n",
"s3_client = boto3.client(\"s3\")\n",
"\n",
"print(f\"this is your default SageMaker Studio bucket name: {bucket}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prereqs: Get Data \n",
"\n",
"----\n",
"\n",
"Here we download the music data for this demo from a public S3 bucket and upload it to your default S3 bucket, which was created for you when you first set up a SageMaker Studio workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from demo_helpers import get_data, get_model, update_data_sources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# public S3 bucket that contains our music data\n",
"s3_bucket_music_data = \"s3://sagemaker-sample-files/datasets/tabular/synthetic-music\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_data_paths = get_data(\n",
" s3_client,\n",
" [f\"{s3_bucket_music_data}/tracks.csv\", f\"{s3_bucket_music_data}/ratings.csv\"],\n",
" bucket,\n",
" prefix,\n",
" sample_data=0.70,\n",
")\n",
"print(new_data_paths)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# these are the new file paths located on your SageMaker Studio default s3 storage bucket\n",
"tracks_data_source = f\"s3://{bucket}/{prefix}/tracks.csv\"\n",
"ratings_data_source = f\"s3://{bucket}/{prefix}/ratings.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update the Data Source in the .flow File\n",
"\n",
"----\n",
"\n",
"The `01_music_dataprep.flow` file is a JSON file containing instructions for where to find your data sources and how to transform the data. We'll update the object that tells Data Wrangler where to find the input data on S3, setting it to your default S3 bucket. After this update, the `.flow` file points to your S3 bucket as the data source used by SageMaker Data Wrangler.\n",
"\n",
"Make sure the `.flow` file is closed before running this next step; otherwise the new S3 file locations won't be written to the file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"update_data_sources(\"01_music_dataprep.flow\", tracks_data_source, ratings_data_source)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the Data\n",
"\n",
"----"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks = pd.read_csv(\"./data/tracks.csv\")\n",
"ratings = pd.read_csv(\"./data/ratings.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"{:,} different songs/tracks\".format(tracks[\"trackId\"].nunique()))\n",
"print(\"{:,} users\".format(ratings[\"userId\"].nunique()))\n",
"print(\"{:,} user rating events\".format(ratings[\"ratingEventId\"].nunique()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks.groupby(\"genre\")[\"genre\"].count().plot.bar(title=\"Tracks by Genre\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# count rating events per user, then plot how those counts are distributed\n",
"ratings.groupby(\"userId\")[\"ratingEventId\"].count().plot.hist(\n",
"    bins=50, title=\"Distribution of # of Ratings by User\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create small sample files (the first 300 tracks and the first 1,000 ratings) to ingest later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks_new = tracks[:300]\n",
"ratings_new = ratings[:1000]\n",
"\n",
"# export dataframes to csv\n",
"tracks_new.to_csv(\"./data/tracks_new.csv\", index=False)\n",
"ratings_new.to_csv(\"./data/ratings_new.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3_client.upload_file(\n",
" Filename=\"./data/tracks_new.csv\", Bucket=bucket, Key=f\"{prefix}/data/tracks_new.csv\"\n",
")\n",
"s3_client.upload_file(\n",
" Filename=\"./data/ratings_new.csv\", Bucket=bucket, Key=f\"{prefix}/data/ratings_new.csv\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
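
A note on the "Distribution of # of Ratings by User" cell: that title describes the number of rating events per user, which is a per-user count and not a histogram of the raw ID columns. The snippet below is a minimal sketch of that count, assuming `ratingEventId` identifies one rating event and `userId` identifies the user, as the notebook's `nunique()` prints suggest. The six-row frame is synthetic stand-in data and not taken from `ratings.csv`.

```python
import pandas as pd

# Synthetic stand-in for ratings.csv; only the two column names come from the notebook.
ratings = pd.DataFrame(
    {
        "ratingEventId": ["e1", "e2", "e3", "e4", "e5", "e6"],
        "userId": ["u1", "u1", "u1", "u2", "u2", "u3"],
    }
)

# One entry per user: how many rating events that user produced.
ratings_per_user = ratings.groupby("userId")["ratingEventId"].count()

# How many users fall at each rating count (the quantity a histogram would bin).
users_per_count = ratings_per_user.value_counts().sort_index()

print(ratings_per_user.to_dict())
print(users_per_count.to_dict())
```

In the notebook itself, calling `ratings_per_user.plot.hist(bins=50)` on the real data would draw this distribution as a single chart.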