Data Preparation with SageMaker Data Wrangler (Part 1)
aws
ml
sagemaker
A detailed guide on AWS SageMaker Data Wrangler to prepare data for machine learning models. This is a five-part series in which we will prepare, import, explore, process, and export data using AWS Data Wrangler. You are reading Part 1: Prepare synthetic data and place it on multiple sources.
Published
May 17, 2022
Environment
This notebook is prepared with Amazon SageMaker Studio using the Python 3 (Data Science) kernel and an ml.t3.medium instance.
About
This is a detailed guide on using the AWS SageMaker Data Wrangler service to prepare data for machine learning models. SageMaker Data Wrangler is a multipurpose tool with which you can:
* import data from multiple sources
* explore data with visualizations
* apply transformations
* export data for ML training
This guide draws on the book Getting Started with Amazon SageMaker Studio by Michael Hsieh. Michael Hsieh is a senior AI/machine learning (ML) solutions architect at Amazon Web Services. He creates and evangelizes ML solutions centered around Amazon SageMaker, and he works with enterprise customers to advance their ML journeys.
Part 1: Prepare synthetic data and place it on multiple sources
Let's prepare a dataset and place it in an S3 bucket and in AWS Glue tables. Then we will use Data Wrangler to pull and join data from these two sources. The idea is to simulate a common real-project challenge where data does not come from a single source but is distributed across multiple stores, in different formats. It is usually the job of the preprocessing pipeline to fetch data from these sources, join it, and preprocess it.
Data
Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes. After all, predicting the future is a tricky business! But we’ll learn how to deal with prediction errors.
### install aws data wrangler package
# restart kernel after installation
# more on this package later in the notebook
!pip install -q awswrangler
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 22.0.4; however, version 22.1 is available.
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.
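The dataset comes from the public SageMaker sample-files bucket. The download cell itself is not shown here, but a command consistent with the output below is:

```python
### download the synthetic churn dataset to the local working directory
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt ./
```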
download: s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt to ./churn.txt
import pandas as pd

df = pd.read_csv("./churn.txt")

# make 'CustomerID' column from the index
df['CustomerID'] = df.index

pd.set_option("display.max_columns", 500)
df.head(10)
|   | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? | CustomerID |
|---|-------|----------------|-----------|-------|------------|------------|---------------|----------|-----------|------------|----------|-----------|------------|------------|-------------|--------------|-----------|------------|-------------|----------------|--------|------------|
| 0 | PA | 163 | 806 | 403-2562 | no | yes | 300 | 8.162204 | 3 | 7.579174 | 3.933035 | 4 | 6.508639 | 4.065759 | 100 | 5.111624 | 4.928160 | 6 | 5.673203 | 3 | True. | 0 |
| 1 | SC | 15 | 836 | 158-8416 | yes | no | 0 | 10.018993 | 4 | 4.226289 | 2.325005 | 0 | 9.972592 | 7.141040 | 200 | 6.436188 | 3.221748 | 6 | 2.559749 | 8 | False. | 1 |
| 2 | MO | 131 | 777 | 896-6253 | no | yes | 300 | 4.708490 | 3 | 4.768160 | 4.537466 | 3 | 4.566715 | 5.363235 | 100 | 5.142451 | 7.139023 | 2 | 6.254157 | 4 | False. | 2 |
| 3 | WY | 75 | 878 | 817-5729 | yes | yes | 700 | 1.268734 | 3 | 2.567642 | 2.528748 | 5 | 2.333624 | 3.773586 | 450 | 3.814413 | 2.245779 | 6 | 1.080692 | 6 | False. | 3 |
| 4 | WY | 146 | 878 | 450-4942 | yes | no | 0 | 2.696177 | 3 | 5.908916 | 6.015337 | 3 | 3.670408 | 3.751673 | 250 | 2.796812 | 6.905545 | 4 | 7.134343 | 6 | True. | 4 |
| 5 | VA | 83 | 866 | 454-9110 | no | no | 0 | 3.634776 | 7 | 4.804892 | 6.051944 | 5 | 5.278437 | 2.937880 | 300 | 4.817958 | 4.948816 | 4 | 5.135323 | 5 | False. | 5 |
| 6 | IN | 140 | 737 | 331-5751 | yes | no | 0 | 3.229420 | 4 | 3.165082 | 2.440153 | 8 | 0.264543 | 2.352274 | 300 | 3.869176 | 5.393439 | 4 | 1.784765 | 4 | False. | 6 |
| 7 | LA | 54 | 766 | 871-3612 | no | no | 0 | 0.567920 | 6 | 1.950098 | 4.507027 | 0 | 4.473086 | 0.688785 | 400 | 6.132137 | 5.012747 | 5 | 0.417421 | 8 | False. | 7 |
| 8 | MO | 195 | 777 | 249-5723 | yes | no | 0 | 5.811116 | 6 | 4.331065 | 8.104126 | 2 | 4.475034 | 4.208352 | 250 | 5.974575 | 4.750153 | 7 | 3.320311 | 7 | True. | 8 |
| 9 | AL | 104 | 657 | 767-7682 | yes | no | 0 | 2.714430 | 7 | 5.138669 | 8.529944 | 6 | 3.321121 | 2.342177 | 300 | 4.328966 | 3.433554 | 5 | 5.677058 | 4 | False. | 9 |
df.shape
(5000, 22)
By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:
State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
Account Length: the number of days that this account has been active
Area Code: the three-digit area code of the corresponding customer’s phone number
Phone: the remaining seven-digit phone number
Int’l Plan: whether the customer has an international calling plan: yes/no
VMail Plan: whether the customer has a voice mail feature: yes/no
VMail Message: the average number of voice mail messages per month
Day Mins: the total number of calling minutes used during the day
Day Calls: the total number of calls placed during the day
Day Charge: the billed cost of daytime calls
Eve Mins, Eve Calls, Eve Charge: the minutes, the number of calls, and the billed cost for calls placed during the evening
Night Mins, Night Calls, Night Charge: the minutes, the number of calls, and the billed cost for calls placed during the night
Intl Mins, Intl Calls, Intl Charge: the minutes, the number of calls, and the billed cost for international calls
CustServ Calls: the number of calls placed to Customer Service
Churn?: whether the customer left the service: true/false
The last attribute, Churn?, is known as the target attribute: the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.
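As a quick optional check (not part of the original cells), you can look at how balanced the target attribute is before splitting the data:

```python
# distribution of the binary target attribute
df['Churn?'].value_counts()
```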
We have our dataset. Now we will split it into three subsets (the column lists used for the split are defined in the code below):
* customer: customer data, placed as a CSV file in the S3 bucket
* account: account data, placed as a CSV file in the same S3 bucket
* utility: utility data, placed as a Glue table
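The exact column groupings are not spelled out in the extracted cells, so the lists below are an assumption: one plausible split that keeps CustomerID in every subset so the three sources can be joined back together later with Data Wrangler.

```python
### define which columns go into each subset
# NOTE: this grouping is an assumption; adjust it to your needs
customer_columns = ['CustomerID', 'State', 'Area Code', 'Phone']
account_columns = ['CustomerID', 'Account Length', "Int'l Plan", 'VMail Plan', 'Churn?']
utility_columns = ['CustomerID', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge',
                   'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls',
                   'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls']
```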
We will use the default bucket associated with our SageMaker session. You may use any other bucket with proper access permissions.
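The code that follows assumes a SageMaker session object `sess` and an S3 key prefix `prefix`; neither is defined in the extracted cells, so this is a minimal setup sketch (the prefix value matches the output paths shown later).

```python
import sagemaker

# SageMaker session; its default bucket is used as the storage location
sess = sagemaker.Session()

# S3 key prefix under which the data files will be written
prefix = 'myblog/demo-customer-churn'
```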
bucket = sess.default_bucket()
bucket
'sagemaker-us-east-1-801598032724'
Next, we will use the AWS Data Wrangler Python package (awswrangler) to create an AWS Glue database.
awswrangler is an open-source Python library maintained by the AWS team, and is described as
An AWS Professional Service open source python initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services. Easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
You may read more about this library here:
* Documentation: https://aws-data-wrangler.readthedocs.io/en/stable/what.html
* GitHub repo: https://github.com/awslabs/aws-data-wrangler
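As a quick illustration of its pandas-like API (not part of the original notebook; the bucket name below is a placeholder), writing a CSV to S3 and reading it back looks like this:

```python
import awswrangler as wr
import pandas as pd

# write a small DataFrame to S3 as CSV, then read it back
df_demo = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
wr.s3.to_csv(df=df_demo, path='s3://my-bucket/demo/demo.csv', index=False)
df_back = wr.s3.read_csv('s3://my-bucket/demo/demo.csv')
```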
Please note that the SageMaker execution role needs some additional AWS Glue permissions to create a database. If you get an error while creating the Glue database in the following steps, add those permissions.
Error: AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetDatabase operation: User: arn:aws:sts::801598032724:assumed-role/AmazonSageMaker-ExecutionRole-20220516T161743/SageMaker is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:us-east-1:801598032724:database/telco_db because no identity-based policy allows the glue:GetDatabase action
Fix: Go to your SageMaker execution role and attach the AWSGlueConsoleFullAccess policy.
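If you prefer not to use the console, the same managed policy can be attached with the AWS CLI (the role name below is taken from the error message above; you need credentials that are allowed to modify IAM):

```python
# attach the AWSGlueConsoleFullAccess managed policy to the SageMaker execution role
!aws iam attach-role-policy --role-name AmazonSageMaker-ExecutionRole-20220516T161743 --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
```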
### define the Glue DB name
db_name = 'telco_db'
import awswrangler as wr

# get the list of existing Glue databases
databases = wr.catalog.databases()

# print existing db names
print("*** existing databases ***\n")
print(databases)

# if our db does not exist then create it
if db_name not in databases.values:
    wr.catalog.create_database(db_name, description='Demo DB for telco churn dataset')
    print("\n*** existing + new databases ***\n")
    print(wr.catalog.databases())
else:
    print(f"Database {db_name} already exists")
*** existing databases ***

                  Database                      Description
0  sagemaker_data_wrangler
1     sagemaker_processing

*** existing + new databases ***

                  Database                      Description
0  sagemaker_data_wrangler
1     sagemaker_processing
2                 telco_db  Demo DB for telco churn dataset
### in case you want to delete a database using this notebook
# wr.catalog.delete_database(db_name)
Similarly, you can go to the AWS Glue console to see that the new database has been created.
Now we will place the three data subsets into their respective locations.
suffix = ['customer_info', 'account_info', 'utility']

for i, columns in enumerate([customer_columns, account_columns, utility_columns]):
    # get the data subset
    df_tmp = df[columns]

    # prepare filename and output path
    fname = 'telco_churn_%s' % suffix[i]
    outputpath = f's3://{bucket}/{prefix}/data/{fname}'

    print(f"\n*** working on {suffix[i]}***")
    print(f"filename: {fname}")
    print(f"output path: {outputpath}")

    if i > 1:  # for utility
        wr.s3.to_csv(
            df=df_tmp,
            path=outputpath,
            dataset=True,
            database=db_name,  # Athena/Glue database
            table=fname,       # Athena/Glue table
            index=False,
            mode='overwrite')
    else:  # for customer and account
        wr.s3.to_csv(
            df=df_tmp,
            path=f'{outputpath}.csv',
            index=False)
*** working on customer_info***
filename: telco_churn_customer_info
output path: s3://sagemaker-us-east-1-801598032724/myblog/demo-customer-churn/data/telco_churn_customer_info
*** working on account_info***
filename: telco_churn_account_info
output path: s3://sagemaker-us-east-1-801598032724/myblog/demo-customer-churn/data/telco_churn_account_info
*** working on utility***
filename: telco_churn_utility
output path: s3://sagemaker-us-east-1-801598032724/myblog/demo-customer-churn/data/telco_churn_utility
We can verify the uploaded data from the S3 bucket.
Similarly, from the Glue console we can verify that the utility table has been created.
If you prefer to remain within the notebook, you can do the verification there as well.
### list s3 objects
wr.s3.list_objects('s3://sagemaker-us-east-1-801598032724/myblog/demo-customer-churn/data/')
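You can also confirm from the notebook that the utility table was registered in the Glue catalog, for example:

```python
### list the tables registered in the demo Glue database
wr.catalog.tables(database=db_name)
```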