Data Preparation with SageMaker Data Wrangler (Part 2)

aws
ml
sagemaker
A detailed guide on AWS SageMaker Data Wrangler to prepare data for machine learning models. This is a five-part series where we will prepare, import, explore, process, and export data using AWS Data Wrangler. You are reading Part 2: Import data from multiple sources using Data Wrangler.
Published

May 23, 2022

Environment

This notebook is prepared with Amazon SageMaker Studio using the Python 3 (Data Science) kernel and an ml.t3.medium instance.

About

This is a detailed guide on using the AWS SageMaker Data Wrangler service to prepare data for machine learning models. SageMaker Data Wrangler is a multipurpose tool with which you can:

  • import data from multiple sources
  • explore data with visualizations
  • apply transformations
  • export data for ML training

This guide is divided into five parts:

  • Part 1: Prepare synthetic data and place it on multiple sources
  • Part 2: Import data from multiple sources using Data Wrangler (You are here)
  • Part 3: Explore data with Data Wrangler visualizations
  • Part 4: Preprocess data using Data Wrangler
  • Part 5: Export data for ML training

Part 2: Import data from multiple sources using Data Wrangler

In this post, we will create a SageMaker Data Wrangler flow pipeline to import data from multiple sources. Once the data is imported, we will add a step to join it into a single dataset that can be used for training ML models.

Launch SageMaker Data Wrangler Flow

Create a new Data Wrangler flow from the main menu by clicking File > New > Data Wrangler Flow.

data-wrangler-new-flow

Once launched, SageMaker may take a minute to initialize the new flow. This is because SageMaker launches a separate machine in the background, an ml.m5.4xlarge with 16 vCPU and 64 GiB memory, to process flow files. A flow file is a JSON file that simply captures all the steps performed from the Flow UI console. When you execute the flow, the Flow engine parses this file and performs those steps. Once the new flow file is available, rename it to customer-churn.flow.
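If you are curious what the flow file captures, you can open it like any other JSON document from a Studio notebook. This is a minimal sketch; the exact schema may vary across Data Wrangler versions, but the steps are typically stored under a top-level nodes list.

import json

# Peek inside the flow file to see the steps captured from the Flow UI.
# The schema may differ between Data Wrangler versions.
with open("customer-churn.flow") as f:
    flow = json.load(f)

for node in flow.get("nodes", []):
    print(node.get("type"), "-", node.get("operator"))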

data-wrangler-flow-ready

Import data from sources

First, we will create a flow to import the data (created in the Part 1 post) from the S3 bucket. From the flow UI, click on the Amazon S3 option. In the next window, select the bucket name s3://sagemaker-us-east-1-801598032724 (in your case it will be different, depending on where you stored the data). Then select the file “telco_churn_customer_info.csv” and click Import.

customer-churn-s3

Once the data is imported, repeat the steps for the file “telco_churn_account_info.csv”. If you do not see the import from S3 option, switch to the ‘Import’ tab in the flow UI. Once both files are imported, your Data Flow tab will look similar to this:

data-flow-customer-account.png
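As a quick sanity check outside the flow, you can also read the same files directly from S3 in a Studio notebook. This is a minimal sketch, assuming the bucket name shown above and that the CSVs sit at the bucket root; pandas can read s3:// paths when the s3fs package is available.

import pandas as pd

# Replace with your own bucket (and prefix, if you used one in Part 1).
bucket = "sagemaker-us-east-1-801598032724"

customer_info = pd.read_csv(f"s3://{bucket}/telco_churn_customer_info.csv")
account_info = pd.read_csv(f"s3://{bucket}/telco_churn_account_info.csv")

print(customer_info.shape, account_info.shape)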

Now that we have imported data from S3, we can work on importing data from the Athena database. From the Flow UI Import tab, click on the Amazon Athena option. In the next screen, select AwsDataCatalog as the data catalog, select telco_db from the Databases drop-down, and write the query below in the query pane.

select * from telco_churn_utility

You can also preview the data by clicking on the table preview option. Once satisfied with the results, click ‘Import’. When asked for a name for the dataset, enter telco_churn_utility.

import-athena-table.png
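To spot-check the Athena table outside the flow, the AWS SDK for pandas (awswrangler) can run the same query from a Studio notebook. This is a minimal sketch, assuming the telco_db database from Part 1 and that your Studio role has Athena query permissions; you may need to pip install awswrangler first.

import awswrangler as wr

# Run the same query as the flow to preview the utility table.
utility = wr.athena.read_sql_query(
    sql="SELECT * FROM telco_churn_utility",
    database="telco_db",
)
print(utility.head())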

At this point, you will find all three tables imported in the Data Flow UI. Next to each table, a plus sign (+) will appear that you can use to add any transformations you want to apply to that table.

all-tables-imported.png

For telco_churn_customer_info, click on the plus sign and then select ‘Edit’ to change the data types.

edit_customer_info.png

We will add the following transformations:

  • Change Area Code from Long to String
  • Click Preview
  • Then click Apply

telco_churn_customer_info_edit.png

Similarly, for telco_churn_account_info.csv edit the data types as:

  • Change Account Length to Long
  • Change Int’l Plan and VMail Plan to Bool
  • Click Preview and then click Apply

For telco_churn_utility.csv, edit the data types as:

  • Change custserv_calls to Long
  • Click Preview and then click Apply

At this point, we have imported the data from all three sources and have also properly transformed their column types.
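For reference, these type edits correspond roughly to the following pandas conversions, continuing from the earlier sketches. In the flow itself the changes are applied through the Edit UI, and the yes/no mapping below is only an assumption about how the plan columns are stored in the synthetic data.

# Rough pandas equivalents of the Data Wrangler type edits above.
customer_info["Area Code"] = customer_info["Area Code"].astype(str)

account_info["Account Length"] = account_info["Account Length"].astype("int64")
# Assumes the plan columns hold yes/no strings; adjust if your data differs.
for col in ["Int'l Plan", "VMail Plan"]:
    account_info[col] = account_info[col].map({"yes": True, "no": False})

utility["custserv_calls"] = utility["custserv_calls"].astype("int64")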

Joining Tables

Now we will join all three tables to get the full dataset. From the Data flow tab in the Flow UI, click on the plus sign next to the customer_info data types step and this time select ‘Join’. From the new window, select account_info as the right dataset and click Configure.

join-configure.png

From the next screen select:

  • Join Type = Full Outer
  • Columns Left = CustomerID
  • Columns Right = CustomerID
  • Click Preview and then Add

join-preview.png

A new join step will appear in the Data Flow UI. Click on the plus sign next to it and repeat the steps for the utility table:

first-join.png

  • Join Type = Full Outer
  • Columns Left = CustomerID_0
  • Columns Right = CustomerID
  • Click Preview and then Add

join-second.png
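For reference, the two join steps correspond roughly to the following pandas merges, continuing from the earlier sketches. Data Wrangler keeps both key columns and suffixes the duplicate (which is why the second join uses CustomerID_0), whereas pandas keeps a single key column when merging on a shared name.

# Rough pandas equivalent of the two full-outer joins configured in the flow.
joined = customer_info.merge(account_info, on="CustomerID", how="outer")
full_dataset = joined.merge(utility, on="CustomerID", how="outer")

print(full_dataset.shape)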

Summary

At this point, we have all three tables joined together. The customer-churn.flow file created in this post is available on GitHub here. In the next post, we will remove duplicate columns and create some visualizations to analyze the data.