Build, tune, and deploy an end-to-end churn prediction model using Amazon SageMaker Pipelines

Republished by Plato

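The ability to predict that a particular customer is at high risk of churning, while there is still time to do something about it, represents a huge potential revenue source for every online business. Depending on the industry and business objective, the problem statement can be multi-layered. The following are some business objectives based on this strategy: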

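This post discusses how you can orchestrate an end-to-end churn prediction model across each step: data preparation, experimenting with a baseline model and hyperparameter optimization (HPO), training and tuning, and registering the best model. You can manage your Amazon SageMaker training and inference workflows using Amazon SageMaker Studio and the SageMaker Python SDK. SageMaker offers all the tools you need to create high-quality data science solutions.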

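SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.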

Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10 times.

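Amazon SageMaker Pipelines is a tool for building ML pipelines that takes advantage of direct SageMaker integration. With Pipelines, you can automate the steps of building an ML model, catalog models in the model registry, and use one of several templates provided in SageMaker Projects to set up continuous integration and continuous delivery (CI/CD) for the end-to-end ML lifecycle at scale.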

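After the model is trained, you can use Amazon SageMaker Clarify to identify and limit bias and to explain predictions to business stakeholders. You can share these automated reports with the business and technical teams for downstream target campaigns, or use them to determine which features are key differentiators for customer lifetime value.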

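By the end of this post, you should have enough information to use this end-to-end template with Pipelines to train, tune, and deploy your own predictive analytics use case. The full instructions are available in the GitHub repo.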

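In this solution, your entry point is the Studio integrated development environment (IDE) for rapid experimentation. Studio offers an environment to manage the end-to-end Pipelines experience. With Studio, you can bypass the AWS Management Console for your entire workflow management. For more information on managing Pipelines from Studio, see View, Track, and Execute SageMaker Pipelines in SageMaker Studio.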

The following diagram illustrates the high-level architecture of the data science workflow.

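After you create the Studio domain, select your user name and choose Open Studio. A web-based IDE opens that lets you store and collect everything you need, whether it's code, notebooks, datasets, settings, or project folders.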

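Pipelines is integrated directly with SageMaker, so you don't need to interact with any other AWS services. You also don't need to manage any resources, because Pipelines is a fully managed service that creates and manages resources for you. For more information about the various SageMaker components, which are available both as standalone Python APIs and as integrated components of Studio, see the SageMaker service page.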

For this use case, you use the following components for the fully automated model development process:

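A SageMaker pipeline is a series of interconnected steps defined by a JSON pipeline definition. This pipeline definition encodes the pipeline as a directed acyclic graph (DAG). The DAG gives information on the requirements for and relationships between each step of your pipeline. The structure of a pipeline's DAG is determined by the data dependencies between steps. These data dependencies are created when the properties of one step's output are passed as the input to another step.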
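As a rough illustration of how data dependencies imply a run order, the following sketch uses only the Python standard library. Four of the step names echo those used later in this post; the evaluation step name and the dependency map itself are simplified assumptions made for illustration, not the pipeline definition that SageMaker generates.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified dependency map: each step lists the steps whose
# output properties it consumes. "ChurnEvalBestModel" is an assumed name.
dependencies = {
    "ChurnModelProcess": set(),
    "ChurnHyperParameterTuning": {"ChurnModelProcess"},
    "ChurnEvalBestModel": {"ChurnModelProcess", "ChurnHyperParameterTuning"},
    "RegisterChurnModel": {"ChurnEvalBestModel"},
    "ChurnTransform": {"ChurnEvalBestModel"},
}


def execution_order(deps):
    """Return one valid run order; graphlib raises CycleError if deps has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A step only appears after every step it depends on, which is the property the DAG encodes. Steps with no dependency on each other (here, registration and batch transform) may run in either order.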

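In this post, our use case is a classic ML problem: understanding which marketing strategies, based on consumer behavior, we can adopt to increase customer retention for a given retail store. The following diagram illustrates the complete ML workflow for the churn prediction use case.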

Let's go through the accelerated ML workflow development process in detail.

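To follow along with this post, you need to download and save the sample dataset in the default Amazon Simple Storage Service (Amazon S3) bucket associated with your SageMaker session, and in the S3 bucket of your choice. For rapid experimentation or baseline model building, you can save a copy of the dataset under your home directory in Amazon Elastic File System (Amazon EFS) and follow the Jupyter notebook Customer_Churn_Modeling.ipynb.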

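The following screenshot shows the sample set, with the target variable retained set to 1 if the customer is assumed to be active, and 0 otherwise.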

Run the following code in a Studio notebook to preprocess the dataset and upload it to your own S3 bucket:

import boto3
import pandas as pd
import numpy as np

## Preprocess the dataset
def preprocess_data(file_path):
    # Reading an s3:// path with pandas requires the s3fs package
    df = pd.read_csv(file_path)
    ## Convert to datetime columns
    df["firstorder"] = pd.to_datetime(df["firstorder"], errors="coerce")
    df["lastorder"] = pd.to_datetime(df["lastorder"], errors="coerce")
    ## Drop rows with null values
    df = df.dropna()
    ## Create column which gives the days between the last order and the first order
    df["first_last_days_diff"] = (df["lastorder"] - df["firstorder"]).dt.days
    ## Create column which gives the days between when the customer record was created and the first order
    df["created"] = pd.to_datetime(df["created"])
    df["created_first_days_diff"] = (df["created"] - df["firstorder"]).dt.days
    ## Drop columns
    df.drop(["custid", "created", "firstorder", "lastorder"], axis=1, inplace=True)
    ## Apply one-hot encoding on favday and city columns
    df = pd.get_dummies(df, prefix=["favday", "city"], columns=["favday", "city"])
    return df

## Set the required configurations
model_name = "churn_model"
env = "dev"
## S3 bucket
default_bucket = "customer-churn-sm-pipeline"
## Preprocess the dataset
storedata = preprocess_data(f"s3://{default_bucket}/data/storedata_total.csv")

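With Studio notebooks and their elastic compute, you can now easily run multiple training and tuning jobs. For this use case, you use the SageMaker built-in XGBoost algorithm and SageMaker HPO with the objective function "binary:logistic" and "eval_metric":"auc".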

def split_datasets(df):
    y = df.pop("retained")
    X_pre = df
    y_pre = y.to_numpy().reshape(len(y), 1)
    feature_names = list(X_pre.columns)
    # Label goes in the first column, as the built-in XGBoost algorithm expects for CSV input
    X = np.concatenate((y_pre, X_pre), axis=1)
    np.random.shuffle(X)
    # Cut at 70% and 85%: 70% train, 15% validation, 15% test
    train, validation, test = np.split(X, [int(.7 * len(X)), int(.85 * len(X))])
    return feature_names, train, validation, test

# Split dataset
feature_names, train, validation, test = split_datasets(storedata)

# Save datasets in Amazon S3
pd.DataFrame(train).to_csv(f"s3://{default_bucket}/data/train/train.csv", header=False, index=False)
pd.DataFrame(validation).to_csv(f"s3://{default_bucket}/data/validation/validation.csv", header=False, index=False)
pd.DataFrame(test).to_csv(f"s3://{default_bucket}/data/test/test.csv", header=False, index=False)
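To make the cut points concrete, here is a plain-Python sketch of the same 70/15/15 split on toy rows. It mirrors the indices passed to np.split but uses only the standard library; the helper name split_rows, the toy rows, and the fixed seed are assumptions for illustration.

```python
import random


def split_rows(rows, cuts=(0.7, 0.85), seed=0):
    """Shuffle a copy of rows, then cut it at 70% and 85% of its length."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(cuts[0] * len(shuffled))
    second_cut = int(cuts[1] * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Toy rows in the same layout as the article's arrays: [label, feature]
rows = [[i % 2, i] for i in range(100)]
train_rows, validation_rows, test_rows = split_rows(rows)
print(len(train_rows), len(validation_rows), len(test_rows))
```

Every row lands in exactly one of the three sets, so the validation and test sets stay unseen during training.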

Use the following code to train, tune, and find the best candidate model:

import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Session, Region, and execution role (assumes the code runs in a Studio notebook)
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

# Training and Validation Input for SageMaker Training job
s3_input_train = TrainingInput(s3_data=f"s3://{default_bucket}/data/train/", content_type="csv")
s3_input_validation = TrainingInput(s3_data=f"s3://{default_bucket}/data/validation/", content_type="csv")

# Hyperparameter used
fixed_hyperparameters = {
    "eval_metric": "auc",
    "objective": "binary:logistic",
    "num_round": "100",
    "rate_drop": "0.3",
    "tweedie_variance_power": "1.4",
}

# Use the built-in SageMaker algorithm
container = sagemaker.image_uris.retrieve("xgboost", region, "0.90-2")
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    hyperparameters=fixed_hyperparameters,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/output".format(default_bucket),
    sagemaker_session=sagemaker_session,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
}
objective_metric_name = "validation:auc"
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

# Tune
tuner.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    include_cls_metadata=False,
)

## Explore the best model generated
tuning_job_result = boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)
job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)
## 10 training jobs have completed

## Get the best training job
from pprint import pprint

if tuning_job_result.get("BestTrainingJob", None):
    print("Best Model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

[Figure: hyperparameter tuning]

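After you establish a baseline, you can use Amazon SageMaker Debugger for offline model analysis. Debugger is a capability within SageMaker that automatically provides visibility into the model training process for real-time and offline analysis. Debugger periodically saves the internal model state, which you can analyze in real time during training and offline after training is complete. For this use case, you use the explainability tool SHAP (SHapley Additive exPlanations) and the native integration of SHAP with Debugger. Refer to the following notebook for detailed analysis.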

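The following summary plot explains the positive and negative relationships of the predictors with the target variable. For example, the top variable here, esent, is defined as the number of emails sent. The plot is made up of all the data points in the training set. Blue indicates that a value pulls the final output toward class 0, and pink toward class 1. The key influencing features are ranked in descending order.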

[Figure: SHAP summary plot]
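To show what a positive or negative contribution means, the sketch below computes exact Shapley values by brute force for a tiny, made-up linear scoring function. It is not the Debugger and SHAP integration used in this post: the model, its coefficients, and the instance and baseline values are all hypothetical, and only the feature names are borrowed from the dataset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {feature})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[feature] = total
    return phi


def toy_model(x):
    """Hypothetical retention score, for illustration only."""
    return 0.1 + 0.02 * x["esent"] + 0.3 * x["paperless"] - 0.2 * x["doorstep"]


instance = {"esent": 30, "paperless": 1, "doorstep": 1}
baseline = {"esent": 10, "paperless": 0, "doorstep": 0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy setup, esent and paperless push the score up and doorstep pushes it down, and the contributions add up to the difference between the prediction for the instance and the prediction for the baseline. A summary plot aggregates such per-instance contributions over the whole training set.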

You can now proceed with the deploy and manage steps of the ML workflow.

Develop and automate the workflow

Let's start with the project structure:

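/customer-churn-model – Project name
/data – The dataset
/pipelines – Code for the SageMaker pipeline components
SageMaker_Pipelines_project.ipynb – Lets you create and run the ML workflow
Customer_Churn_Modeling.ipynb – Baseline model development notebook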

[Figure: project structure]

Under <project-name>/pipelines/customerchurn, you can see the following Python scripts:

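preprocess.py – Integrates with SageMaker Processing for feature engineering
evaluate.py – Calculates model metrics, in this case auc_score
generate_config.py – Generates the dynamic configuration needed by the downstream Clarify job for model explainability
pipeline.py – Templatized code for the Pipelines ML workflow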

[Figure: code structure]

Let's walk through every step in the DAG and how each one runs. The steps are similar to those we used when we first prepared the data.

Perform data preparation with the following code:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep

# processing step for feature engineering
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    sagemaker_session=sagemaker_session,
    role=role,
)
step_process = ProcessingStep(
    name="ChurnModelProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=f"s3://{default_bucket}/output/train",
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/validation",
            destination=f"s3://{default_bucket}/output/validation",
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/test",
            destination=f"s3://{default_bucket}/output/test",
        ),
    ],
    code=f"s3://{default_bucket}/input/code/preprocess.py",
)
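The processing step above implies a simple directory contract for the script: read from the input folder and write to the train, validation, and test folders, which SageMaker then uploads to the configured S3 destinations. The following standard-library sketch illustrates that contract with a local dry run. It is not the post's preprocess.py; the function name run_processing, the toy data, and the omission of feature engineering and shuffling are assumptions.

```python
import csv
import os
import tempfile


def run_processing(base_dir):
    """Read <base_dir>/input/*.csv and write train/validation/test CSV files."""
    input_dir = os.path.join(base_dir, "input")
    rows = []
    for file_name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, file_name), newline="") as handle:
            rows.extend(csv.reader(handle))
    # Integer arithmetic for the 70% and 85% cut points
    first_cut = len(rows) * 70 // 100
    second_cut = len(rows) * 85 // 100
    parts = {
        "train": rows[:first_cut],
        "validation": rows[first_cut:second_cut],
        "test": rows[second_cut:],
    }
    for name, part in parts.items():
        out_dir = os.path.join(base_dir, name)
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, f"{name}.csv"), "w", newline="") as handle:
            csv.writer(handle).writerows(part)
    return {name: len(part) for name, part in parts.items()}


# Local dry run: a temporary directory stands in for /opt/ml/processing
with tempfile.TemporaryDirectory() as base_dir:
    os.makedirs(os.path.join(base_dir, "input"))
    with open(os.path.join(base_dir, "input", "data.csv"), "w", newline="") as handle:
        csv.writer(handle).writerows([[i % 2, i] for i in range(20)])
    counts = run_processing(base_dir)
    written = sorted(os.listdir(base_dir))
print(counts, written)
```

Inside the container, the same function would be called with base_dir set to /opt/ml/processing, matching the source and destination paths in the step definition.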

Train, tune, and find the best candidate model:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter

# training step for generating model artifacts
model_path = f"s3://{default_bucket}/output"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)
fixed_hyperparameters = {
    "eval_metric": "auc",
    "objective": "binary:logistic",
    "num_round": "100",
    "rate_drop": "0.3",
    "tweedie_variance_power": "1.4",
}
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    hyperparameters=fixed_hyperparameters,
    output_path=model_path,
    base_job_name="churn-train",
    sagemaker_session=sagemaker_session,
    role=role,
)
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
}
objective_metric_name = "validation:auc"

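You can add a model tuning step (TuningStep) to the pipeline, which automatically invokes a hyperparameter tuning job (see the following code). Hyperparameter tuning finds the best version of a model by running many training jobs on the dataset, using the algorithm and the ranges of hyperparameters that you specify. You can then register the best version of the model in the model registry using the RegisterModel step.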

from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.steps import TuningStep

## Direct Integration for HPO
step_tuning = TuningStep(
    name="ChurnHyperParameterTuning",
    tuner=HyperparameterTuner(
        xgb_train,
        objective_metric_name,
        hyperparameter_ranges,
        max_jobs=2,
        max_parallel_jobs=2,
    ),
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)
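Conceptually, a tuning job tries several hyperparameter combinations within the given ranges and keeps the one with the best objective metric. The sketch below illustrates that idea with plain random search over the same ranges. It is a deliberate simplification: the search strategy that SageMaker applies may differ, and toy_validation_score is a made-up stand-in for the validation metric of a real training job.

```python
import random

# Same ranges as hyperparameter_ranges in this post: (low, high, kind)
search_space = {
    "eta": (0.0, 1.0, "continuous"),
    "min_child_weight": (1.0, 10.0, "continuous"),
    "alpha": (0.0, 2.0, "continuous"),
    "max_depth": (1, 10, "integer"),
}


def sample_candidate(space, rng):
    """Draw one hyperparameter combination uniformly from the ranges."""
    candidate = {}
    for name, (low, high, kind) in space.items():
        if kind == "integer":
            candidate[name] = rng.randint(low, high)
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


def toy_validation_score(params):
    """Hypothetical stand-in for a training job's validation metric (not a real AUC)."""
    return 1.0 - abs(params["eta"] - 0.3) - 0.02 * abs(params["max_depth"] - 6)


def random_search(space, objective, max_jobs, seed=0):
    """Score max_jobs random candidates and return the best (score, params) pair."""
    rng = random.Random(seed)
    trials = [sample_candidate(space, rng) for _ in range(max_jobs)]
    scored = [(objective(params), params) for params in trials]
    return max(scored, key=lambda pair: pair[0])


best_score, best_params = random_search(search_space, toy_validation_score, max_jobs=10)
print(best_score, best_params)
```

Here max_jobs plays the same role as in the tuner: it caps how many candidates are evaluated before the best one is selected.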

[Figure: SageMaker Pipelines HPO step]

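After you tune the model, you can use branching logic that depends on the tuning job's objective metric when orchestrating the workflow. For this post, the conditional step for the model quality check is as follows: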

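from sagemaker.workflow.condition_step import JsonGet
from sagemaker.workflow.conditions import ConditionGreaterThan

# condition step for evaluating model quality and branching execution
# (the variable keeps its original name, although the condition is "greater than")
cond_lte = ConditionGreaterThan(
    left=JsonGet(
        step=step_eval,
        property_file=evaluation_report,
        json_path="classification_metrics.auc_score.value",
    ),
    right=0.75,
)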
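The condition reads a single number from the evaluation report and compares it with 0.75. The following sketch shows how such a report could be produced and checked locally. The nested key layout follows the json_path above; the AUC helper, the toy labels, and the toy scores are assumptions for illustration rather than the post's evaluate.py.

```python
import json


def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities, made up for illustration
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.5, 0.1, 0.7, 0.3]

# Report layout matching the json_path "classification_metrics.auc_score.value"
report = {"classification_metrics": {"auc_score": {"value": auc_score(labels, scores)}}}
report_json = json.dumps(report)

# Mirror of the condition: read the value at the JSON path and compare it with 0.75
auc = json.loads(report_json)["classification_metrics"]["auc_score"]["value"]
register_model = auc > 0.75
print(auc, register_model)
```

In the pipeline, the outcome of this comparison decides whether the registration and scoring branch runs.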

Use the RegisterModel step to register the best candidate model for batch scoring:

from sagemaker.workflow.step_collections import RegisterModel

step_register = RegisterModel(
    name="RegisterChurnModel",
    estimator=xgb_train,
    # top_k=0 selects the best-performing training job of the tuning step
    model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=default_bucket, prefix="output"),
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name=model_package_group_name,
    model_metrics=model_metrics,
)

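Now that the model is trained, let's see how Clarify helps us understand which features the model bases its predictions on. You can create an analysis_config.json file dynamically for each workflow run using the generate_config.py utility. You can version and track the config file per pipeline runId and store it in Amazon S3 for further reference. Initialize the dataconfig and modelconfig files as follows: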

import sagemaker.clarify

data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=f"s3://{args.default_bucket}/output/train/train.csv",
    s3_output_path=args.bias_report_output_path,
    label=0,
    headers=[
        "target", "esent", "eopenrate", "eclickrate", "avgorder", "ordfreq",
        "paperless", "refill", "doorstep", "first_last_days_diff",
        "created_first_days_diff", "favday_Friday", "favday_Monday",
        "favday_Saturday", "favday_Sunday", "favday_Thursday", "favday_Tuesday",
        "favday_Wednesday", "city_BLR", "city_BOM", "city_DEL", "city_MAA",
    ],
    dataset_type="text/csv",
)
model_config = sagemaker.clarify.ModelConfig(
    model_name=args.modelname,
    instance_type=args.clarify_instance_type,
    instance_count=1,
    accept_type="text/csv",
)
model_predicted_label_config = sagemaker.clarify.ModelPredictedLabelConfig(probability_threshold=0.5)
bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="doorstep",
    facet_values_or_threshold=[0],
)
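To give a feel for what a bias report measures for the facet configured above, the sketch below computes one pre-training metric by hand: the difference in proportions of positive labels between customers with doorstep equal to 0 and all other customers. The records are made up for illustration, and the helper is a simplified reading of the metric, not Clarify's implementation.

```python
def difference_in_proportions_of_labels(records, facet, facet_value, label, positive):
    """Proportion of positive labels outside the facet group minus the proportion inside it."""
    group_d = [r for r in records if r[facet] == facet_value]
    group_a = [r for r in records if r[facet] != facet_value]
    q_d = sum(r[label] == positive for r in group_d) / len(group_d)
    q_a = sum(r[label] == positive for r in group_a) / len(group_a)
    return q_a - q_d


# Toy records (hypothetical): the label column is "target", the facet is "doorstep"
records = [
    {"target": 1, "doorstep": 1},
    {"target": 1, "doorstep": 1},
    {"target": 1, "doorstep": 1},
    {"target": 0, "doorstep": 1},
    {"target": 1, "doorstep": 0},
    {"target": 0, "doorstep": 0},
    {"target": 0, "doorstep": 0},
    {"target": 0, "doorstep": 0},
]

dpl = difference_in_proportions_of_labels(
    records, facet="doorstep", facet_value=0, label="target", positive=1
)
print(dpl)
```

A value near 0 means both groups are retained at similar rates in the data. A value far from 0, as in this toy set, signals an imbalance worth reviewing before you interpret the model's predictions.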

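After you add the Clarify step to the pipeline as a postprocessing job using sagemaker.clarify.SageMakerClarifyProcessor, you can see a detailed feature and bias analysis report for each pipeline run.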

[Figure: Clarify report and SageMaker Studio UI screenshots]

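As the final step of the pipeline workflow, you can use the TransformStep step for offline scoring. Pass in the transformer instance and the TransformInput with the batch_data pipeline parameter defined earlier: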

from sagemaker.inputs import TransformInput
from sagemaker.transformer import Transformer
from sagemaker.workflow.steps import TransformStep

# step to perform batch transformation
transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{default_bucket}/ChurnTransform",
)
step_transform = TransformStep(
    name="ChurnTransform",
    transformer=transformer,
    inputs=TransformInput(data=batch_data, content_type="text/csv"),
)

Finally, you can trigger a new pipeline run by choosing Start an execution on the Studio IDE interface.

[Figure: starting a pipeline execution in Studio]

You can also describe a pipeline run or start the pipeline from the notebook. The following screenshot shows our output.

[Figure: output of describing the pipeline execution]

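You can schedule your SageMaker model building pipeline runs using Amazon EventBridge. SageMaker model building pipelines are supported as a target in Amazon EventBridge. This allows you to trigger a pipeline run based on any event in your event bus. EventBridge enables you to automate your pipeline runs and respond automatically to events such as training job or endpoint status changes. Events include a new file being uploaded to your S3 bucket, a change in status of your SageMaker endpoint due to drift, and Amazon Simple Notification Service (Amazon SNS) topics.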
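As a hedged sketch of what such a trigger could look like, the following code builds the request payloads for an EventBridge rule that matches new objects in a bucket and targets a pipeline. It only constructs dictionaries and calls no AWS API. The ARNs, the rule and target names, the account ID, and the BatchData parameter name are placeholders assumed for illustration.

```python
import json

# Placeholder identifiers (assumptions for illustration only)
pipeline_arn = "arn:aws:sagemaker:us-east-1:111122223333:pipeline/churn-pipeline"
events_role_arn = "arn:aws:iam::111122223333:role/example-eventbridge-role"
bucket_name = "customer-churn-sm-pipeline"

# Event pattern: match "Object Created" events from Amazon S3 for this bucket
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": [bucket_name]}},
}

# Payload for the EventBridge PutRule operation
put_rule_request = {
    "Name": "churn-pipeline-on-new-data",
    "EventPattern": json.dumps(event_pattern),
    "State": "ENABLED",
}

# Payload for the EventBridge PutTargets operation, with the pipeline as the target
put_targets_request = {
    "Rule": put_rule_request["Name"],
    "Targets": [
        {
            "Id": "churn-pipeline-target",
            "Arn": pipeline_arn,
            "RoleArn": events_role_arn,
            "SageMakerPipelineParameters": {
                "PipelineParameterList": [
                    # "BatchData" is a hypothetical pipeline parameter name
                    {"Name": "BatchData", "Value": f"s3://{bucket_name}/data/batch/"}
                ]
            },
        }
    ],
}

print(json.dumps(put_targets_request, indent=2))
```

With boto3, these payloads could be passed to the put_rule and put_targets methods of the events client. The bucket also needs EventBridge notifications turned on, and the role must be allowed to start the pipeline.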

Conclusion

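This post explained how to use SageMaker Pipelines with other built-in SageMaker features and the XGBoost algorithm to develop, iterate on, and deploy the best candidate model for churn prediction. For instructions on implementing this solution, see the GitHub repo. You can also clone and extend this solution with additional data sources for model retraining. We encourage you to reach out to your AWS account manager to discuss your ML use cases.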

About the authors

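Gayatri Ghanakota is a Machine Learning Engineer with AWS Professional Services. She is passionate about developing, deploying, and explaining AI/ML solutions across various domains. Prior to this role, she led multiple initiatives as a data scientist and ML engineer with top global firms in the financial and retail space. She holds a master's degree in Computer Science, specializing in Data Science, from the University of Colorado, Boulder.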

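Sarita Joshi is a Senior Data Scientist with AWS Professional Services, focused on supporting customers across industries including retail, insurance, manufacturing, travel, life sciences, media and entertainment, and financial services. She has several years of experience as a consultant advising clients across many industries and technical domains, including AI, ML, analytics, and SAP. Today, she works passionately with customers to develop and implement machine learning and AI solutions at scale.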