In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.
This post walks through an end-to-end example of training an XGBoost model on SageMaker and shows how to enable the automatic XGBoost report functionality in SageMaker Debugger to quickly evaluate model performance and identify areas for improvement. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs based on the information provided by the report. The code from this post is available in the GitHub repo.
For this example, we use the dataset from the Kaggle ATLAS Higgs Boson Machine Learning Challenge 2014. With this dataset, we train a machine learning (ML) model to automatically distinguish Higgs boson events from other events (such as background noise) generated from simulated proton-proton collisions in CERN’s Large Hadron Collider. The data can be obtained directly from CERN. Let’s go through the steps of obtaining the data and configuring the training job. You can follow along with a Jupyter notebook.
- We start with the relevant imports:
- Then we set up variables that we later need to configure the SageMaker training job:
- We obtain data and prepare it for training:
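As a rough illustration of the data preparation step only, the following sketch converts rows into the headerless, label-first CSV layout that the SageMaker built-in XGBoost algorithm expects for CSV input. The column names (Label, EventId, Weight), the s/b label encoding, and the sample values are assumptions about the dataset layout, and to_xgboost_rows and convert are hypothetical helpers rather than code from the original notebook.

```python
import csv
import io

# Assumed (hypothetical) column names; adjust them to match the actual file.
LABEL_COLUMN = "Label"
DROP_COLUMNS = {"EventId", "Weight"}


def to_xgboost_rows(reader):
    """Yield rows with the numeric label first, followed by the features."""
    for row in reader:
        # Assumed encoding: "s" (signal) -> 1, anything else (background) -> 0.
        label = 1 if row[LABEL_COLUMN] == "s" else 0
        features = [
            value
            for name, value in row.items()
            if name != LABEL_COLUMN and name not in DROP_COLUMNS
        ]
        yield [label, *features]


def convert(source_text):
    """Convert CSV text with a header into headerless, label-first CSV text."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(to_xgboost_rows(reader))
    return out.getvalue()


# Two made-up rows for illustration only.
sample = (
    "EventId,DER_mass_MMC,PRI_tau_pt,Weight,Label\n"
    "1,138.47,32.6,0.002,s\n"
    "2,160.9,42.0,2.2,b\n"
)
print(convert(sample))
```

The converted text can then be written to a file and uploaded to Amazon S3 as the training or validation channel.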
Setting up a training job with XGBoost training report
We only need to make one code change to the typical process for launching a training job: adding the create_xgboost_report rule to the Estimator. SageMaker takes care of the rest. A companion SageMaker processing job spins up to analyze the XGBoost model and produce the report. This analysis is done at no additional cost. See the following additional code:
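A minimal sketch of what attaching the rule could look like with the SageMaker Python SDK is shown below. The hyperparameter values, the instance type, the container version, and the build_estimator helper are illustrative assumptions, not the settings used in this post. The SDK imports sit inside the function so that the snippet can be loaded even where the SDK isn't installed.

```python
# Illustrative hyperparameters (assumptions, not the values used in this post).
HYPERPARAMETERS = {
    "objective": "binary:logistic",
    "num_round": "100",
    "eval_metric": "logloss",
}


def build_estimator(role, output_path, region, instance_type="ml.m5.2xlarge"):
    """Create an XGBoost Estimator with the training report rule attached."""
    # Imported lazily; these require the SageMaker Python SDK.
    import sagemaker
    from sagemaker.debugger import Rule, rule_configs
    from sagemaker.estimator import Estimator

    # Container version is an assumption; pick one available in your Region.
    image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1")
    return Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        output_path=output_path,
        hyperparameters=HYPERPARAMETERS,
        # The one addition compared with a typical training job:
        rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
    )
```

With an estimator built this way, training would be started as usual by calling fit with the S3 locations of the training and validation channels.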
Analyzing models with the XGBoost training report
When the training job is complete, SageMaker automatically starts the processing job to generate the XGBoost report. We write a few lines of code to check the status of the processing job. When the job is complete, we download the report to our local drive for further review. The following code downloads the report upon its completion, and provides a hyperlink directly within the notebook for easy viewing:
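The status check and download could be sketched roughly as follows. The sketch assumes that the rule is registered under the name CreateXgboostReport, that rule outputs are written under a rule-output prefix below the training job's output path, and that the summaries have the shape returned by the SDK's rule_job_summary method. The helper names are hypothetical.

```python
def rule_status(rule_summaries, rule_name="CreateXgboostReport"):
    """Return the evaluation status for a rule, or None if it is not listed."""
    for summary in rule_summaries:
        if summary.get("RuleConfigurationName") == rule_name:
            return summary.get("RuleEvaluationStatus")
    return None


def rule_output_uri(output_path, training_job_name):
    """Build the S3 prefix under which rule outputs are assumed to be stored."""
    return f"{output_path.rstrip('/')}/{training_job_name}/rule-output"


def download_reports(estimator, local_dir="reports"):
    """Download the rule outputs for the estimator's latest training job."""
    from sagemaker.s3 import S3Downloader  # requires the SageMaker Python SDK

    job = estimator.latest_training_job
    uri = rule_output_uri(estimator.output_path, job.job_name)
    S3Downloader.download(uri, local_dir)
    return uri


# Made-up summaries in the shape assumed above.
example = [
    {"RuleConfigurationName": "CreateXgboostReport", "RuleEvaluationStatus": "InProgress"},
    {"RuleConfigurationName": "ProfilerReport-1612345678", "RuleEvaluationStatus": "NoIssuesFound"},
]
print(rule_status(example))
```

In a notebook, rule_status could be polled with the output of estimator.latest_training_job.rule_job_summary() until the status is no longer InProgress, after which download_reports fetches the files.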
Before we dive into the training report, let’s take a quick look at the SageMaker Debugger profiler report, which by default is generated after every training job. This report provides key metrics around resource utilization such as network, I/O, and CPU. In the following example, we can see the median CPU utilization was at around 55% while memory utilization was consistently under 5%. This suggests that we could reduce costs by using a smaller training instance.
Now let’s dive into the training report. SageMaker Debugger automatically generates the following key insights on our model:
- Distribution of labels – Detects imbalanced datasets
- Loss graph – Detects over-fitting or over training
- Feature importance metrics – Identifies redundant or uninformative features
- Confusion matrix and evaluation metrics – Evaluates performance at the individual class level and identifies concentrations of errors
- Accuracy rate per iteration – Shows how accuracy improved for each class over each round of boosting
- Receiver operating characteristic curve – Shows how the model performs under different probability thresholds
- Distribution of residuals – Helps determine if residuals are a result of random error or missing information
We pick a few items from the report for demonstration purposes.
Distribution of true labels of the dataset
This visualization shows the distribution of labeled classes (for classification) or values (for regression) in your original dataset. An imbalanced dataset could result in poor predictive performance unless properly handled. In this particular example, there’s a slight imbalance between the negative and positive label.
Loss vs. step graph
This visualization compares the loss on the training dataset against the loss on the validation dataset. This particular model appears to be over-fitting on the training set: the validation loss remains relatively flat after about 30 boosting rounds, even though the training loss continues to decrease.
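To make the idea of stopping once the validation loss flattens concrete, here is a small sketch of patience-based early stopping applied to a list of per-round validation losses. The best_round helper and the synthetic loss curve are illustrative assumptions. In practice, the SageMaker built-in XGBoost algorithm exposes an early_stopping_rounds hyperparameter for this purpose.

```python
def best_round(validation_losses, patience=10, min_delta=0.0):
    """Return the 0-based round with the best validation loss.

    The scan stops once `patience` consecutive rounds have failed to improve
    on the best loss by more than `min_delta`.
    """
    best_index = 0
    best_loss = float("inf")
    for index, loss in enumerate(validation_losses):
        if loss < best_loss - min_delta:
            best_index, best_loss = index, loss
        elif index - best_index >= patience:
            break
    return best_index


# Synthetic curve: improves for 30 rounds, then sits slightly higher and flat.
losses = [1.0 - 0.02 * i for i in range(30)] + [0.45] * 70
print(best_round(losses))  # prints 29
```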
Feature importance
This visualization shows you feature importance by weight, gain, and coverage. Gain, which measures the relative contribution of each feature, is typically the most relevant one for most use cases. For this particular model, we see that a handful of features provide the bulk of the contribution, while a large number contribute little to no gain to the model’s predictive performance. It’s usually a good practice to drop uninformative features from the model because they add noise and may result in over-fitting.
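One simple way to shortlist candidates for removal is to flag features whose share of the total gain falls below a threshold. The sketch below assumes a dictionary mapping feature names to gain, such as the one returned by the XGBoost Booster method get_score with importance_type="gain". The 1% threshold, the low_gain_features helper, and the gain values are made up for illustration.

```python
def low_gain_features(gain_by_feature, min_share=0.01):
    """Return features whose share of total gain is below `min_share`, sorted by name."""
    total = sum(gain_by_feature.values())
    if total <= 0:
        return sorted(gain_by_feature)
    return sorted(
        name for name, gain in gain_by_feature.items() if gain / total < min_share
    )


# Made-up gain values for illustration only.
gains = {"f0": 120.0, "f1": 75.0, "f2": 0.5, "f3": 4.0, "f4": 0.5}
print(low_gain_features(gains))  # prints ['f2', 'f4']
```

Any feature flagged this way should still be checked against a validation set after removal, because a low gain share doesn't guarantee that dropping the feature helps.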
Confusion matrix and ROC curve
There are a number of additional visualizations that show you the common things data scientists often look at, such as the confusion matrix, ROC curve, and F1 score. For more information, see Debugger XGBoost Training Report Walkthrough.
From the following confusion matrix, we can see that the model does a better job at predicting class 0 than class 1. This can be explained by the imbalanced label distribution we showed at the beginning (there are more instances of class 0 than class 1). One remedy is to make the label distribution more balanced via data resampling techniques.
SageMaker Debugger automatically generates and reports the performance metrics such as F1 score and accuracy. You can also see a classification report, such as the following.
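For readers less familiar with these metrics, the following sketch shows how precision, recall, F1 score, and accuracy follow from the four cells of a binary confusion matrix. The counts are made up for illustration and are not taken from the report in this post; binary_metrics is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common metrics for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts: 100 actual negatives and 100 actual positives.
metrics = binary_metrics(tn=80, fp=20, fn=40, tp=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case, recall (0.600) is lower than precision (0.750), which is the pattern you would expect when a model misses many positive (signal) events.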
From the training report’s outputs, we can see several areas where the model can be fine-tuned to improve performance, notably the following:
- The loss vs. step graph indicates that the validation error stopped improving after about 30 rounds, so we can reduce the number of boosting rounds or enable early stopping to mitigate over-training.
- The feature importance graph shows a large number of uninformative features that could potentially be removed to reduce over-fitting and improve predictive performance on unseen datasets.
- Based on the confusion matrix and the classification report, the recall score is somewhat low, meaning we’ve misclassified a large number of signal events. Tuning the scale_pos_weight parameter to adjust for the imbalance in the dataset could help improve this.
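As a starting point for the last item, the XGBoost documentation suggests considering the ratio of negative to positive instances as a typical value for scale_pos_weight. The sketch below computes that ratio from a list of 0/1 labels. The label counts are made up for illustration, and suggested_scale_pos_weight is a hypothetical helper.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    counts = Counter(labels)
    positives, negatives = counts[1], counts[0]
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Made-up label counts: 66 background events and 34 signal events.
labels = [0] * 66 + [1] * 34
print(round(suggested_scale_pos_weight(labels), 2))  # prints 1.94
```

The resulting value is only an initial guess; it is usually worth tuning around it while monitoring recall and precision on the validation set.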
In this post, we generated an XGBoost training report and a profiler report using SageMaker Debugger. These reports automatically gave us insight into both model performance and resource utilization during training. We then walked through the XGBoost training report and identified a number of issues that we can alleviate with some hyperparameter tuning.
About the Authors
Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.
Lu Huang is a Senior Product Manager on the AWS Deep Engine team, managing SageMaker Debugger.
Satadal Bhattacharjee is Principal Product Manager at AWS AI. He leads the machine learning engine PM team on projects such as SageMaker and optimizes machine learning frameworks such as TensorFlow, PyTorch, and MXNet.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Nihal Harish is an engineer at AWS AI. He loves working at the intersection of distributed systems and machine learning. Outside of work, he enjoys long distance running and playing tennis.