Boost your ML Workload Performance with Migration to Graviton-Powered Instances
Graviton3 is AWS's next-generation ARM-compatible server processor range, designed to deliver lower energy consumption and higher performance for CPU-based machine learning workloads.
But do Graviton3 processors live up to the hype, and are their core benefits really worth a migration?
This article will help you find the answers.
What are Graviton3 server processors? - An overview
AWS Graviton is a family of processors from AWS that offers a better price-performance combination and greater energy efficiency than comparable x86-based instances.
AWS Graviton3 processors are optimized for ML workloads: they provide twice the Single Instruction Multiple Data (SIMD) bandwidth of Graviton2 and add support for bfloat16. Together, these two features let Graviton3 deliver up to three times the ML performance of the older Graviton2 instances.
Graviton3 uses up to 60% less energy than comparable Amazon Elastic Compute Cloud (Amazon EC2) instances for the same performance, helping organizations reduce their carbon footprint and achieve sustainability goals.
Amazon SageMaker, AWS's managed machine-learning platform, added eight new Graviton2- and Graviton3-based machine learning (ML) instance families that give customers more options for optimizing cost and performance when deploying their ML models on SageMaker.
Key Benefits of Graviton3
Graviton3 offers many advantages, namely:
- Graviton3 processors provide 3x better performance than AWS Graviton2 processors for ML workloads
- Due to higher efficiency, fewer instances are needed to run the same workload, reducing cloud resource consumption
- Extensive software support
- Enhanced security for cloud applications
- Available with managed AWS Services and Amazon SageMaker
- Offers the best price-performance ratio for a broad range of workloads, including ML workloads
- Flexible coding, as most application languages and open-source software services support Graviton instances
SageMaker support for Graviton
SageMaker provides Graviton deep learning containers that are performance-optimized for the TensorFlow and PyTorch frameworks. These containers support generic deep and wide model-based inference use cases in addition to natural language processing, recommendations, and computer vision.
SageMaker also offers containers for the classical ML frameworks XGBoost and Scikit-learn. The containers are binary compatible across instance families such as c6g/m6g and c7g, making inference application migration smooth.
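Because the containers are binary compatible, moving an endpoint from Graviton2 to Graviton3 can amount to changing a single instance-type string in the endpoint configuration. The sketch below illustrates this with boto3-style `create_endpoint_config` parameters; the model and endpoint names are placeholders, not values from this article.

```python
# Hypothetical sketch: migrating a SageMaker endpoint from Graviton2
# (ml.c6g) to Graviton3 (ml.c7g). Only the instance type changes;
# the model container stays the same.

def endpoint_config(model_name: str, instance_type: str) -> dict:
    """Build create_endpoint_config parameters for one instance type."""
    return {
        "EndpointConfigName": f"{model_name}-{instance_type.replace('.', '-')}",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,       # placeholder name
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
    }

# Same model, two instance generations -- only InstanceType differs.
old_config = endpoint_config("my-resnet-model", "ml.c6g.4xlarge")
new_config = endpoint_config("my-resnet-model", "ml.c7g.4xlarge")

# With boto3, each dict would be passed as:
#   sagemaker_client.create_endpoint_config(**new_config)
```

The boto3 call itself is left commented out, since the point is that everything except `InstanceType` is identical between the two configurations.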
c6g/m6g instances support fp16 (half-precision float) and deliver better performance than c5 instances for compatible models. c7g boosts ML performance further by doubling the SIMD width and adding bfloat16 (bf16) support for running models.
With bfloat16 support on c7g, you can deploy models trained with bf16 or with AMP (Automatic Mixed Precision). In addition, Graviton's ARM Compute Library (ACL) backend provides bfloat16 kernels that can accelerate fp32 operators via fast math mode, without model quantization.
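In practice, fast math mode is switched on through environment variables that must be set before the framework is imported, because the backend is selected at load time. A minimal sketch for a TensorFlow process (the framework import is shown commented out, as it is not part of this snippet):

```python
# Sketch: enabling the oneDNN backend and bfloat16 fast math mode for
# an fp32 model on Graviton3. Set the variables before importing the
# framework, since backends are chosen when the library loads.
import os

os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"        # select the oneDNN/ACL backend
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"  # run fp32 ops through bf16 kernels

# import tensorflow as tf                        # import after the env vars
# model = tf.keras.models.load_model("resnet50") # fp32 model, no quantization needed
```

The same two variables appear again later in this article, passed to a serving container at launch time rather than set in-process.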
Recommended best practices
There are some best practices one can use when running ML-based workloads on AWS Graviton instances with AWS SageMaker.
On Graviton instances, every vCPU is a physical core, so workload performance scales linearly with each additional vCPU. Hence, AWS recommends using batch inference whenever the use case permits it. Batch inference makes efficient use of the vCPUs by processing the batch in parallel across the physical cores.
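The idea can be sketched as a batch handler that fans requests out across one worker per vCPU. The `predict` function here is a placeholder for a real model call; real framework kernels release the GIL during computation, which is why a thread pool scales for inference workloads.

```python
# Sketch: processing an inference batch in parallel, one worker per
# vCPU. On Graviton every vCPU is a physical core, so throughput
# scales roughly linearly with the number of workers.
import os
from concurrent.futures import ThreadPoolExecutor

def predict(sample: float) -> float:
    # Placeholder for real model inference on a single input.
    return sample * 2.0

def batch_predict(batch: list) -> list:
    workers = os.cpu_count() or 1  # one worker per vCPU / physical core
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, batch))

results = batch_predict([1.0, 2.0, 3.0, 4.0])
```

For CPU-bound pure-Python work a process pool would be the idiomatic choice instead; the thread pool is used here only because framework inference releases the GIL.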
If batch inference is not feasible, you will need to find the optimal instance size for a given payload, so that the OS thread-scheduling overhead does not outweigh the computing power that comes with the additional vCPUs.
Using oneDNN with ACL is recommended to get the most optimal inference backend, as TensorFlow uses Eigen kernels by default.
The ARM-based TensorFlow package supports two backends: the default Eigen backend and the faster oneDNN/ACL backend.
Figure 2. ARM-based TensorFlow software stack with two backends (Eigen and oneDNN/ACL). Source: ARM Community
The oneAPI Deep Neural Network Library (oneDNN) is a popular open-source, cross-platform performance library for deep learning applications; it integrates with TensorFlow, PyTorch, and other frameworks. The ARM Compute Library (ACL) is an ARM-optimized library with over 100 optimized machine-learning functions, including multiple convolution algorithms.
You can enable the oneDNN backend and the bfloat16 fast math mode when launching the container service:
docker run -p 8501:8501 --name tfserving_resnet \
--mount type=bind,source=/tmp/resnet,target=/models/resnet \
-e MODEL_NAME=resnet -e TF_ENABLE_ONEDNN_OPTS=1 \
-e DNNL_DEFAULT_FPMATH_MODE=BF16 -t tfs:mkl_aarch64
The preceding serving command hosts a standard resnet50 model with two critical configurations: TF_ENABLE_ONEDNN_OPTS=1 selects the oneDNN backend, and DNNL_DEFAULT_FPMATH_MODE=BF16 enables the bfloat16 fast math mode.
One can pass these to the inference container in the following way:
ModelName="Your model name",
ExecutionRoleArn="ARN for AmazonSageMaker-ExecutionRole"
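The two parameters above are part of a larger `create_model` request, in which the oneDNN/bf16 settings travel to the container via the `Environment` field. The sketch below assumes boto3-style parameters; the image URI and model-data path are placeholders, and the model name and role ARN are the same placeholders used above.

```python
# Hypothetical sketch: a boto3-style create_model request that passes
# the oneDNN and bfloat16 fast-math settings to a SageMaker inference
# container. Image URI and S3 path are placeholders.

create_model_params = {
    "ModelName": "Your model name",
    "ExecutionRoleArn": "ARN for AmazonSageMaker-ExecutionRole",
    "PrimaryContainer": {
        "Image": "<graviton-dlc-image-uri>",          # placeholder
        "ModelDataUrl": "<s3-path-to-model-artifacts>",  # placeholder
        "Environment": {
            "TF_ENABLE_ONEDNN_OPTS": "1",        # oneDNN/ACL backend
            "DNNL_DEFAULT_FPMATH_MODE": "BF16",  # bf16 fast math mode
        },
    },
}

# With boto3 this dict would be passed as:
#   sagemaker_client.create_model(**create_model_params)
```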
Comparison of ML Benchmark AWS Graviton3 (c7g) vs AWS Graviton2 (c6g)
The Neoverse-V1 in AWS Graviton3 (c7g) has wider vector units and vastly improves the ML performance compared to the Neoverse-N1 in AWS Graviton2 (c6g).
The graphs below clearly show AWS c7g (with Neoverse-V1) outperforming AWS c6g (with Neoverse-N1) by 2.2x for RESNET-50 and 2.4x for BERT-Large for real-time inference performance.
Figure A: Resnet-50 v1.5 real-time inference performance with a Batch Size = 1. This is from a c7g.4xlarge instance cluster using AWS Graviton3 processors and a c6g.4xlarge instance cluster with AWS Graviton2 processors. Higher is better.
Figure B: BERT-Large real-time inference performance gained by a c7g.4xlarge instance cluster with the AWS Graviton3 processors versus a c6g.4xlarge instance cluster with AWS Graviton2 processor sets. Higher is better.
ML Benchmark AWS Graviton3 (c7g) vs. AWS Intel Ice Lake (c6i)
The following charts summarize the performance comparison of AWS Graviton3 (c7g) versus the AWS Intel Ice Lake (c6i) platforms.
The AWS c7g (based on ARM Neoverse V1) clearly outperforms the c6i instances, based on Intel Ice Lake x86 processors, by 1.8x for BERT-Large and 1.3x for RESNET-50 FP32 models. The better performance comes from the BF16 instructions available on c7g, which are missing in the current c6i generation.
Figure C: Resnet-50 v1.5 real-time inference performance (Batch Size = 1) achieved by a c7g.4xlarge instance cluster with AWS Graviton3 processors and by a c6i.4xlarge instance cluster with 3rd Gen Intel Xeon Scalable processors. Higher is better.
Figure D: BERT-Large real-time inference performance achieved by a c7g.4xlarge instance cluster with AWS Graviton3 processors and by a c6i.4xlarge instance cluster with 3rd Gen Intel Xeon Scalable processors. Higher is better.
Graviton-based instances offer the best price-performance combination, at a lower price than comparable x86-based instances. As with EC2 instances, SageMaker inference endpoints on ml.c6g (Graviton2) instances are priced 20% lower than ml.c5, and the Graviton3-based ml.c7g instances are 15% cheaper than ml.c6 instances. For more information, refer to Amazon SageMaker Pricing.
Migrating to AWS Graviton-powered instances makes sense if you want to boost your ML workload performance at an optimal combination of price and performance. Neurons Lab is an AWS Advanced Partner and helps customers accelerate and scale their adoption of AWS Graviton so they can realize the price-performance benefits sooner, across more workloads.