OctoML Announces Faster Machine Learning Model Inferencing on Apple’s New M1 Than Apple’s Core ML 4 Provides

SEATTLE, Dec. 16, 2020 (GLOBE NEWSWIRE) -- OctoML, the MLOps automation company for superior model performance, portability and productivity, today demonstrated better model performance on Apple’s M1 chip than Apple’s own Core ML inference engine delivers. OctoML’s results showed lower model latency than any of Apple’s own software, ranging from a 30% improvement over Apple’s latest Core ML 4 inference engine to a 13x improvement over Apple’s standard Core ML 3. All comparisons were based on the BERT-base model, a deep learning model widely used for natural language processing tasks, and were conducted on both the CPU and the GPU of the Mac Mini.

“Apple is great at showcasing their newest products for the most cutting-edge ML uses,” said Luis Ceze, co-founder and CEO of OctoML. “But in practice, machine learning engineers can struggle to achieve good performance and may spend months trying to manually debug issues. In contrast, OctoML’s work shows how an automated model optimization service can effortlessly add new hardware and immediately deliver superior model performance.”

Apple’s latest Core ML 4 resulted in 139 milliseconds of latency on the CPU and 59 milliseconds of latency on the GPU. In contrast, OctoML’s work delivered model latency of 108 milliseconds on the CPU and 42 milliseconds on the GPU. These results represent a 22% improvement on the CPU and a nearly 30% improvement on the GPU, and are especially notable because they were produced automatically and only weeks after Apple’s public launch of the M1 chip.
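
The percentages above can be reproduced from the quoted latencies. The short Python sketch below is illustrative only: it assumes "improvement" means the reduction in latency relative to the Core ML 4 baseline, and the function and variable names are hypothetical rather than taken from OctoML's benchmark code.

```python
def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Percent reduction in latency relative to the baseline (assumed definition)."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0


# Latencies in milliseconds, as quoted in this release for BERT-base on the M1.
core_ml_4 = {"cpu": 139.0, "gpu": 59.0}
octoml_tvm = {"cpu": 108.0, "gpu": 42.0}

# Compute the reduction for each device against the Core ML 4 baseline.
reductions = {
    device: latency_reduction_pct(core_ml_4[device], octoml_tvm[device])
    for device in core_ml_4
}

for device, pct in reductions.items():
    print(f"{device.upper()}: {pct:.1f}% lower latency")
# CPU: 22.3% lower latency
# GPU: 28.8% lower latency
```

Under that definition, the CPU figure rounds to 22% and the GPU figure to just under 30%, matching the numbers stated above.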

Other performance comparisons included Keras with MLCompute and TensorFlow with GraphDef. With Keras, latency on the M1 was 579 milliseconds on the CPU and 1,767 milliseconds on the GPU. With TensorFlow, latency on the M1 was 512 milliseconds on the CPU and 543 milliseconds on the GPU.

“How did we get better results than Apple’s Core ML 4 in just a few weeks? Two reasons,” said Bing Xu, Principal Engineer at OctoML. “First, the Apache TVM compiler uses a machine learning-based auto-scheduler to search out CPU and GPU code optimization. Second, the Apache TVM compiler is able to automatically fuse qualified subgraphs and directly generate code. Even the best engineers can’t anticipate the interactions between model architectures, computation workloads and hardware target resource availability.”

Additional Resources

Read OctoML’s “On the Apple M1, Beating Apple's Core ML 4 With 30% Model Performance Improvements” blog (with visual chart) here: https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d
Visit OctoML’s public GitHub repository of the benchmarking process here: https://github.com/octoml/Apple-M1-BERT

About OctoML
OctoML applies cutting-edge machine learning-based automation to make it easier and faster for machine learning teams to put high-performance machine learning models into production on any hardware. OctoML, founded by the creators of the Apache TVM machine learning compiler project, offers seamless optimization and deployment of machine learning models as a managed service. For more information, visit https://octoml.ai or follow @octoml.

Media and Analyst Contact:
Amber Rowland
amber@therowlandagency.com
+1-650-814-4560
