TensorFlow service performance optimization on Skylake CPUs


Taboola is the leading recommendation engine in the $80-billion-plus open web marketplace. The company’s platform, powered by deep learning and artificial intelligence, reaches nearly 600 million daily active users. Our ML inference infrastructure consists of seven large-scale Kubernetes clusters spread across ten on-premises data centers, totaling tens of thousands of CPU cores.

Over the past few months, we’ve seen an increase in Kubernetes worker-node load averages that initially seemed to track increased traffic levels. However, further investigation revealed significant discrepancies in the ratio of server load to requests between data centers. Closer examination revealed a strong correlation with the proportion of Skylake CPUs in each cluster, particularly in Chicago and Amsterdam, where Skylake usage was significantly higher (46% and 75%, respectively) than in other locations (20-30%).

The Skylake CPU Frequency Challenge

It has been observed and documented on the web that Skylake CPUs drop their clock frequency when executing AVX512 instructions, and the impact intensifies as more cores use them. With TensorFlow 1.x we had shipped custom-compiled binaries with -mtune/-march flags chosen to enable or disable AVX512 per CPU model; the migration to TensorFlow 2.x initially seemed to make this unnecessary, delivering good performance out of the box.

Root Cause Discovery: oneDNN’s JIT and AVX512

Finding the root cause in an environment with heterogeneous CPU families, unchanged models, and fluctuating traffic levels can be challenging, so our troubleshooting journey started with the basics: verifying that the servers’ governor was configured for performance, confirming that the servers’ tuning profile had not changed, and examining environmental factors such as ambient temperature to rule out external influences. A closer look at the performance metrics showed that all servers in the data center had an average increase in load. (We use a “least requests” load-balancing algorithm to compensate for heterogeneous CPU types.)

Interestingly, Skylake-based servers were hit hardest: they showed both reduced frequency and a degradation in inference requests handled per CPU core.

Next, we focused on investigating the use of AVX512 in Skylake servers. Initial probes using arch_status under the /proc file system suggested minimal AVX512 involvement:
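A check along these lines can be sketched as follows. This is a minimal illustration, assuming a kernel built with CONFIG_PROC_PID_ARCH_STATUS and a serving process named tensorflow_model_server (the process name is an assumption for the sketch):

```shell
#!/usr/bin/env bash
# Sketch: read the kernel's per-process AVX512 usage estimate from
# /proc/<pid>/arch_status. Assumes CONFIG_PROC_PID_ARCH_STATUS; the process
# name "tensorflow_model_server" is illustrative.

# Extract the AVX512_elapsed_ms value from arch_status-formatted input.
avx512_ms() { awk -F: '/AVX512_elapsed_ms/ { gsub(/[ \t]/, "", $2); print $2 }'; }

for pid in $(pgrep -f tensorflow_model_server); do
  [ -r "/proc/${pid}/arch_status" ] || continue
  echo "pid ${pid}: AVX512_elapsed_ms=$(avx512_ms < "/proc/${pid}/arch_status")"
done
```

A value of -1 means AVX512 was used recently; 0 suggests it was not, but as noted next, that can be misleading.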

But as the documentation points out, false negatives are possible.
Running a more thorough analysis with perf uncovered a contrasting picture:
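The measurement can be sketched as follows. The core_power.lvl*_turbo_license events are available on Skylake-SP; level 2 roughly corresponds to heavy AVX512 use and level 1 to AVX2/light AVX512. The helper that computes the ratio from the perf output is our own illustration, not part of perf:

```shell
#!/usr/bin/env bash
# Sketch: measure turbo-license activity system-wide with perf, then compute
# the lvl2:lvl1 ratio. Run the perf command on a Skylake-SP host, e.g.:
#
#   perf stat -a -e core_power.lvl0_turbo_license \
#                -e core_power.lvl1_turbo_license \
#                -e core_power.lvl2_turbo_license -- sleep 30

# Compute the lvl2/lvl1 ratio from perf stat output (counts in column 1).
license_ratio() {
  awk '/lvl1_turbo_license/ { gsub(",", "", $1); l1 = $1 }
       /lvl2_turbo_license/ { gsub(",", "", $1); l2 = $1 }
       END { if (l1 > 0) printf "%.2f\n", l2 / l1 }'
}
```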

The perf output revealed substantial level-2 turbo-license activity attributable to AVX512; the ratio of level-2 to level-1 license cycles was roughly 4 to 5.

In parallel, we were in the process of upgrading our TensorFlow workloads from version 2.7 to 2.13. Analyzing some TensorFlow 2.7 workloads with perf showed much lower AVX512 usage, with a ratio of roughly 1 to 3.

Note that in our environment, the TensorFlow processes are not pinned to specific cores, so the results include a fair amount of CPU-time “noise” from other processes sharing the same physical cores.

However, the TensorFlow release code did not explicitly enable AVX512 compilation; only the AVX and SSE4.2 flags were present in the .bazelrc file. The key discovery came from the TensorFlow 2.9 release notes: oneDNN was now enabled by default. This library detects supported CPU instruction sets such as AVX512 at run time and uses just-in-time (JIT) compilation to generate optimized code.

Workaround: Limiting the oneDNN instruction set

Following the oneDNN documentation, we implemented a workaround that limits the instruction set based on the CPU family: a simple check in the image’s entrypoint bash script detects the CPU type and sets an environment variable that restricts oneDNN to AVX2:
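A minimal sketch of that entrypoint check is below. ONEDNN_MAX_CPU_ISA is the documented oneDNN run-time ISA cap; detecting Skylake-class CPUs via the avx512f flag in /proc/cpuinfo is a simplification (matching the exact model name would be more precise), and the server path at the end is an assumption:

```shell
#!/usr/bin/env bash
# Sketch: cap oneDNN at AVX2 on AVX512-capable CPUs to avoid frequency drops.
# The avx512f-flag heuristic and binary path are illustrative assumptions.

# Decide the ISA cap from a CPU-flags string (argument), so it is testable.
isa_cap() {
  case " $1 " in
    *" avx512f "*) echo "AVX2" ;;   # Skylake-SP class: restrict oneDNN to AVX2
    *)             echo "DEFAULT" ;;
  esac
}

CPU_FLAGS="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
if [ "$(isa_cap "$CPU_FLAGS")" = "AVX2" ]; then
  export ONEDNN_MAX_CPU_ISA=AVX2
fi
# exec /opt/tf_serving/tensorflow_model_server "$@"   # illustrative path
```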

Result: increased performance and reduced load

Implementing this fix resulted in a noticeable improvement:

  • ~11% increase in average CPU frequency.
  • ~11% decrease in the 15-minute maximum load average.
  • ~10-15% increase in inference requests per CPU core used.

Key takeaways:

  • Prioritize CPU compatibility: When deploying deep learning workloads across multiple CPU architectures, carefully evaluate instruction set support and potential performance implications.
  • Stay informed about framework updates: Regularly review the release notes and changelogs of frameworks like TensorFlow to understand the features introduced and their potential impact, especially on specific hardware configurations.
  • Take advantage of performance analysis tools: Use tools like perf to get detailed information about CPU behavior and identify performance bottlenecks.

Let’s Be Greedy: Compiling TensorFlow with -march Tuning


After successfully tuning the instruction set for Skylake, we wanted to see whether we could squeeze a few more drops of performance out of it.
We compiled several TensorFlow Serving binaries with different tuning options by adding the following to .bazelrc and passing the relevant build parameter at build time:
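The additions looked roughly like this; the config names and the exact -march/-mtune values are illustrative, not the precise flags we shipped:

```
# Illustrative per-CPU-family build configs in .bazelrc
build:skylake   --copt=-march=skylake-avx512 --copt=-mtune=skylake-avx512
build:broadwell --copt=-march=broadwell     --copt=-mtune=broadwell
```

Each variant was then built by passing the matching config, e.g. `bazel build --config=skylake ...`.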

After pushing the build artifacts to our local repository, we created a custom image and COPYed the various binaries into it.

We then configured the ENTRYPOINT to launch a bash script that executes the appropriate binary based on the CPU family.
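A sketch of such a selector script is below. The binary paths and the model-name patterns are assumptions for illustration, not our exact production script:

```shell
#!/usr/bin/env bash
# Illustrative entrypoint: pick the binary variant matching the CPU family.
# Paths and model-name patterns are assumptions for this sketch.

# Map a /proc/cpuinfo "model name" string to a binary suffix (testable helper).
cpu_flavor() {
  case "$1" in
    *Skylake*|*"Gold 61"*|*"Platinum 81"*) echo "skylake" ;;
    *)                                     echo "generic" ;;
  esac
}

MODEL_NAME="$(grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
FLAVOR="$(cpu_flavor "$MODEL_NAME")"
# exec "/opt/tf_serving/tensorflow_model_server.${FLAVOR}" "$@"  # illustrative
```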

After deploying the new image to production and monitoring its performance, we did not see any further improvement; it seems the oneDNN library already does the heavy lifting.

Moving forward: continuous monitoring and adjustment

Our experience highlights the importance of proactive performance monitoring and optimization, especially in heterogeneous infrastructures. We’ll continue to closely monitor cluster performance, explore deeper optimizations where appropriate, and stay abreast of developments in TensorFlow and related libraries. We are already excited to test the impact of AMX mixed precision on our servers with the Intel Sapphire Rapids CPU family.