Resolving discrepancies between training and serving


Introduction

In machine learning, a critical challenge is ensuring that the features used during model training (offline) match those used at inference time (online serving). Discrepancies between training and serving features can significantly degrade model performance, so it is crucial to identify and address them as quickly as possible.

At Taboola, we specialize in content discovery and native advertising, enabling users to find and interact with personalized content on the web. Our advanced machine learning-based recommendation systems deliver billions of recommendations every day, helping publishers, advertisers and brands reach their target audiences effectively.

In this blog post, I’ll discuss the challenge of training-serving feature mismatches in machine learning models and how they can affect model performance. I’ll explain how we address these discrepancies at Taboola, including the design and implementation of our solution, Sherlock. Finally, I’ll share some of the key discoveries Sherlock has made and the significant impact it has had on our recommender systems.

Discrepancies?? No way…

Discrepancies between training and serving features can occur for a number of reasons. The most common is the way we handle large features. Because of their significant storage requirements, these features are often not reported back as-is after being served in the online environment. Instead, they are recalculated before training from sampled data and various data sources. The recalculated features may differ from the values originally computed during serving, whether because of differences in the data, changes in external data between serving time and training time, or differences in calculation logic.

Another cause of discrepancies is cache misses and database query timeouts in the online serving environment. For Taboola to return recommendations to the client within hundreds of milliseconds, we need to wrap database queries in a very tight time limit. These errors and timeouts can leave certain features without values in serving, while in the reporting pipeline the same features do have values because its timeouts are less strict.
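
To make this failure mode concrete, here is a minimal Python sketch. The feature store, the feature names, the latency numbers, and `fetch_feature` are all hypothetical; the code only simulates a lookup that gives up once its time budget is exceeded, and is not how our serving stack is implemented.

```python
from typing import Optional

# Hypothetical feature store: key -> (value, simulated lookup latency in ms).
FEATURE_STORE = {
    "user_ctr_7d": (0.042, 5),
    "publisher_category": ("sports", 120),
}


def fetch_feature(key: str, budget_ms: int) -> Optional[object]:
    """Return the feature value, or None if the lookup exceeds the time budget."""
    value, latency_ms = FEATURE_STORE[key]
    return value if latency_ms <= budget_ms else None


# Online serving uses a tight budget; the offline pipeline can afford to wait.
serving = {key: fetch_feature(key, budget_ms=50) for key in FEATURE_STORE}
training = {key: fetch_feature(key, budget_ms=5000) for key in FEATURE_STORE}

# Features whose serving value differs from the value seen offline.
mismatched = [key for key in FEATURE_STORE if serving[key] != training[key]]
print(mismatched)
```

In this toy run the slow lookup is missing online but present offline, so the model would be trained on a value it never saw at serving time.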

Finally, integration errors can introduce further discrepancies. Poorly integrated components or faulty data pipelines can cause mismatches between the features used during training and those available during serving.

Meet Sherlock

To address this problem, we developed Sherlock, a robust system designed to detect and alert on training-serving feature discrepancies in our models. Developed as part of our ongoing effort to improve the reliability and performance of our recommender systems at Taboola, Sherlock allows us to detect discrepancies quickly and address them promptly, ensuring consistent and accurate model performance.

How Sherlock works

The Sherlock pipeline consists of three main parts:

  1. Sample: Save a sample of the feature values that were used in serving.
  2. Preprocess: Compute the training feature values.
  3. Compare: Compare the serving and training feature values and generate warnings and alerts based on the results.

Figure 1: Main parts of Sherlock

Stage 1 – Sampling

Saving all feature values from the serving environment is impractical because of the huge amount of data it would generate. To address this challenge, we implemented a sampling strategy in which feature values are reported for 1 in 10,000 requests handled by Taboola. This approach significantly reduces the volume of data while still providing a representative sample of the feature values used in serving.
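
One simple way to pick roughly 1 in 10,000 requests is to hash a request identifier and keep the requests that land in a single bucket. The sketch below assumes this approach for illustration only; how the sampling decision is actually made is not described here, and `should_sample` and the `req-…` identifiers are hypothetical.

```python
import hashlib

SAMPLE_RATE = 10_000  # report feature values for roughly 1 in 10,000 requests


def should_sample(request_id: str, rate: int = SAMPLE_RATE) -> bool:
    """Deterministically select about 1/rate of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % rate
    return bucket == 0


# Count how many of 500,000 synthetic requests would be sampled.
sampled = sum(should_sample(f"req-{i}") for i in range(500_000))
print(sampled)  # on the order of 50, since 500,000 / 10,000 = 50
```

Hashing makes the decision repeatable for a given request, which is convenient when debugging; a plain random draw with probability 1/10,000 would give the same volume reduction.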

The sampled serving feature values are included in the pageview objects in HDFS. A pageview is a record that encapsulates various details about a user’s visit to a web page, including the recommendations served, user activities such as clicks, and more.

To further speed up the process, we run a Spark job once per hour that aggregates the pageviews containing serving feature values. This job collects all relevant pageviews from the previous hour that have sampled data and copies them to a dedicated location on HDFS, making them easily accessible for the next stage of Sherlock’s workflow.
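
The selection logic of that hourly job can be sketched in plain Python. The real job runs on Spark and reads from HDFS; the record layout, the `sampled_features` field name, and both helper functions below are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def previous_hour_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) covering the last complete clock hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def select_sampled_pageviews(pageviews: List[Dict], now: datetime) -> List[Dict]:
    """Keep pageviews from the previous hour that carry sampled feature values."""
    start, end = previous_hour_window(now)
    return [
        pv for pv in pageviews
        if start <= pv["timestamp"] < end and pv.get("sampled_features")
    ]


now = datetime(2024, 1, 1, 10, 5)
pageviews = [
    {"id": "pv-1", "timestamp": datetime(2024, 1, 1, 9, 30),
     "sampled_features": {"country": "us"}},
    {"id": "pv-2", "timestamp": datetime(2024, 1, 1, 9, 45)},  # not sampled
    {"id": "pv-3", "timestamp": datetime(2024, 1, 1, 8, 59),
     "sampled_features": {"country": "de"}},  # outside the window
]
selected_ids = [pv["id"] for pv in select_sampled_pageviews(pageviews, now)]
print(selected_ids)  # ['pv-1']
```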

Figure 2: sampling

Stage 2 – Preprocessing

Preprocessing refers to a group of Spark jobs designed to prepare data for model training jobs. These jobs read pageview data from HDFS based on a range of dates and perform a range of operations such as filtering, enrichment with external data sources, and calculations. The result is a complete dataset containing all model features, along with training and test folders full of TFRecords used as inputs to the training job. All these outputs are stored in HDFS.

For Sherlock, our main focus is the feature dataset. To adapt to Sherlock’s requirements, we developed a special mode within our existing preprocessing jobs, characterized by the following key points:

  1. Reading input data from sampled pageviews: The job reads input data exclusively from the path written in stage 1, which contains only the pageviews that carry serving feature values.
  2. No filtering or sampling: Unlike our standard preprocessing before a training job, this special mode processes all of the data without any filtering or sampling.
  3. Feature calculation: The job computes all training features exactly as it would in a standard production preprocessing run.
  4. Including sampled serving feature values: In addition to the calculated features, the job includes the sampled serving features exactly as they were recorded. The result is a dataset in which each feature is represented twice: once with the serving value and once with the Preprocess calculation (the value that would have been used during model training).

Figure 3: pre-process
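
The shape of that output can be illustrated with a small Python sketch. The feature logic, the field names, and the `serving_` / `training_` column prefixes are hypothetical stand-ins, not our preprocessing code, which runs as Spark jobs.

```python
def compute_training_features(pageview: dict) -> dict:
    """Hypothetical stand-in for the standard preprocessing feature logic."""
    return {
        "title_length": len(pageview["title"]),
        "country": pageview["country"].upper(),
    }


def build_paired_row(pageview: dict) -> dict:
    """Emit every feature twice: the recorded serving value and the recomputed one."""
    recomputed = compute_training_features(pageview)
    recorded = pageview["sampled_features"]
    row = {"pageview_id": pageview["id"]}
    for name in sorted(set(recomputed) | set(recorded)):
        row[f"serving_{name}"] = recorded.get(name)
        row[f"training_{name}"] = recomputed.get(name)
    return row


pageview = {
    "id": "pv-1",
    "title": "Ten travel tips",
    "country": "us",
    # Values exactly as they were recorded at serving time.
    "sampled_features": {"title_length": 15, "country": "us"},
}
row = build_paired_row(pageview)
print(row)
```

Here `title_length` agrees on both sides, while `country` differs in casing ("us" versus "US"), which is the kind of pair the next stage is meant to flag.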

Stage 3 – Comparison

The comparison stage is the core of Sherlock, where the training and serving values of each feature are analyzed and discrepancies are detected. This job is implemented in Python and leverages the Pandas package, enabling efficient and fast comparison of large amounts of data.

Because there are several types of features, such as integers, strings, lists, and more, some comparisons are simple while others are more complex. Basic types are detected automatically from the data, while complex types are defined per feature in configuration files. Each feature is configured with warning and alert thresholds to identify significant discrepancies. Additional metrics, such as the percentage of out-of-vocabulary (OOV) values, are also calculated for each feature.
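
A minimal sketch of type-aware comparison follows. It uses plain Python rather than Pandas so that it stays self-contained, and the type names (`"float"`, `"unordered_list"`), the tolerance, and all function names are assumptions rather than Sherlock’s actual configuration schema.

```python
import math
from typing import Iterable, Set, Tuple


def values_match(kind: str, serving, training, tol: float = 1e-6) -> bool:
    """Compare one serving/training pair according to the feature's type."""
    if serving is None or training is None:
        return serving is None and training is None
    if kind == "float":
        return math.isclose(serving, training, rel_tol=tol, abs_tol=tol)
    if kind == "unordered_list":  # same elements, order does not matter
        return sorted(serving) == sorted(training)
    return serving == training  # integers, strings, ordered lists


def mismatch_rate(kind: str, pairs: Iterable[Tuple[object, object]]) -> float:
    """Fraction of (serving, training) pairs that disagree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(not values_match(kind, s, t) for s, t in pairs) / len(pairs)


def oov_rate(values: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of values that fall outside the known vocabulary."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v not in vocabulary for v in values) / len(values)


pairs = [([1, 2], [2, 1]), ([3], [3, 4]), ([5], [5]), (None, [6])]
list_rate = mismatch_rate("unordered_list", pairs)
category_oov = oov_rate(["sports", "news", "zzz", "news"], {"sports", "news"})
print(list_rate, category_oov)  # 0.5 0.25
```

The resulting rate per feature is what would then be checked against that feature’s warning and alert thresholds.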

Once the comparison is complete, the results are uploaded to Google BigQuery and also written to our easily accessible logs. To automate monitoring, we’ve defined scheduled queries in BigQuery that check for features whose comparison results exceed their warning or alert thresholds. For features that exceed the warning threshold, a Slack message is sent to a dedicated channel. For features that exceed the alert threshold, a Slack message is sent along with a PagerDuty notification to the person on call, ensuring quick attention to critical discrepancies.
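
The routing rule can be sketched as follows. In production this decision is made by scheduled BigQuery queries; the Python below only illustrates the logic, and the result rows, threshold values, and channel names are made up for the example. No Slack or PagerDuty API is called.

```python
from typing import List


def severity(mismatch_rate: float, warning: float, alert: float) -> str:
    """Classify a feature by which threshold its mismatch rate exceeds."""
    if mismatch_rate > alert:
        return "alert"
    if mismatch_rate > warning:
        return "warning"
    return "ok"


def channels_for(level: str) -> List[str]:
    """Warnings go to Slack only; alerts go to Slack and page the on-call person."""
    routing = {"ok": [], "warning": ["slack"], "alert": ["slack", "pagerduty"]}
    return routing[level]


# Hypothetical comparison results with per-feature thresholds.
results = [
    {"feature": "country", "mismatch_rate": 0.002, "warning": 0.01, "alert": 0.05},
    {"feature": "title_length", "mismatch_rate": 0.02, "warning": 0.01, "alert": 0.05},
    {"feature": "category_ids", "mismatch_rate": 0.5, "warning": 0.01, "alert": 0.05},
]
notifications = {
    r["feature"]: channels_for(severity(r["mismatch_rate"], r["warning"], r["alert"]))
    for r in results
}
print(notifications)
```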

By implementing these automated checks and notifications, Sherlock ensures that any significant discrepancies between training and serving features are quickly identified and addressed, maintaining the integrity and performance of our machine learning models.

Figure 4: comparison

Sherlock’s discoveries and impact

Since its release, Sherlock has identified several broken features, allowing us to make critical fixes that have significantly improved our models. Here are some examples of the types of errors that Sherlock has discovered:

  1. Mismatched feature-creation logic: A common problem was the use of different logic to create features during serving and training. The solution was to make sure the same function was used on both sides, eliminating duplicated logic that could mistakenly be changed on only one side.
  2. Inconsistent feature calculations: In some cases, features were calculated one way for serving and then recalculated with different logic for inclusion in the pageview, resulting in a broken reported value. The solution was to calculate the feature only once and use the same value in both places. When this was not feasible, we ensured that the calculation logic remained consistent.
  3. Configuration differences: We have had several cases where the feature-creation logic was the same for serving and training, but the configuration was not. For example, the OOV values were not equal on both sides, resulting in inconsistent models.
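
The first and third fixes share one idea: a single function and a single configuration object, both imported by the serving and training code paths. The sketch below is a hypothetical illustration of that pattern; `CategoryFeatureConfig`, `category_feature`, and the `<OOV>` token are invented names, not our production code.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CategoryFeatureConfig:
    """One configuration object shared by serving and training."""
    vocabulary: FrozenSet[str]
    oov_token: str = "<OOV>"


def category_feature(raw: str, config: CategoryFeatureConfig) -> str:
    """The single implementation of the feature, used on both sides."""
    value = raw.strip().lower()
    return value if value in config.vocabulary else config.oov_token


SHARED_CONFIG = CategoryFeatureConfig(vocabulary=frozenset({"sports", "news"}))


def serving_features(request: dict) -> dict:
    return {"category": category_feature(request["category"], SHARED_CONFIG)}


def training_features(pageview: dict) -> dict:
    return {"category": category_feature(pageview["category"], SHARED_CONFIG)}


print(serving_features({"category": " Sports "}))   # {'category': 'sports'}
print(training_features({"category": "finance"}))   # {'category': '<OOV>'}
```

Because neither side has its own copy of the logic or of the OOV setting, a change to either one cannot land on only one side.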

One of the most notable fixes repaired a broken feature and resulted in a significant 1.31% increase in Taboola’s global revenue per mille (RPM). This improvement highlights the substantial impact Sherlock has had on our system performance.

Another major impact of Sherlock is that it allows us to make the right calls regarding our models. For example, when Sherlock was first turned on, it flagged several features as broken. These features had been on the verge of being removed from our models because of their perceived lack of value (which, of course, was because they were broken). Instead of removing them, we fixed the features and saw the positive impact they had once they worked properly.

Conclusion

Discrepancies between training and serving features can be a real pain, reducing the accuracy of machine learning models and potentially leading to financial losses. When we noticed that features that had previously increased our models’ RPM stopped doing so, it became clear that we needed a solution like Sherlock.

Sherlock’s impact has been substantial for Taboola. By ensuring our features are free of discrepancies, it has improved the reliability and accuracy of our machine learning models. Sherlock allows us to quickly identify and fix bugs, even during the development and testing phases of new features. It also provides a safety net that alerts us if a feature is accidentally broken and helps maintain the integrity of our recommendation systems.


