OpenAI recently published a paper on fine-tuning GPT-2, where they used Scale AI to collect the preferences of human labelers to improve their language models. Although we were already labeling millions of text and computer vision tasks per day at the time, the unique latency requirements and subjective nature of OpenAI’s tasks posed a new challenge for us. In particular: how do you scalably maintain the quality of labels, without having labelers check each other’s work? Today we’re sharing a deep dive into our approach to the problem, the automatic benchmark mining system we built to solve it, and the things we learned along the way. In doing so, we hope to illustrate some of the many challenges that make scalable data labeling such an interesting area of work.
The Problem
First, let’s take a look at one of these tasks. Given a snippet of text and some possible continuations for it, the researchers at OpenAI wanted humans to pick the continuation that best fit a given prompt, such as “which continuation has the most positive sentiment” or “which continuation is the most descriptive”.
But instead of labeling this data in a big batch like we usually do (offline), OpenAI wanted to put their model in a closed loop with our labelers, so the data was labeled online: the model would generate some text samples, humans would rate them through our API, the model would train on those preferences, and the process would repeat over a few days.
This meant that in order for an experiment to have enough epochs of training, and still finish quickly, the tasks needed to be done with very low latency (<30 mins) at high throughput (~5,000 labels/hr). As we will explain later, these constraints were unique and required us to build additional functionality into our data labeling pipeline.
Furthermore, the answers to these tasks were also extremely subjective. As an example, here is a real task where one has to pick which continuation is the most descriptive:
Can you pick an answer with certainty?
This ambiguity makes it hard to tell the people that are making a genuine effort apart from the people (or bots!) that are guessing randomly. However, we must make that determination correctly in order to scale this up to thousands of labelers, and still deliver high quality data to our customer.
Measuring Quality at Scale
When we first built our labeling pipeline, our customers wanted us to prioritize data quality over other factors such as fast turnaround SLAs. Now we had to figure out a way to keep our high quality bar while cutting our turnaround time from days to minutes.
One reason for our longer turnaround time was redundancy. On most projects, we send each task through a pipeline of multiple people, either in series or in parallel. Having this redundancy greatly reduces the probability of errors. It also allows us to estimate each person's accuracy by grading their responses against the completed version of the task.
The other reason for our longer turnaround time was lagging quality indicators. We usually hold each batch of tasks in a staging state before it is sent back to a customer, so that our confidence indicators on labelers can catch up based on new information. We can then relabel tasks we have low confidence in.
However, due to OpenAI’s latency and throughput constraints, we would not have been able to use either of these features of our architecture — tasks would have to be immediately sent back to the customer after a single person had attempted them.
Fortunately, having people review other people’s work is just one of the ways we get signal on labeler quality. For example, on some computer vision projects, one feature that determines labeler quality is based on localization loss from a deep learning model.
Benchmark Tasks
Another way we measure quality is using a mechanism called benchmark tasks. The idea is to first collect high-confidence responses to a subset of tasks (called benchmark tasks, or a golden dataset in the literature), and then use them to estimate the quality of active labelers. The benchmark tasks are sprinkled among new tasks served to people, and disguised to be indistinguishable from them. We can then guarantee that we never return tasks to a customer from labelers who do not meet the quality threshold on benchmark tasks.
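To make the mechanism concrete, here is a minimal sketch of grading labelers on benchmark tasks and gating on a threshold. It is an illustration under stated assumptions, not our production code: the function names, the response tuple format, and the `min_accuracy` and `min_attempts` values are all hypothetical.

```python
from collections import defaultdict

def benchmark_accuracy(responses, benchmark_answers):
    """Return {labeler_id: (correct, attempted)}, counting benchmark tasks only.

    responses: iterable of (labeler_id, task_id, answer) tuples (assumed format).
    benchmark_answers: {task_id: known answer} for the benchmark subset.
    """
    stats = defaultdict(lambda: [0, 0])
    for labeler_id, task_id, answer in responses:
        if task_id not in benchmark_answers:
            continue  # a regular task: there is no known answer to grade against
        stats[labeler_id][1] += 1
        if answer == benchmark_answers[task_id]:
            stats[labeler_id][0] += 1
    return {labeler: tuple(counts) for labeler, counts in stats.items()}

def trusted_labelers(stats, min_accuracy=0.8, min_attempts=3):
    """Labelers with enough benchmark attempts and accuracy at or above the bar."""
    return {
        labeler
        for labeler, (correct, attempted) in stats.items()
        if attempted >= min_attempts and correct / attempted >= min_accuracy
    }
```

Only responses from labelers in the returned set would be sent back to the customer. Requiring a minimum number of attempts is a design choice in this sketch: it avoids trusting a labeler on the strength of one lucky answer.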
We had been successfully using benchmark tasks as another way of measuring quality, but these were typically created manually by our in-house quality team. This manual creation would not have worked for this project for a few reasons:
We wanted to support OpenAI in creating new projects and iterating quickly on experiments without being blocked on the Scale team creating benchmarks.
Due to the ambiguous nature of the tasks, it was harder to manually find tasks for which we could have high certainty in the answer.
To solve these problems, we augmented our pipeline with a system that automatically mines and maintains a set of benchmark tasks.
Once a project came in, our system would take a small subset of tasks from it, and have each one done by multiple trusted labelers. If nearly all of them agreed that one of the responses on a task was correct, we would make that task a benchmark task, with that response as its answer. The broader remote workforce would only be allowed on the project once we had enough benchmarks.
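The mining step described above can be sketched roughly as follows. This is a simplified illustration, and the names and thresholds (`min_votes`, `min_agreement`, `min_benchmarks`) are hypothetical values chosen for the example, not the ones we use.

```python
from collections import Counter

def mine_benchmarks(trusted_responses, min_votes=5, min_agreement=0.8):
    """Promote tasks to benchmarks when trusted labelers nearly all agree.

    trusted_responses: {task_id: [answer from each trusted labeler]}.
    Returns {task_id: consensus answer} for tasks that clear the agreement bar.
    """
    benchmarks = {}
    for task_id, answers in trusted_responses.items():
        if len(answers) < min_votes:
            continue  # not enough trusted opinions yet
        top_answer, top_count = Counter(answers).most_common(1)[0]
        if top_count / len(answers) >= min_agreement:
            benchmarks[task_id] = top_answer
    return benchmarks

def ready_for_workforce(benchmarks, min_benchmarks=50):
    """Open the project to the broader workforce only once enough benchmarks exist."""
    return len(benchmarks) >= min_benchmarks
```

Tasks that fail the agreement check are not discarded in this sketch. They simply go back into the regular pool, since disagreement among trusted labelers usually signals ambiguity rather than error.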
For added safety, only people we had the most confidence in were selected for benchmark generation. This is in line with a general principle we use in designing our operational systems: *leverage people we have highest trust in*. Here, if the most accurate labelers are working on creating benchmarks, their work can then be used to validate the work of a much larger set of people.
Since we only pick tasks that reach consensus, unambiguous tasks are more likely to become benchmarks than ambiguous ones with no clear answer. To an extent, this is helpful because we can be confident our evaluation of quality has low noise. The caveat is that this set of benchmark tasks is no longer representative of our original set of tasks.
In this particular case, this was fine, since we were hoping to eliminate malicious and under-qualified labelers, while still allowing for expression of human preference and subjectivity. The need to understand this relationship between test question difficulty and false positives/negatives is one that recurs frequently for us, and has also been studied extensively in fields such as Item Response Theory.
Making This Work
With a complex system that includes free-willed actors (human labelers) with unpredictable behavior, every new feature needs multiple iterations — cycles of deploying changes, observing the system’s reaction to them, and then responding to new developments — in order to make it actually work. The benchmark generation system was no exception. Here we go over two of the many improvements that we made.
As with any job, even the highest quality labelers can be inconsistent in quality from one day to the next. Initially, we distributed benchmark tasks at a fixed uniform sample rate, which meant that it took a long time before our system could confidently adjust its quality read for a given labeler. We evolved our logic for serving benchmark tasks to instead vary its rate dynamically, and serve more benchmarks as soon as a dip in quality was suspected.
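One simple way to vary the benchmark rate is to tie it to a labeler's recent benchmark accuracy. The sketch below assumes a linear ramp between a base rate and a maximum rate; the function name, the window size, and every numeric default are assumptions made for illustration, and the real logic may look quite different.

```python
def benchmark_rate(recent_results, base_rate=0.05, max_rate=0.5,
                   target_accuracy=0.9, window=20):
    """Fraction of served tasks that should be benchmarks for one labeler.

    recent_results: list of bools (True = benchmark answered correctly),
    most recent last. Only the last `window` results are considered.
    """
    recent = recent_results[-window:]
    if not recent:
        return max_rate  # no evidence yet, so check heavily
    accuracy = sum(recent) / len(recent)
    if accuracy >= target_accuracy:
        return base_rate  # quality looks fine, sample lightly
    # Ramp the rate up linearly as accuracy falls below the target.
    shortfall = (target_accuracy - accuracy) / target_accuracy
    return min(max_rate, base_rate + shortfall * (max_rate - base_rate))
```

Compared with a fixed rate, this spends benchmark tasks where the uncertainty is: a labeler whose accuracy dips sees more benchmarks right away, so the quality estimate converges faster in either direction.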
Things ran smoothly until we got to the descriptiveness project, in which labelers had to pick the most descriptive continuation among 4 options. Our quality audit showed a dip, confirmed by the fact that the average time people were spending on these tasks had also started to fall.
What was special about this project?
It turns out that the correct answer (the most descriptive option) had an extremely high likelihood of also being the longest option. This was true often enough that a strategy of “pick the longest option” would be within our tolerance of quality, and prove far more lucrative than acting honestly.
This demonstrates the need to care about the distribution of benchmark tasks. By explicitly including and weighting anti-bias examples (in this case, examples where the longest option is not correct) in our benchmark set, we could correct for this issue.
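A quick way to detect this kind of exploit is to score the shortcut strategy itself against the benchmark set. The following sketch does that for "pick the longest option", and shows how up-weighting anti-bias examples pushes the shortcut's score down. The function names and the weighting scheme are hypothetical, chosen to illustrate the idea rather than describe our implementation.

```python
def longest_option(options):
    """Index of the longest option (the shortcut strategy under test)."""
    return max(range(len(options)), key=lambda i: len(options[i]))

def shortcut_score(benchmarks, shortcut=longest_option, anti_bias_weight=1.0):
    """Weighted accuracy a labeler would get by always following the shortcut.

    benchmarks: list of (options, correct_index) pairs.
    Anti-bias examples are those where the shortcut picks a wrong answer;
    they count `anti_bias_weight` times as much as the other examples.
    """
    earned = total = 0.0
    for options, correct in benchmarks:
        is_anti_bias = shortcut(options) != correct
        weight = anti_bias_weight if is_anti_bias else 1.0
        total += weight
        if not is_anti_bias:
            earned += weight
    return earned / total
```

If `shortcut_score` lands above the quality threshold, the benchmark set cannot tell an honest labeler from one following the shortcut, and it needs more (or more heavily weighted) anti-bias examples.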
A Note on Testing and Monitoring
Every distributed system needs to be tested and robustly monitored, and a human-based distributed system like ours is no different.
How do we test changes if our system depends on unpredictable human behavior? At Scale, we usually test things like that by launching “synthetic projects” — a sandboxed copy of a real project we’ve already done before, with a similar population of labelers as the real project. This allows us to experiment and iterate quickly without fear of disrupting the actual labeling pipeline for a customer. We leveraged synthetic projects to tune our parameters, test hypotheses and also at one point to keep our workforce “warmed up” by giving them tasks to do as we waited for the customer to send in more tasks.
In addition to this, we need to monitor the system and output quality to ensure that there are no regressions. This is how we quickly caught and responded to some of the issues described above. We set up alerts for when inter-labeler agreement rate, throughput or average time per task fell outside the expected range of values.
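A range-based alert of this kind can be sketched in a few lines. The metric names and the expected ranges below are made-up placeholders, not our real thresholds, and a real system would send the result to a paging or chat tool rather than return a list.

```python
# Hypothetical expected ranges: (low, high), both inclusive.
EXPECTED_RANGES = {
    "agreement_rate": (0.70, 1.00),
    "labels_per_hour": (4000, float("inf")),
    "avg_seconds_per_task": (15, 120),
}

def out_of_range(metrics, expected=EXPECTED_RANGES):
    """Return one alert message per metric that is missing or outside its range."""
    alerts = []
    for name, (low, high) in expected.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no data")
        elif not low <= value <= high:
            alerts.append(f"{name}: {value} outside [{low}, {high}]")
    return alerts
```

Note that average time per task has both a lower and an upper bound in this sketch. The lower bound is the one that matters for catching labelers who have stopped reading the task, as in the descriptiveness project above.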
In addition to automated alerts, we need constant qualitative monitoring on all of our projects, which is done by our in-house quality team. This project was no different. To check this system, we set up a process to audit labelers close to our filtering boundary based on benchmarks. This allowed them to alert us if the system was erroneously filtering out labelers who were good (false positives) or not filtering out labelers below quality (false negatives).
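Selecting labelers for this kind of audit might look like the sketch below, which picks everyone whose benchmark accuracy falls within a margin of the filtering threshold. The `threshold` and `margin` values, and the function name, are assumptions made for the example.

```python
def audit_candidates(accuracies, threshold=0.8, margin=0.05):
    """Labelers whose benchmark accuracy is within `margin` of the threshold.

    accuracies: {labeler_id: benchmark accuracy in [0, 1]}.
    Returns (labeler_id, accuracy, current_decision) tuples, closest to the
    threshold first, so auditors start with the least certain decisions.
    """
    near = [
        (abs(accuracy - threshold), labeler, accuracy)
        for labeler, accuracy in accuracies.items()
        if abs(accuracy - threshold) <= margin
    ]
    near.sort()
    return [
        (labeler, accuracy, "passing" if accuracy >= threshold else "filtered")
        for _, labeler, accuracy in near
    ]
```

A "filtered" labeler whom the auditors judge to be good is a false positive, and a "passing" labeler whom they judge to be below the bar is a false negative. Labelers far from the threshold are skipped because the benchmark signal on them is much less likely to be wrong.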
Finally, since we had access to a small sample of data labeled by the researchers at OpenAI themselves, we set up a daily cron job on our async job service that would have the same sample re-labeled by workers, and alert us if the quality on this sample was lower than desired.
What's Next
Since building the prototype of this system and learning from it, we have incorporated it into all natural language and categorization use cases as well as parts of our computer vision pipeline. Using a Bayesian model, we have also improved the accuracy of the classifier that determines whether a labeler meets our quality bar. We will publish blog posts describing those efforts soon.
This post emphasizes how we're solving quality measurement, but there's a whole world of other problems we're tackling that are also vital to the data labeling process, such as: using ML to make data labeling orders of magnitude faster, scalably teaching complex instructions, building state of the art 3D point cloud rendering tools, and setting up the data infrastructure to support all these systems. If you’re interested in joining us in solving these problems, and powering the next generation of ML applications, take a look at our careers page for the latest open positions. If you have projects that require high quality data labeling, let our team know!