Data is the lifeblood of today’s enterprise, but as many organizations are discovering, the quality of that data increasingly matters more than the quantity. Highly trustworthy training data is crucial for the success of machine learning initiatives. Organizations that attempt to train models on lower-quality data are finding that accuracy ultimately suffers. Even a small amount of bad, inaccurate, or outdated data can prevent these models from ever becoming fully optimized and useful.
To use an analogy, consider what happens when adults try to learn a new language. We’ve been “trained” on our native tongues for decades, and in the process we’ve picked up a lot of preconceived notions about how language should work. Because this biased training data is so deeply ingrained, it is notoriously difficult for adults to learn a new language; we subconsciously try to force our old models onto the process. Conversely, children who are raised bilingually have fewer of these issues: they are being “trained” on a more balanced, less biased dataset.
We’ve seen the consequences of inaccurate and biased training data time and time again in the industry. Consider Zillow’s misguided home-buying algorithm, whose home value estimates proved far less accurate than assumed, causing the company to overpay for properties, or the racial bias that crept into the COMPAS recidivism algorithm.
The Importance of Data Annotation
Many algorithmic problems that occur are due to poor data quality. One way in which data quality can be improved for ML algorithms is through data annotation, the process of tagging data with certain properties or characteristics. For example, an archive of photos of fruits could be manually annotated as apple, pear, watermelon, and so on, giving an algorithm a head-start when it comes to predicting the identity of other, unlabeled objects. Data annotation can be a tedious, manual chore, but as datasets become very large and complex, annotation can become crucial.
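To make the idea concrete, here is a minimal sketch of what annotated records might look like for the fruit example. The field names and labels are purely illustrative, not a specific annotation schema; real schemas vary by tool and task (bounding boxes, polygons, text spans, and so on).

```python
# Hypothetical image-classification annotations: each record pairs a raw
# data item (an image path) with the label an annotator assigned to it.
annotations = [
    {"image": "images/0001.jpg", "label": "apple"},
    {"image": "images/0002.jpg", "label": "pear"},
    {"image": "images/0003.jpg", "label": "watermelon"},
]

# A model trained on these (image, label) pairs learns to predict labels
# for new, unlabeled images.
for record in annotations:
    print(record["image"], "->", record["label"])
```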
The consequences of poorly annotated data can be not only annoying but also costly, as models must be redesigned, retrained, and run over and over again. In some cases, the consequences can be devastating if an organization makes decisions based on a badly trained ML model, such as a medical algorithm designed to spot signs of cancer in an x-ray.
How Good Is Your Training Data?
So, let’s say your data is already annotated. Is it any good? How can you find out?
Start by assessing the performance of your model. It’s often straightforward to discern whether results are in line with expectations. If your fruit-detection algorithm is coming up with nothing but bananas, you might have a data problem, and some deeper investigation is probably in order.
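One rough first check along these lines is to look at the distribution of predicted classes: if a single class dominates far beyond its expected share, that is often a hint of a data problem. Below is a minimal sketch, assuming you already have a list of predicted labels from your model; the labels shown are hypothetical.

```python
from collections import Counter

# Hypothetical predictions from a fruit-detection model.
predictions = ["banana", "banana", "banana", "apple", "banana", "banana"]

counts = Counter(predictions)
total = sum(counts.values())

# If one class accounts for nearly all predictions, dig into the training data.
for label, count in counts.most_common():
    print(f"{label}: {count / total:.0%}")
```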
A good next step (either way) is to take a random sample from your dataset and meticulously audit its annotations. How accurate are the annotations? Are annotations missing? This kind of effort can give you a good sense of how accurate your data will be for a specific task as well as a look at the overall health of the project – assuming the sample you took is truly random. If you are looking for more automated ways of debugging your dataset, tools like Scale Nucleus can help identify errors and gaps in your existing data.
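One simple way to run such an audit is to draw a uniform random sample, review each sampled item by hand, and record whether its annotation is correct or missing; the fraction of correct labels in the sample then serves as a rough estimate of overall annotation accuracy. The sketch below assumes the hypothetical record format from earlier and adds a manually assigned `is_correct` flag.

```python
import random

def sample_for_audit(annotations, sample_size=100, seed=42):
    """Draw a reproducible random sample of annotated items to review by hand."""
    rng = random.Random(seed)
    return rng.sample(annotations, min(sample_size, len(annotations)))

def estimate_accuracy(audited):
    """audited: records with a manually assigned 'is_correct' flag."""
    correct = sum(1 for item in audited if item["is_correct"])
    return correct / len(audited)

# After reviewing each sampled item and marking it correct or not:
audited = [
    {"image": "images/0001.jpg", "label": "apple", "is_correct": True},
    {"image": "images/0042.jpg", "label": "pear", "is_correct": False},
]
print(f"Estimated annotation accuracy: {estimate_accuracy(audited):.0%}")
```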
How to Improve Data Annotation Quality
If you’ve uncovered a problem with the quality of your data annotation, you can remediate the problem via a number of tactics, including:
- Make sure your annotation instructions are well understood. Are annotators incorrectly including the stem of the fruit, or the container the apples are in, as part of the annotation? If they aren’t following instructions precisely, your annotation efforts will be wasted – and your data may be worse off when the task is finished than when you began. Start slowly by giving annotators a small number of easier tasks, allow them to ask questions, and ensure they understand what makes for a high-quality annotation. As you work through the training process, revise your instructions to be as clear and precise as possible, which will help the next annotator you onboard.
- Add a review cycle. Add a second layer of personnel responsible for overseeing the first round of annotators. (This second layer of reviewers can be chosen from those annotators who have proven themselves adept at high-quality annotation.) Reviewers do not create new annotations from scratch; rather, they monitor annotation work, correct errors as they are spotted, and add annotations that were missed earlier in the process. This extra layer of checks can increase the overall quality of your data considerably.
- Try a consensus pipeline. One solid quality tactic is to have multiple people annotate the same data, then use consensus to determine which annotation is correct. If four people tag a fruit as “apple” and one tags it as “orange,” it’s highly likely that the four apple tags are correct and the orange tag can be discarded. If the pipeline also weighs the overall accuracy of each individual annotator, you can build even higher confidence in your data quality (a sketch of this appears after this list).
- Add a quality screen for annotators. If annotators want to work for you, put them through a quality test first. They have to achieve a set accuracy – 99 percent, perhaps – or else they aren’t allowed to start working on your queue. This step saves you from having to rigorously monitor quality and ensures that you’re starting with high-quality annotations.
- Add evaluation tasks throughout the queue and evaluate your annotated data against this benchmark. Evaluation tasks are tasks with known correct answers. They serve as a kind of answer key against which annotators’ work can be checked, allowing you to gauge the quality of your data without manually reviewing every item in the dataset. When data is being annotated, each annotator should be directed to complete the same set of benchmark tasks, giving you an ongoing read on your dataset’s average quality level. By periodically benchmarking your data, you can ensure nothing has gone off the rails and that annotators are still on target, even after the initial training period has passed. If annotators’ evaluation scores are dropping, you may want to provide more in-depth oversight and training (a sketch of such a benchmark check also appears after this list).
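As a sketch of the consensus idea above: collect every annotator's label for an item, take the majority vote, and optionally weight each vote by that annotator's historical accuracy. The weighting scheme shown here is one simple choice among many, not a prescribed method, and the annotator IDs and accuracy values are hypothetical.

```python
from collections import defaultdict

def consensus_label(votes, annotator_accuracy=None):
    """
    votes: dict mapping annotator id -> label for a single item.
    annotator_accuracy: optional dict mapping annotator id -> historical
    accuracy, used as a vote weight; unweighted majority vote if omitted.
    """
    scores = defaultdict(float)
    for annotator, label in votes.items():
        weight = annotator_accuracy.get(annotator, 1.0) if annotator_accuracy else 1.0
        scores[label] += weight
    return max(scores, key=scores.get)

# Four annotators say "apple", one says "orange"; the majority wins.
votes = {"a1": "apple", "a2": "apple", "a3": "apple", "a4": "apple", "a5": "orange"}
print(consensus_label(votes))  # -> "apple"

# Weighted by each annotator's historical accuracy.
accuracy = {"a1": 0.98, "a2": 0.95, "a3": 0.97, "a4": 0.96, "a5": 0.60}
print(consensus_label(votes, accuracy))  # -> "apple"
```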
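And a sketch of scoring annotators against evaluation tasks with known answers: compare each annotator's responses on the benchmark items to the answer key and track the score over time. The data layout and the 95 percent threshold are illustrative assumptions, not recommended values.

```python
def benchmark_score(responses, answer_key):
    """
    responses: dict mapping task id -> label the annotator gave.
    answer_key: dict mapping task id -> known correct label.
    Returns the fraction of benchmark tasks answered correctly.
    """
    graded = [responses[task] == correct
              for task, correct in answer_key.items()
              if task in responses]
    return sum(graded) / len(graded) if graded else 0.0

answer_key = {"t1": "apple", "t2": "pear", "t3": "watermelon"}
responses = {"t1": "apple", "t2": "pear", "t3": "apple"}

score = benchmark_score(responses, answer_key)
print(f"Benchmark accuracy: {score:.0%}")
if score < 0.95:  # example threshold only
    print("Consider additional training or closer review for this annotator.")
```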
To get your data annotated to a high standard, you can either leverage your own team of annotators or have Scale’s team of trained annotators do it for you.
- Scale Studio is a data annotation platform for you to use with your own workforce, and it comes with all the tools to implement the best practices listed above, learned from providing billions of high-quality annotations for our customers. If you are training and managing your own team of annotators, Studio provides the annotation tools and the management systems to achieve a high-quality dataset.
- Scale Rapid allows you to send your raw data to Scale, and we’ll return high-quality data quickly using the best practices listed above. Simply define the parameters of your project, and send your data out for annotation. There’s no need to reinvent the wheel when you can simply pay as you go and receive production-level data in hours instead of weeks.