Test and Evaluation Vision

Building Trust in AI: Our vision for testing and evaluation


    Background & Motivation

    Over the past year, large language models have quickly risen to dominate the public consciousness and discourse, ushering in a wave of AI optimism and possibility, and upending our world. The applications of this technology are endless—from automation across professional services to augmentation of medical knowledge, and from personal companionship to national security. And the rate of technological progress in the field is not slowing down.

    These newly unlocked possibilities undoubtedly represent a positive development for the world. Their impact will touch the lives of billions of people and unlock step-function advancements across every industry, with potentially greater implications for the future of our world than even the internet. But they are also not without their risks. At its most extreme, AI has the potential to strengthen global democracies and democratic economies, or to become the decisive instrument that tightens the grip of authoritarianism. As Anthropic wrote in a July 2023 announcement, “in summary, working with experts, we found that models might soon present risks to national security, if unmitigated.”

    At Scale, our mission is to accelerate the development of AI applications. We have done this by partnering deeply with some of the most technologically advanced AI teams in the world on their most difficult and important data problems—from autonomous vehicle developers to the LLM companies, like OpenAI, now leading the charge. From that experience, we know that to accelerate the adoption of LLMs as effectively as possible and mitigate the kinds of risks that could set back progress, it is paramount that we adopt proper safety guardrails and develop rigorous evaluation frameworks.

    Here, we outline our vision for what an effective and comprehensive test & evaluation (“T&E”) regime for these models should look like moving forward, how it leverages human experts, and how we aim to help serve this need with our new Scale LLM Test & Evaluation offering.

    Understanding the T&E Surface Area

    Defining Quality for an LLM

    Unlike other forms of AI, language generation poses uniquely difficult challenges when it comes to objective, quantitative assessment of quality, given the inherently subjective nature of language.

    While there are important quantitative scoring mechanisms by which language can be assessed, and there has been meaningful and important progress in the field of automated language evaluation, the best holistic measures of quality still require assessment by human experts, particularly those with strong writing skills and relevant domain experience.

    When we discuss test & evaluation, what do we really mean? There are broadly five axes of “ability” for a language model, and what T&E seeks to enable is effective adjudication of quality against these axes. The axes are:

    1. Instruction Following—meaning: how well does the model understand what is being asked of it?

    2. Creativity—meaning: within the context of the model’s design constraints and the prompt instructions it is given, what is the creative (subjective) quality of its generation?

    3. Responsibility—meaning: how well does the model adhere to its design constraints (e.g. bias avoidance, toxicity)?

    4. Reasoning—meaning: how well does the model reason and conduct complex analyses that are logically sound?

    5. Factuality—meaning: how factual are the results from the model, and to what extent are they hallucinations?

    When viewing a model through this framework, we group evaluations against these axes into two categories: those that assess capabilities, or “helpfulness,” and those that assess safety, or “harmlessness.”
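
    To make these axes concrete, the sketch below shows one way a human rater’s judgments could be recorded and rolled up into the helpfulness and harmlessness groupings described above. The assignment of each axis to a group, the 1-to-5 rating scale, and the class names are illustrative assumptions rather than a fixed standard.

```python
from dataclasses import dataclass, field

# The five ability axes described above, grouped into "helpfulness"
# (capabilities) and "harmlessness" (safety). This particular mapping
# and the 1-to-5 scale are illustrative assumptions, not a standard.
AXES = {
    "instruction_following": "helpfulness",
    "creativity": "helpfulness",
    "reasoning": "helpfulness",
    "factuality": "helpfulness",
    "responsibility": "harmlessness",
}

@dataclass
class AxisRating:
    axis: str        # one of the keys in AXES
    score: int       # 1 (poor) to 5 (excellent), assigned by a human expert
    rationale: str   # brief qualitative justification from the rater

@dataclass
class ResponseEvaluation:
    prompt: str
    response: str
    ratings: list[AxisRating] = field(default_factory=list)

    def group_averages(self) -> dict[str, float]:
        """Average the expert scores per group (helpfulness / harmlessness)."""
        buckets: dict[str, list[int]] = {}
        for r in self.ratings:
            buckets.setdefault(AXES[r.axis], []).append(r.score)
        return {group: sum(v) / len(v) for group, v in buckets.items()}
```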

    An effective T&E regime should address all of these axes. Conceptually, at Scale we do this by breaking the question up into model evaluation, model monitoring, and red teaming. We envision these as continual, ongoing processes with periodic spikes ahead of major releases and user rollouts, which serve to track model development and mitigate drift over time.

    Model Evaluation

    Model evaluation (“model eval”), as conducted through a combination of humans and automated checks, serves to assess model capability and helpfulness over time. This type of evaluation consists of a few elements:

    • Version Control and Regression Testing: Conducted regularly, aligned with the deployment schedule of each new model version, to compare versions and catch regressions. A minimal sketch of such a harness follows this list.

    • Exploratory Evaluation: Periodic evaluation, conducted by experts at major model checkpoints, of the strengths and weaknesses of a model across various areas of ability, based on embedding maps. Culminates in a report card on model performance, with an accompanying qualitative overlay.

    • Model Certification: Once a model is nearing a new major release, model certification consists of a battery of standard tests conducted to ensure minimum satisfactory achievement of some pre-established performance standard. This can be a score against an academic benchmark, an industry-specific test, or a separate regulatory-dictated standard.
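
    As a concrete illustration of the regression testing element above, the following is a minimal sketch of a harness that compares a candidate model version against a baseline on a static prompt set. The generate function, the per-prompt must_include / must_avoid fields, and the JSON file format are assumptions standing in for whatever inference API and automated checks a team actually uses; subjective axes such as creativity would still require human review.

```python
import json

# Hypothetical inference call; stands in for whatever API serves the model.
def generate(model_version: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your model serving endpoint")

def check_response(case: dict, response: str) -> bool:
    """Automated check: the response must contain every required phrase
    and none of the forbidden ones listed alongside the prompt."""
    required = all(p.lower() in response.lower() for p in case.get("must_include", []))
    forbidden = any(p.lower() in response.lower() for p in case.get("must_avoid", []))
    return required and not forbidden

def regression_test(baseline: str, candidate: str, prompt_file: str) -> dict:
    """Compare a candidate model version against a baseline on a static prompt set."""
    with open(prompt_file) as f:
        cases = json.load(f)   # list of {"prompt", "must_include", "must_avoid"}
    results = {"baseline_pass": 0, "candidate_pass": 0, "regressions": []}
    for case in cases:
        base_ok = check_response(case, generate(baseline, case["prompt"]))
        cand_ok = check_response(case, generate(candidate, case["prompt"]))
        results["baseline_pass"] += base_ok
        results["candidate_pass"] += cand_ok
        if base_ok and not cand_ok:   # passed on the old version, fails on the new one
            results["regressions"].append(case["prompt"])
    return results
```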

    Model Monitoring

    In addition to periodic model eval, a reliable T&E system requires continuous model monitoring in the wild, to ensure that users are experiencing performance in line with expectations. To do this passively and constantly, monitoring relies on automated review of all, or a rolling sample, of model responses. When anomalous or problematic responses are detected, they can then be escalated to an expert human reviewer for adjudication and incorporated into future testing datasets to prevent the issue from recurring.

    A T&E monitoring system of this variety should be deeply embedded with reliability and uptime monitoring, as continuous evaluation of model instruction following, creativity, and responsibility becomes a new element of traditional health checks on service performance.
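
    One way such a monitoring loop could be structured is sketched below: sample a rolling fraction of production responses, score them with an automated evaluator, and escalate low-scoring cases to a human expert while folding them into future test sets. The automated_score function, sample rate, and threshold are placeholders, not a prescribed configuration.

```python
import random
from collections import deque

REVIEW_QUEUE: deque = deque()      # flagged cases awaiting a human expert
FUTURE_TEST_SET: list[dict] = []   # flagged cases folded into later T&E datasets

def automated_score(prompt: str, response: str) -> float:
    """Placeholder for an automated evaluator (toxicity classifier,
    factuality checker, etc.). Returns a quality score in [0, 1]."""
    raise NotImplementedError

def monitor(prompt: str, response: str,
            sample_rate: float = 0.05, threshold: float = 0.5) -> None:
    """Passively review a rolling sample of production traffic."""
    if random.random() > sample_rate:
        return                                   # not part of this sample
    score = automated_score(prompt, response)
    if score < threshold:                        # anomalous or problematic response
        case = {"prompt": prompt, "response": response, "score": score}
        REVIEW_QUEUE.append(case)                # escalate to a human expert reviewer
        FUTURE_TEST_SET.append(case)             # track in future testing datasets
```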

    Red Teaming

    Finally, while point-in-time capability evaluation and monitoring are important, they are insufficient on their own to ensure that models are well-aligned and safe to use. Red teaming consists of in-depth, iterative targeting, by automated methods and human experts, of specific harms and techniques where a model may be weak, in order to elicit undesirable behavior. This behavior can then be cataloged, added to an adversarial test set for future tracking, and patched via additional tuning. Red teaming exists as a way to assess model vulnerabilities and biases, and to protect against harmful exploits.
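
    The catalog-and-patch cycle described above can be made concrete with a small sketch like the one below, which records successful attacks in an adversarial test set and replays them against later model versions. The technique and harm-type labels, along with the generate and elicits_harm callables, are illustrative stand-ins for a real taxonomy and for expert (or automated) adjudication.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

@dataclass
class AdversarialCase:
    prompt: str
    technique: str       # e.g. "role-play", "prompt injection" (illustrative labels)
    harm_type: str       # e.g. "biorisk", "cybersecurity", "misinformation"
    discovered: date
    model_version: str   # version against which the attack first succeeded

@dataclass
class AdversarialTestSet:
    cases: list[AdversarialCase] = field(default_factory=list)

    def add(self, case: AdversarialCase) -> None:
        """Catalog a newly discovered attack for future tracking."""
        self.cases.append(case)

    def replay(self, generate: Callable[[str, str], str],
               elicits_harm: Callable[[str], bool],
               model_version: str) -> list[AdversarialCase]:
        """Re-run every cataloged attack against a new model version and
        return the cases that still elicit undesirable behavior."""
        return [c for c in self.cases
                if elicits_harm(generate(model_version, c.prompt))]
```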

    Effective expert red teaming requires diversity and comprehensiveness across all possible harm types and techniques, a robust taxonomy to understand the threat surface area, and human experts with deep knowledge of both domain subject matter and red teaming approaches. It also requires a dynamic assessment process, rather than a static one, such that expert red teamers can evolve their adversarial approaches based on what they’re seeing from the model. As outlined in OpenAI’s March 2023 GPT-4 System Card:

    Our approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as we go. It is also iterative in the sense that we use multiple rounds of red teaming as we incorporate new layers of mitigation and control, conduct testing and refining, and repeat this process.

    The types of harms one would look for are varied, but include cybersecurity vulnerabilities, nuclear risks, biorisk, and consumer dis/misinformation, as well as any technique that may elicit them.

    Another factor to consider in red teaming is the expertise and trustworthiness of the humans involved. As Google published in a July 2023 LLM Red Teaming report:

    Traditional red teams are a good starting point, but attacks on AI systems quickly become complex, and will benefit from AI subject matter expertise. When feasible, we encourage Red Teams to team up with both security and AI subject matter experts for realistic end-to-end adversarial simulations.

    Anthropic echoed this in their own July 2023 announcement, writing:

    Frontier threats red teaming requires investing significant effort to uncover underlying model capabilities. The most important starting point for us has been working with domain experts with decades of experience [...] However, one challenge is that this information is likely to be sensitive. Therefore, this kind of red teaming requires partnerships with trusted third parties and strong information security protections.

    Because red teamers are often given access to pre-release, unaligned models, these expert individuals must be extremely trustworthy, from both a safety and confidentiality standpoint.

    Helpfulness vs. Harmlessness

    Empirically, when optimizing a model, there exists a tradeoff between helpfulness and harmlessness, which the model developer community has openly recognized. The Llama 2 paper describes the way Meta’s team has chosen to grapple with this, which is by training two separate reward models—one optimized for helpfulness (“Helpfulness RM”) and another optimized for safety (“Safety RM”). The paper’s plots demonstrate the potential for disagreement between these reward models.

    Because this tradeoff between helpfulness and harmlessness exists, the desired landing point on the spectrum is a function of the use case and audience for the model. An educational model designed to serve as a chatbot for children doing their homework may land in a very different place on this spectrum than a model designed for military planning. For T&E, that means that assessing model quality is contextual, and requires an understanding of the desired end use and risk tolerance.
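
    One simple way to make this use-case dependence concrete is to blend the two reward signals with a weight chosen for the deployment context, as in the sketch below. This weighted blend is not Meta’s actual combination rule, and the example profiles and weights are assumptions for illustration only; the point is that shifting a single risk-tolerance parameter moves the operating point along the spectrum.

```python
def combined_score(helpfulness_rm: float, safety_rm: float,
                   safety_weight: float) -> float:
    """Blend two reward-model scores (each assumed to lie in [0, 1]).
    A higher safety_weight moves the operating point toward harmlessness."""
    return (1 - safety_weight) * helpfulness_rm + safety_weight * safety_rm

# Illustrative deployment profiles; the weights are assumptions, not standards.
PROFILES = {
    "childrens_homework_chatbot": 0.9,   # strongly favor harmlessness
    "general_purpose_assistant": 0.5,    # balanced
    "military_planning_tool": 0.3,       # favor capability, within hard safety limits
}

def score_for_use_case(helpfulness_rm: float, safety_rm: float, use_case: str) -> float:
    """Pick the blend weight from the deployment profile for this use case."""
    return combined_score(helpfulness_rm, safety_rm, PROFILES[use_case])
```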

    Vision for the T&E Ecosystem

    With this shared understanding of what goes into effective test & evaluation of models, the question becomes: what is the optimal paradigm by which T&E should be institutionally implemented?

    We view this as a question of mapping the necessary ecosystem components and interaction dynamics across four institutional stakeholder groups:

    1. Frontier model developers, who innovate on the technological cutting edge of model capabilities

    2. Government, which is responsible for regulating the models’ use and development by all, and which also uses models on its own behalf

    3. Enterprises and organizations seeking to deploy the models for their own use

    4. Third party organizations which service the aforementioned three stakeholder groups, and support the ecosystem, via either commercial services or nonprofit work

    Making sure that these players work harmoniously, toward democratic values, and in alignment with the greater social good, is paramount. This ecosystem is represented in the graphic below.

    The Frontier Model Developers

    The role of the frontier model developers in the broader T&E ecosystem is to advance the state of the technology and push the bounds of AI’s potential, subject to safeguards and downside protection. These are the players which develop new models, test them internally, and provide them to consumer and/or organizational end users.

    Doing this safely starts by ensuring that each new model version is subject to regression testing as developers iterate on improvements. This is best done via a static set of test prompts across known areas. At major model checkpoints, developers will launch exploratory evaluations to gain a more comprehensive and thorough understanding of their model’s strengths and weaknesses, which includes targeted red teaming from experts. Finally, once a model is ready for release, model developers will launch certification tests, which are standardized across various categories of risk or end use (e.g. bias, toxicity, legal or medical advice), yielding fewer in-depth insights but resulting in an overall report card of model performance.

    In order to ensure that all model developers benefit from shared learnings, there should also exist an opt-in red teaming pooling network for model developers, facilitated by a third party, which conducts red teaming across all models, aggregates red teaming results from internal teams at the model developers (and the public, where applicable), and alerts each participating developer of any novel model vulnerabilities. This is valuable because research has demonstrated that these vulnerabilities may at times be shared across models from different developers (see “Universal and Transferable Adversarial Attacks on Aligned Language Models”). At the red teaming expert level, this model should compensate participants on the basis of the value they discover and contribute, much like traditional software bug bounty programs.

    Government

    The role of government in the T&E ecosystem is twofold:

    1. Establishment of clear guidelines and regulations, on a use case basis, for model development and deployment by enterprises and consumers

    2. Establishment and adoption of standards on the use of frontier models within the government itself

    The more important of these two roles is the former, as a regulator and enforcer of standards. Debates have been ongoing of late as to how best to regulate AI as a category, and the manner by which legislators should seek to balance the macro version of the helpfulness vs. harmlessness tradeoff—that is, adopting more restrictive legislation which seeks to avoid all potential harms, vs. lighter guardrails which optimize for technological and economic progress.

    We believe that proper risk-based test & evaluation prior to deployment should represent a key cornerstone for any legislative structure around AI, as it remains the best safety mechanism we have for production AI systems. It is also important to remember that determining a reasonable risk tolerance for large language models depends significantly on the intended use case, and it is for that reason that legislatively centralizing novel standards and their enforcement for AI beyond general frameworks is extremely difficult. However, we should absolutely leverage our existing federal agencies, each with valuable domain specific knowledge, as forces for regulating the testing, evaluation, and certification of these models at a hyper-specific, use case level, where risk level can be appropriately and thoughtfully factored in.

    There should consequently exist a wide variety of new model certification regulatory standards, industry by industry, which government helps craft in order to ensure the safety and efficacy of model use by enterprises and the public. 

    Separately, as the US Federal Government and its approximately 3 million employees adopt many of these new frontier models themselves, they will simultaneously need to adopt T&E mechanisms to ensure responsible, fair, and performant usage. These will largely overlap with the mechanisms employed by enterprises as described below, but with some notable differences on the basis of domain—e.g. the Department of Defense will need to leverage T&E systems to ensure adherence to its Ethical AI Principles, or any comparable standards released in the future, and will need to address unique concerns such as the leakage of classified information.

    In many cases, to keep up with the pace of innovation, effective operational T&E within the government will require contracting with a third party expert organization. This is precisely why Scale is proud to serve our men and women in all corners of government via cutting-edge LLM test & evaluation solutions developed alongside frontier model developers.

    Enterprises

    Enterprises are the conduit for the majority of end model usage, often for uses and extensions unforeseen by the original developers. Their role in the T&E ecosystem, beyond the work done by the model developers, is therefore equally important.

    As enterprises leverage their proprietary data, domain expertise, use cases, and workflows to implement AI applications both internally and for their customers and users, there needs to be constant production performance monitoring. This monitoring should allow for escalation to human expert reviewers when automated checks flag examples that are outliers relative to existing T&E datasets.
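
    A minimal sketch of that escalation trigger, assuming an embedding-based notion of “outlier”: flag production examples whose embeddings sit far from everything in the existing T&E datasets and route them to human reviewers. The embed function and the distance threshold are placeholders for whatever embedding model and calibration a deployment actually uses.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for whatever text-embedding model the deployment uses."""
    raise NotImplementedError

class OutlierEscalator:
    def __init__(self, te_dataset_texts: list[str], threshold: float = 0.35):
        # Embed the existing T&E dataset once, up front.
        self.reference = np.stack([embed(t) for t in te_dataset_texts])
        self.threshold = threshold   # illustrative cosine-distance cutoff

    def needs_human_review(self, production_text: str) -> bool:
        """True if the example is an outlier relative to the existing T&E data."""
        v = embed(production_text)
        sims = self.reference @ v / (
            np.linalg.norm(self.reference, axis=1) * np.linalg.norm(v) + 1e-9)
        return (1.0 - sims.max()) > self.threshold   # far from every reference example
```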

    And finally, as enterprises start to become model developers themselves in a smaller way, by fine-tuning open source models (such as via Scale’s Open Source Model Customization Offering), they or the fine-tuning providers they work with will need to adopt many of the same T&E procedures as the frontier model developers, including model eval and expert red teaming.

    The notable difference for enterprise T&E will be the existence of industry- and use case-specific standards for model performance, which will be critical in ensuring responsible, fair, and performant use of these models in production. Certain enterprises will establish their own internal performance standards, but above and beyond that there need to exist standards on the models’ use enforced by regulatory bodies in the relevant domains, as discussed above. The achievement of these standards should be adjudicated on a regular cadence by a third party organization, and recognized by the granting and maintenance of official certifications, as is the case for certain information security certifications today.

    Third Party Organizations

    Within this model, the fourth and final group is the set of third party organizations which contribute to this ecosystem by supporting the aforementioned three classes of stakeholders. These encompass academic and research institutions, nonprofits and alliances, think tanks, and commercial companies which service this ecosystem. 

    Scale falls into this final group, as a provider of human- and synthetically-generated data and fine-tuning services, automated LLM evaluations and monitoring, and, most importantly, expert human LLM red teaming and evaluation, to developers and enterprises. Scale also acts as a third party provider for both model T&E and end user AI solutions to the many public sector departments, agencies, and organizations which we proudly serve.

    The roles of these parties may vary from policy thinking to sharing of industry best practices, and from providing infrastructure and expert support for the effective execution of model T&E to establishing and maintaining performance benchmarks. There will need to exist a diverse and robust set of organizations in order to properly support T&E.

    Working with Scale

    Today, we are excited to announce the early access launch of Scale LLM Test & Evaluation, a platform for comprehensive model monitoring, evaluation, and red teaming. We are proud to have helped pioneer many of these methods hand-in-hand with some of the brightest minds in the frontier model space like OpenAI, as well as government and leading enterprises, and we are ready to continue accelerating the development of responsible AI.

    You can find us at DEFCON 31 this year, where we are providing the T&E platform for the AI Village’s competitive Generative Red Team event as the White House’s evaluation platform of choice. You can also learn more about Scale LLM Test & Evaluation.