The explosion of off-the-shelf advanced Artificial Intelligence models is going to dramatically alter Trust & Safety operations. But such tools are not new. Trust & Safety teams have used machine learning models for years, and our experience with those tools suggests that the ambiguities associated with their error rates mean that the most impactful use cases often involve process improvements rather than simply allowing automated systems to make take-down / leave-up decisions.
Trust & Safety teams use automated systems for a range of reasons, including the extraordinary scale of digital media, the unreliability of human review teams at scale, and the cost associated with those processes. But using automation carries its own risks. Platforms worry about the overall accuracy of such tools, the risk that their errors will be wildly inexplicable, and the chance that the system will miss high-severity but low-prevalence violations.
These concerns are valid. Trust & Safety teams regularly find that well-tuned AI models are more accurate than human reviewers at large scale. But all AI systems have error rates and will always generate both false positives and false negatives. Falsely identifying something as violating may result in over-removals, leading to frustrated users, bad PR, and revenue impacts. Falsely leaving violating content in place has many of the same consequences and raises the risk of real-world harm. AI errors can also be hard to explain. Human errors may be egregious, but they often make sense: a misunderstood symbol, unrecognized satire, unfamiliar slang. By contrast, some AI mistakes can seem almost random, which can undermine trust in the system. There are similarities, though: just as AI systems are black boxes, you can never know for certain what a human reviewer was thinking when they mislabeled something. That is why oversight is so important in both cases.
AI should always respect platform policy, which can vary significantly based on platform architecture, user base, and strategy. This requires matching the AI to the specific platform and setting probability thresholds at which machine learning models may take certain actions. The challenge is that AI systems generally return a "confidence score" - usually a number between 0 and 1 - that indicates the system's relative certainty that the item violates a particular rule. The problem is that such scores often do not correspond neatly to the probability that the model's input actually violates the relevant policy. For example, the OpenAI content moderation API returns a "category_scores" field "denoting the model's confidence that the input violates the OpenAI's policy (sic) for the [policy] category." A higher score means the model is more confident that the content violates OpenAI's standard. Importantly, however, it is not clear exactly how much that likelihood increases as the category_score increases. As OpenAI explains, its confidence score (like that of many other models) does not represent a probability.
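To make this concrete, here is a minimal sketch of acting on a moderation response of the shape described above. The response dict is hard-coded to stand in for an API call, and the threshold values are arbitrary assumptions for illustration, not recommendations.

```python
# Illustrative routing on OpenAI-style moderation scores. The response
# shape mirrors the "category_scores" field described above; the category
# names, scores, and thresholds here are invented for this example.

sample_response = {
    "results": [{
        "flagged": True,
        "category_scores": {
            "harassment": 0.91,
            "hate": 0.12,
            "violence": 0.34,
        },
    }]
}

REMOVE_THRESHOLD = 0.90   # assumption: auto-action only at very high scores
REVIEW_THRESHOLD = 0.50   # assumption: mid-range scores go to a human

def route(category_scores: dict) -> str:
    """Map the highest category score to an enforcement action."""
    top_category, top_score = max(category_scores.items(), key=lambda kv: kv[1])
    if top_score >= REMOVE_THRESHOLD:
        return f"auto_remove:{top_category}"
    if top_score >= REVIEW_THRESHOLD:
        return f"human_review:{top_category}"
    return "no_action"

print(route(sample_response["results"][0]["category_scores"]))
```

Note that this treats the raw score as if it were a probability - exactly the shortcut the next section explains is unreliable without benchmarking.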
This disconnect between confidence scores and probabilities complicates things for Trust & Safety leaders. Policymakers generally want to determine actions based on the probability that an input to the classifier violates. For example, a policymaker might be comfortable removing content automatically if it is 95% likely to violate the policy. At an 80% probability of violation they might want to enqueue the content for human review. But AI models generally do not provide such probabilities consistently, only a confidence score. As a result, policymakers generally want additional benchmarking and/or will bias toward human oversight of their AI systems.
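One simple form that benchmarking can take is mapping raw scores to empirical violation rates on a labeled holdout set, by binning scores and measuring how often humans labeled each bin as violating. The data below is invented for illustration; real calibration sets would be far larger.

```python
# Minimal calibration sketch: turn raw confidence scores into empirical
# probabilities by binning a (hypothetical) human-labeled sample.

# (confidence_score, human_label) pairs; label 1 means "violating".
labeled = [
    (0.15, 0), (0.22, 0), (0.35, 0), (0.41, 1), (0.48, 0),
    (0.55, 1), (0.63, 0), (0.71, 1), (0.78, 1), (0.84, 1),
    (0.88, 1), (0.93, 1), (0.97, 1), (0.99, 1), (0.31, 0),
]

def calibration_table(pairs, n_bins=4):
    """Empirical violation rate per score bin (None for empty bins)."""
    bins = [[] for _ in range(n_bins)]
    for score, label in pairs:
        idx = min(int(score * n_bins), n_bins - 1)  # which quarter of [0, 1]
        bins[idx].append(label)
    return [sum(b) / len(b) if b else None for b in bins]

table = calibration_table(labeled)
# table[i] estimates P(violating) for scores falling in bin i, which a
# policymaker can then compare against their 95% / 80% action thresholds.
```

More sophisticated approaches (e.g., isotonic regression or Platt scaling) refine the same idea, but the binned table already shows why a raw score of 0.8 need not mean an 80% chance of violation.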
The good news is that classifiers can be useful for a variety of tasks beyond simply determining when to remove material from a platform - and some of that value can be captured even when the model’s results are less confident or the Trust & Safety leader does not trust the model.
A key value of classifiers is optimizing the reviews conducted by human beings. Classifiers can prioritize jobs based on criteria built into the model: the likely severity of the harm they represent, the confidence in a violation, or the likelihood the material will be distributed widely. Classifiers can also sort content to improve reviewer efficiency - for example, routing material by language or topic area - and cluster related material so it can be reviewed in a single job. The classifier will still make errors, but their impact is limited and they can be easily rectified by the human reviewers.
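A prioritization scheme along these lines can be sketched as a weighted score over classifier outputs. The field names, severity tiers, and weighting formula below are illustrative assumptions, not a production ranking function.

```python
# Hypothetical priority scoring for a human-review queue: weight model
# confidence by assumed harm severity and expected reach, then review
# the highest-scoring jobs first.

SEVERITY_WEIGHT = {"csam": 100, "violence": 10, "spam": 1}  # assumed tiers

def priority(job: dict) -> float:
    """Higher value = review sooner."""
    return (
        SEVERITY_WEIGHT.get(job["policy"], 1)
        * job["confidence"]                   # classifier confidence in a violation
        * (1 + job["expected_views"] / 1000)  # proxy for likely distribution
    )

queue = [
    {"id": "a", "policy": "spam",     "confidence": 0.99, "expected_views": 50},
    {"id": "b", "policy": "violence", "confidence": 0.70, "expected_views": 5000},
    {"id": "c", "policy": "csam",     "confidence": 0.40, "expected_views": 10},
]
queue.sort(key=priority, reverse=True)
print([job["id"] for job in queue])
```

Note how the weighting surfaces a low-confidence but high-severity item ahead of a near-certain but low-harm one - a classifier error here only reorders the queue, it never removes content on its own.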
Machine learning models can help optimize which jobs are reviewed - and they can improve the review process itself. At Cinder, we call this ‘Assisted Review.’ It includes features like indicators that highlight particular elements of a review that deserve special attention and automatically updating decision trees. The goal is to improve the speed and accuracy of reviews, while still leaving ultimate authority in the hands of a human decision maker. Importantly, such decisions produce both operational data and labeling that can be used to improve the machine learning model itself.
Trust & Safety teams operate at immense scale and high speed. In such circumstances, automated summaries can be very useful for facilitating good decision-making. Quickly summarizing complex sets of transactions, long message threads, or lengthy video or audio files can improve content reviews; summarizing sets of accounts and internal notes can improve internal communications and reports for law enforcement or broader transparency efforts.
Trust & Safety is fundamentally adversarial - and adversaries are smart. That often manifests as surreptitious account creation and harmful actors returning to a platform after being removed. Machine learning models, perhaps tied to natural language search mechanisms, can query large datasets and identify entities based on language or metadata patterns that might not be apparent, even to a trained investigator.
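The metadata-pattern idea can be illustrated with a toy grouping over account records: accounts that share a signal such as a signup IP or device fingerprint get clustered for investigator review. All field names and values below are invented; real linkage systems use many more signals and fuzzier matching.

```python
# Toy sketch of surfacing possibly-linked accounts by shared metadata.
from collections import defaultdict

# Invented sample records standing in for a real account dataset.
accounts = [
    {"id": "u1", "signup_ip": "10.0.0.5", "device_hash": "aaa"},
    {"id": "u2", "signup_ip": "10.0.0.5", "device_hash": "bbb"},
    {"id": "u3", "signup_ip": "10.9.9.9", "device_hash": "bbb"},
    {"id": "u4", "signup_ip": "10.1.1.1", "device_hash": "ccc"},
]

def linked_clusters(accounts, keys=("signup_ip", "device_hash")):
    """Group account ids that share any one metadata signal."""
    groups = defaultdict(set)
    for acct in accounts:
        for key in keys:
            groups[(key, acct[key])].add(acct["id"])
    # Only groups with more than one account suggest a link worth reviewing.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

print(linked_clusters(accounts))
```

In practice a machine learning model replaces the exact-match grouping with learned similarity, which is what lets it surface links that are not apparent even to a trained investigator.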
Just as machine learning can identify accounts that may be controlled by the same individual or network, such tools can also identify patterns that indicate accounts or networks are bots. Such mechanisms have been developed for years, first as complex rules engines and then with more technically sophisticated (though not always more effective) machine learning models. As the volume of automatically-generated content on platforms increases, it will put pressure on moderation teams to keep up. With a good enough bot classifier, teams may be willing to automatically remove content at a lower classifier confidence level for bot-created content than for content manually created by a real user.
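The two-tier threshold described above reduces to a small policy function. The specific threshold values are assumptions for illustration only.

```python
# Sketch of source-dependent action thresholds: bot-created content is
# auto-removed at a lower violation-classifier confidence than content
# from an apparent human account. Values are illustrative assumptions.

REMOVE_THRESHOLDS = {"bot": 0.70, "human": 0.95}

def should_auto_remove(violation_confidence: float, source: str) -> bool:
    """Apply a stricter bar before auto-actioning human-created content."""
    return violation_confidence >= REMOVE_THRESHOLDS[source]

# The same 0.80 score is auto-actioned for bots but held for humans.
print(should_auto_remove(0.80, "bot"), should_auto_remove(0.80, "human"))
```

The asymmetry reflects the lower cost of a false positive: wrongly removing a bot post frustrates no real user, while wrongly removing human speech does.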
Trust & Safety teams produce reports for internal and external customers. Briefings are escalated to senior decision makers; transparency reports are designed for the broader public; and specialized reports are produced for law enforcement and various regulatory agencies. The same summarization capabilities described above can accelerate drafting all of these, though each still warrants human review before release.
The substantive improvement of AI and its wide availability are going to change Trust & Safety dramatically, but human beings still must make key judgment calls about policy and must determine how to responsibly integrate these new tools into their overall systems. Human beings will continue to make critical decisions to shape platforms and protect people in the real world, especially regarding more complex investigations of adversarial actors. AI will replace and enhance some Trust & Safety functions, but it cannot replace all of them. Like seemingly every other company, at Cinder we are integrating these new tools into our product. But we also know that such tools are not truly novel and that they do not solve complex Trust & Safety challenges on their own.
The core challenge is to deploy a symbiotic system so that human beings have oversight over AI-driven decisions, human choices drive both operational action and improvements in the AI, and AI can facilitate better human decisions. All of this is possible, but it requires learning from the history of AI in Trust & Safety and thinking about these systems as critical elements in a rapidly expanding toolkit - not the entire toolkit.