Glen Wise and Brian Fishman

Cinder AI Blog Series Part Three: LLMs and Content Moderation

OpenAI has publicly disclosed its innovative effort to train a content moderation classifier on data labeled by GPT-4 and benchmarked against human judgments. The system still depends foundationally on human data labeling, but promises faster iteration, greater accuracy, and higher-scale automated content moderation in the future. Cinder is proud to support this revolution and believes that automated content moderation, despite its important limitations, will be critical to the next generation of Trust & Safety. 

Cinder’s Approach to AI Moderation

At Cinder’s founding, we considered whether to build foundational AI models - and decided against it.

On the one hand, we thought that it was incredibly important to build AI that would incorporate platform-specific policies. We had seen the utility of such systems at Meta.

On the other hand, we were convinced that companies like OpenAI, Meta, Google, and Anthropic - with access to massive amounts of data - would build better models than we could hope to train. We also believed that the power of these systems would best be unleashed by integrating them into a holistic data platform that manages all aspects of Trust & Safety.

So, instead of building foundational models, we focused on a core operational platform designed to facilitate AI development and use. For example:

  • We built simple mechanisms to construct Golden Sets, enabling partners to benchmark decisions made by either humans or artificial intelligence. 
  • We built our policy management system to enable the complex instructions that LLMs like GPT-4 thrive on. 
  • We directly integrated our policy management system with review queues so that policy teams can update decision trees in minutes, not months. 
  • And we built multiple ways to get decision data out of Cinder so that partners can use it however they see fit: via real-time Webhook, API, or simple CSV files. 
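As a hedged illustration of the first bullet, benchmarking automated labels against a Golden Set can be as simple as measuring agreement with authoritative human decisions. The data shapes, labels, and metric below are assumptions for the sketch, not Cinder's actual schema:

```python
# Hypothetical sketch: benchmarking model labels against a Golden Set of
# human-reviewed decisions. IDs and label names are illustrative assumptions.

def agreement_rate(golden_set, model_labels):
    """Fraction of Golden Set items where the model matches the human label."""
    if not golden_set:
        return 0.0
    matches = sum(
        1 for item_id, human_label in golden_set.items()
        if model_labels.get(item_id) == human_label
    )
    return matches / len(golden_set)

golden_set = {"post-1": "hate_speech", "post-2": "benign", "post-3": "spam"}
model_labels = {"post-1": "hate_speech", "post-2": "benign", "post-3": "benign"}

print(agreement_rate(golden_set, model_labels))  # 2 of 3 match
```

Tracking this number over time, per policy area, is one way to know when a model update helps or hurts.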

AI is core to the future of Trust & Safety, but it is not the entire story. OpenAI itself highlighted the limitations:

“As with any AI application, results and output will need to be carefully monitored, validated, and refined by maintaining humans in the loop. By reducing human involvement in some parts of the moderation process that can be handled by language models, human resources can be more focused on addressing the complex edge cases most needed for policy refinement.”
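The human-in-the-loop pattern described in that quote is often implemented as confidence-based triage: high-confidence model decisions are actioned automatically, while everything else routes to a human review queue. A minimal sketch, with the threshold and record shape as assumptions:

```python
# Hypothetical sketch of confidence-based triage: route low-confidence model
# decisions to a human queue. The threshold and fields are assumptions.

CONFIDENCE_THRESHOLD = 0.9

def triage(decisions, threshold=CONFIDENCE_THRESHOLD):
    """Split model decisions into auto-actioned items and a human review queue."""
    auto, human_queue = [], []
    for decision in decisions:
        if decision["confidence"] >= threshold:
            auto.append(decision)
        else:
            human_queue.append(decision)
    return auto, human_queue

decisions = [
    {"id": "c1", "label": "spam", "confidence": 0.98},
    {"id": "c2", "label": "harassment", "confidence": 0.62},
]
auto, human_queue = triage(decisions)
print([d["id"] for d in auto], [d["id"] for d in human_queue])
```

The interesting operational questions sit outside this sketch: who tunes the threshold, per policy, and how the human queue's decisions feed back into the Golden Set.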

We agree that human monitoring will be key to production deployment of AI for Trust & Safety and note several related details:

First, content moderation does not constitute all of Trust & Safety. Artificial intelligence and LLMs can be applied to other arenas, including case management, investigations, and law enforcement response. 

Second, Trust & Safety is fundamentally adversarial and iterative. Malicious actors will probe defenses and endeavor to adversarially counter AI-driven moderation just as they do human processes. A prime value of AI-driven automation is faster defensive iteration, but such updates will require constant benchmarking against authoritative human decisions. 

Third, the Trust & Safety decision spectrum includes choices far more complex than labeling individual pieces of content. We see a future where practices like training moderation APIs with more sophisticated LLMs can handle such reviews reasonably well. However, it is unclear how well such classifiers will perform when assessing behavioral dynamics or the interaction between multiple actors. 

Next Steps

If AI is one great trend in Trust & Safety, another is increased regulation.

Even as automated systems promise improved accuracy at scale, the Digital Services Act requires companies to facilitate user and reporter appeals, provide clear statements of reasons for every content moderation decision, and explain every automated decision to affected users. OpenAI describes work that may address some of these demands in the future, but it is not at all clear how regulators will approach decisions made by AI, even if that AI can explain its “reasoning” in natural language. 
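For illustration, the DSA's statement-of-reasons obligation can be thought of as a structured record attached to each decision, checked for completeness before the decision ships. The field names below are assumptions for the sketch, not a compliance template:

```python
# Illustrative sketch of a statement-of-reasons record for an automated
# moderation decision. Field names are assumptions; among other things, the
# DSA requires the ground for the decision, whether automated means were
# used, and information about redress.

REQUIRED_FIELDS = {"content_id", "action", "ground",
                   "automated_detection", "automated_decision", "redress"}

def is_complete(statement):
    """Check that a statement-of-reasons record carries every required field."""
    return REQUIRED_FIELDS.issubset(statement)

statement = {
    "content_id": "post-123",
    "action": "removal",
    "ground": "Terms of Service: hate speech policy",
    "automated_detection": True,
    "automated_decision": True,
    "explanation": "Model-generated rationale, pending human validation",
    "redress": "Internal appeal available via in-product form",
}

print(is_complete(statement))  # True
```

The hard part is not emitting the record but guaranteeing the `explanation` field is accurate when the underlying "reasoning" came from a model.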

Cinder was built to manage this ambiguity because it is a full-fledged data platform that can easily be configured and reconfigured in response to adversarial shifts and an evolving regulatory environment.

Trust & Safety is simply too complex to try to solve one problem without considering the effect that will have on other problems.

We built Cinder to facilitate effective choices and robust collaboration across the decision spectrum in every unit of a Trust & Safety enterprise. Cinder will continue to support teams building AI models for content moderation - and is developing a series of tools to deploy, manage, and monitor those models at scale. Efficient, effective, and secure Trust & Safety enterprises should operate holistically - not piecemeal. 

Still, it is clear that tools like GPT-4 and LLM-driven content moderation APIs will dramatically change Trust & Safety. The regulatory approach to such work is currently ambiguous, but we expect that regulators will ultimately permit it. We are building Cinder with the expectation that AI-empowered moderation is part of the future, and that regulatory demands for transparency and explanation will mature, growing both clearer and more demanding over time.  

At the end of the day, Trust & Safety is about making decisions of varying complexity accurately, efficiently, and at scale. That means organizing your data correctly, empowering every team and every decision maker at every level of your Trust & Safety enterprise, and doing so in a way that puts you in command of all of your data. If you do that, the specifics - whether they be utilizing new AI tools or adapting to emergent regulation - get a lot easier.