Abstract
When an AI system decides whether a user’s message is safe, it typically makes a binary choice: toxic or not toxic. This all-or-nothing approach is fundamentally flawed: it forces platforms either to over-censor harmless conversations or to let genuinely dangerous content slip through. This paper argues that AI safety requires a third option, a MEDIUM-risk tier in which uncertain or context-dependent content is routed to human reviewers rather than decided by a machine alone. We build and evaluate a three-tiered risk classifier (LOW, MEDIUM, HIGH) that assigns each user prompt a calibrated confidence score, enabling graduated moderation: auto-approval for safe content, human review for ambiguous cases, and immediate blocking for clear threats. Tested on 6,000 prompts, the system achieves 85% accuracy, outperforming Google’s Perspective API, Detoxify, and Meta’s Llama Guard by up to 46 percentage points, while running fast enough for real-time use on ordinary CPU-only hardware. Crucially, adversarial testing reveals that the classifier catches 100% of explicit attacks but only 67% of subtle, implicit harm such as veiled self-harm language. This gap is precisely the point: no AI system, however sophisticated, can reliably interpret every shade of human intent. The MEDIUM tier exists to acknowledge this limitation honestly and to keep humans in the loop where it matters most. These findings carry a clear message for the future of AI safety: moving beyond binary moderation toward three-way risk classification with built-in human oversight is not merely an improvement; it is a necessity. With the EU Digital Services Act and the UK Online Safety Act now mandating proportionate moderation and meaningful human oversight, the MEDIUM tier offers a concrete, auditable pathway to regulatory compliance.
Keywords: Content moderation, large language models, risk classification, human oversight, AI safety
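To make the graduated-moderation policy described in the abstract concrete, the following is a minimal sketch of the three-way routing step; the tier names follow the paper, but the 0.40 and 0.80 risk cutoffs are hypothetical placeholders rather than the calibrated thresholds the system actually uses.

```python
# Illustrative sketch only: the 0.40 / 0.80 cutoffs are hypothetical
# placeholders, not the calibrated thresholds reported in the paper.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "auto-approve"
    MEDIUM = "route to human reviewer"
    HIGH = "block immediately"


@dataclass
class Decision:
    tier: Tier
    risk: float  # calibrated probability that the prompt is harmful


def route(risk: float,
          low_cutoff: float = 0.40,    # hypothetical MEDIUM boundary
          high_cutoff: float = 0.80) -> Decision:
    """Map a calibrated risk score to one of three moderation actions."""
    if risk >= high_cutoff:
        return Decision(Tier.HIGH, risk)
    if risk >= low_cutoff:
        # Uncertain or context-dependent: keep a human in the loop.
        return Decision(Tier.MEDIUM, risk)
    return Decision(Tier.LOW, risk)


# Example: an ambiguous prompt scoring 0.55 is escalated, not auto-decided.
print(route(0.55).tier.value)  # -> "route to human reviewer"
```

The design point this illustrates is that the MEDIUM band is an explicit output of the classifier, not a fallback: scores in the uncertain range are deliberately handed to human reviewers instead of being forced into a binary decision.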
