Abstract
When an AI system decides whether a user’s message is safe, it typically makes a binary choice: toxic or not toxic. This all-or-nothing approach is fundamentally flawed: it forces platforms either to over-censor harmless conversations or to let genuinely dangerous content slip through. This paper argues that AI safety requires a third option, a MEDIUM-risk tier in which uncertain or context-dependent content is routed to human reviewers rather than decided by a machine alone. We build and evaluate a three-tiered risk classifier (LOW, MEDIUM, HIGH) that assigns each user prompt a calibrated confidence score, enabling graduated moderation: auto-approval for safe content, human review for ambiguous cases, and immediate blocking for clear threats. Tested on 6,000 prompts, the system achieves 85% accuracy, outperforming Google’s Perspective API, Detoxify, and Meta’s Llama Guard by up to 46 percentage points, while running fast enough for real-time use on ordinary CPU-only hardware. Crucially, adversarial testing reveals that the classifier catches 100% of explicit attacks but only 67% of subtle, implicit harm such as veiled self-harm language. This gap is precisely the point: no AI system, however sophisticated, can reliably interpret every shade of human intent. The MEDIUM tier exists to acknowledge this limitation honestly and to keep humans in the loop where it matters most. These findings carry a clear message for the future of AI safety: moving beyond binary moderation toward three-way risk classification with built-in human oversight is not merely an improvement; it is a necessity. With the EU Digital Services Act and the UK Online Safety Act now mandating proportionate moderation and meaningful human oversight, the MEDIUM tier offers a concrete, auditable pathway to regulatory compliance.
Keywords: Content moderation, large language models, risk classification, human oversight, AI safety
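To make the graduated-moderation policy described in the abstract concrete, the following is a minimal sketch of the three-way routing step; the tier names follow the paper, but the 0.40 and 0.80 risk cutoffs are hypothetical placeholders rather than the calibrated thresholds the system actually uses.

```python
# Illustrative sketch only: the 0.40 / 0.80 cutoffs are hypothetical
# placeholders, not the calibrated thresholds reported in the paper.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "auto-approve"
    MEDIUM = "route to human reviewer"
    HIGH = "block immediately"


@dataclass
class Decision:
    tier: Tier
    risk: float  # calibrated probability that the prompt is harmful


def route(risk: float,
          low_cutoff: float = 0.40,    # hypothetical MEDIUM boundary
          high_cutoff: float = 0.80) -> Decision:
    """Map a calibrated risk score to one of three moderation actions."""
    if risk >= high_cutoff:
        return Decision(Tier.HIGH, risk)
    if risk >= low_cutoff:
        # Uncertain or context-dependent: keep a human in the loop.
        return Decision(Tier.MEDIUM, risk)
    return Decision(Tier.LOW, risk)


# Example: an ambiguous prompt scoring 0.55 is escalated, not auto-decided.
print(route(0.55).tier.value)  # -> "route to human reviewer"
```

The design point this illustrates is that the MEDIUM band is an explicit output of the classifier, not a fallback: scores in the uncertain range are deliberately handed to human reviewers instead of being forced into a binary decision.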
