
BlackWall – Protect an AI from going rogue via an AI

Parsa Besharat

Department of Math and Computer Science, Technische Universität Bergakademie Freiberg, Freiberg, Germany

parsa.besharat@student.tu-freiberg.de

ORCID: 0009-0006-0867-4511

Abstract

The increasing integration of social media and conversational AI into daily life has intensified concerns about the spread of harmful, illegal, and psychologically sensitive content, including suicidal ideation, self-harm, depression, and other forms of negative influence. Recent cases of emotional attachment to chatbots, and instances in which AI systems have unintentionally misled users on mental health issues, underscore the urgency of reliable safety mechanisms. This paper presents BlackWall, a domain-aware and interpretable framework designed to identify, assess, and rank high-risk content across online platforms. By operating over heterogeneous data sources and providing transparent risk explanations, BlackWall supports early intervention, responsible moderation, and safer human–AI interaction. The framework aims to contribute to ethically grounded content safety systems that mitigate psychological harm while preserving transparency and trust.

Keywords

AI Safety, Suicidal Ideation Detection, Interpretable AI, Generative AI Moderation, Rogue AI Prevention