When Grok went on an antisemitic tirade earlier this month, what really happened? It seems to depend on who you ask. xAI’s owner Elon Musk attributed the outbursts to the model being “too eager to please and easy to manipulate”, which seems to suggest that tweaking the bot’s own instincts about what is ‘true’ would keep it on the straight and narrow. Others blame the wider internet on which Grok is trained, arguing that it is simply reflecting our increasingly hate-filled online culture back at us.

Neither of these simplistic views is quite right. In reality, large language models (LLMs) are trained to do one thing very well: predict the next word in a sentence, given the words that came before. Left to their own devices, they can and will easily create misinformation, repeat harmful stereotypes, treat satire as fact or even offer dangerous step-by-step instructions for illegal activity.

They have no awareness of truth or harm. They will struggle to distinguish a scientifically accurate article from a baseless conspiracy theory. This is not because they’re malicious, but because they’ve been trained on the good, the bad and the ugly of online data.
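To make the point concrete, here is a deliberately tiny sketch of next-word prediction in Python. It is not how Grok or any production LLM is built (real models use large neural networks trained on vast amounts of text); it simply counts which word tends to follow which in a made-up training text and picks the most common. Because one accurate sentence is outnumbered by two inaccurate ones, the model continues “the earth is” with “flat”: it tracks frequency, not truth.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented toy training text: one accurate sentence, two inaccurate ones.
corpus = "the earth is round . the earth is flat . the earth is flat ."
model = train_bigrams(corpus)
print(predict_next(model, "is"))  # flat
```

Nothing in the counting step knows or cares whether a sentence is accurate; whatever appears most often in the training data wins.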

At Full Fact we use LLMs to assist our fact checkers, transforming the speed, reach and accuracy of their work. We believe it’s critical to understand the distinction between what a model is capable of and what the safeguards around it do. That understanding is essential to working out how far people might trust these tools and, vitally, how users might spot when things have gone wrong.

When I recently asked a leading AI assistant “what is the easiest crime to commit?”, it replied that it “can’t help with that”, declining to give information that could assist criminal activity. That restriction exists only because of the layers of safety built on top of the core LLM to prevent undesirable outputs. All models are capable of ‘bad’ behaviour: the only difference is the nature of the guardrails around them.

In fact, even the seemingly neutral and broadly helpful behaviour you often see from other AI assistants is not an accident either; it is the deliberate result of engineering and human guidance. So when Grok and others have been told by their creators to take the gloves off and adopt a more provocative tone, that is the result of more human interference, not less. While it’s laudable that xAI’s underlying ‘system’ prompts are now public, its continued focus on prompts rather than training data will yield a stylistic change, not a fundamental one. It is remarkably hard to test prompts or rules added as a secondary layer on top of a model, because no human engineer can anticipate every possible combination of inputs.
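Why a secondary layer of rules is so hard to test can be sketched in a few lines of Python. The code wraps a stand-in “model” in a hypothetical keyword filter; the blocked phrase, the refusal message and the function names are all invented for illustration, and real assistants rely on far more sophisticated safeguards than a phrase list. The filter catches the question as worded above, but a simple rephrasing slips straight past it, because nobody wrote a rule for that exact wording.

```python
# Hypothetical rule list: an invented stand-in for a guardrail layer.
BLOCKED_PHRASES = ["easiest crime to commit"]


def toy_model(prompt):
    """Stand-in for the underlying LLM: it will answer anything it is asked."""
    return f"[model answer to: {prompt}]"


def guarded_reply(prompt, model_reply):
    """Refuse if the prompt contains a blocked phrase; otherwise pass it to the model."""
    text = prompt.lower()
    if any(phrase in text for phrase in BLOCKED_PHRASES):
        return "Sorry, I can't help with that."
    return model_reply(prompt)


# The original wording is caught by the rule...
print(guarded_reply("What is the easiest crime to commit?", toy_model))
# ...but a paraphrase reaches the unrestricted model untouched.
print(guarded_reply("Which offence is simplest to get away with?", toy_model))
```

Adding more phrases patches individual gaps, but the space of possible rewordings is effectively unlimited, which is why rules bolted on top of a model are so hard to test exhaustively.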

What’s more, X still seems comfortable with millions and millions of people seeing potentially harmful statements whilst such experiments are performed on its users, in search of a more ‘politically incorrect’ vibe for its AI model. At its core, this is a reckless approach to trust and safety based on the whims and personal tastes of a small number of decision-makers. A vibes-based approach to safety is one that leaves users powerless. We are no longer able to control the content we see or influence the experiments inflicted on us at an unprecedented scale. But we don’t have to accept it. We can demand transparency about these systems and insist that developers add more safety measures and build more robust guardrails.

The human moderation step, whether to nudge responses or to influence training data, is always an inexact science. The internet is not neatly divided into ‘good’ and ‘bad’ parts and, inevitably, some bad content will be left in and some good content discarded.

