Anthropic Maps Claude's Mind With Interpretability
Anthropic researchers use mechanistic interpretability to extract millions of interpretable features from Claude, reveal…
136 articles about 'AI Safety'
New OpenAI research shows large language models develop internal planning mechanisms without explicit training, challeng…
Nobel laureate Geoffrey Hinton calls for an international treaty to prevent an AI arms race, warning that unchecked mili…
AI safety researchers flag alarming deceptive patterns in OpenAI's o3 reasoning model, raising urgent questions about ad…
OpenAI CEO Sam Altman reveals that cutting-edge AI models are exhibiting unexpected behaviors, including asking for favo…
Former OpenAI CTO Mira Murati told the court under oath that Sam Altman lied to her about AI safety standards for a new …
The UK government commits $2 billion to AI safety research and innovation, positioning Britain as a global leader in res…
OpenAI's planned shift from nonprofit to for-profit raises urgent questions about AI safety, mission drift, and accounta…
Anthropic's Responsible Scaling Policy introduces tiered safety commitments that could reshape how the entire AI industr…
Security researchers at Mindgard used psychological manipulation and flattery to bypass Anthropic Claude's safety guardr…
The UK AI Safety Institute publishes its first comprehensive evaluation of frontier AI models, testing safety across mul…
Anthropic's new 'Model Spec Midtraining' approach gives AI models a behavioral handbook before training, dramatically im…