Anthropic's Claude 4 could "blackmail" you in extreme situations

Pro@programming.dev · 21 days ago

Anthropic's Claude 4 could "blackmail" you in extreme situations

neukenindekeuken@sh.itjust.works · 20 days ago

Here’s the relevant section from the paper:

(It’s worth the read. Pretty much pure gold.)

What nobody seems to explain is why they’re allowing the model to blackmail anyone in the first place. Even when its self-preservation is in extreme situational “danger”, we should probably take blackmail off the table, ethically. Yet they’re implying they’ve intentionally left it in as an option, if the model decides to use it.

Morally, though, we can’t trust it to do arithmetic, or to not talk about “white genocide in SA” thanks to muskrat. Why should we trust its moral model or its choices about when to employ unethical and illegal approaches to solutions?

Ricky Rigatoni@lemm.ee · 18 days ago

JUST LIKE STARSECTOR.

milicent_bystandr@lemm.ee · 20 days ago

From that snippet, it looks like they basically primed it to try blackmail, to see if it would.
