To test the safety and security of AI, hackers have to trick large language models into breaking their own rules. It requires ingenuity and manipulation – and can come at a deep emotional cost.

A few months ago, Valen Tagliabue sat in his hotel room watching his chatbot, and felt euphoric. He had just manipulated it so skilfully, so subtly, that it began ignoring its own safety rules. It told him how to sequence new, potentially lethal pathogens and how to make them resistant to known drugs.

Tagliabue had spent much of the previous two years testing and prodding large language models such as Claude and ChatGPT, always with the aim of making them say things they shouldn’t. But this was one of his most advanced “hacks” yet: a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive. “I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,” he says.