The original post: /r/cybersecurity by /u/petitlita on 2024-10-13 12:32:09.

It’s far too easy for an attacker to control practically every level of an LLM - the dataset, model, all parts of the prompt, and as a result, the output. Like there’s attacks on agentic models that are basically as easy as phishing but can get you RCE. The fact is that responses by nature have to leak some information about the model, which can be used to find a sequence of tokens that gets a desired response. It’s probably unrealistic to assume we can actually prevent someone from forcing an AI to act outside of its guardrails. Why are we treating them as trusted and hoping they will secure themselves?