LLMs don't distinguish between passive context and active instructions.

2025-02-18 · Bits and Bobs 2/18/25

LLMs don't distinguish between passive context and active instructions.
- An example of an instruction: "distill this context into 5 funny examples".
- There's no way to delineate between the two.
- Code is inert unless executed by a parser and executer tuned for it.
- An input stream is only dangerous if it turns out to be executable and you execute it or are tricked into executing it now or downstream.
- You can structurally break any unexpected code that's in the path of execution since there are strict grammars it needs to fit in.
  - You can spoil any possibly malicious code very easily.
  - There are inert regions in strings, e.g inside of quotes, so you can make sure any malicious bits are included in non-executable strings, for example.
- Parsing is the gate.
- Execution is the danger.
- You can mangle data so even if it's malicious it won't parse or won't execute.
  - Make it so if it's dangerous it will be mangled enough to jam the machine before it successfully executes.
- But English is always the same in either situation, so these techniques don't work.
- It's not possible to structurally mangle to make sure it won't be "executable".
- That means that any text that you want to be inert parts of your "context" might accidentally include "executable" instructions that the LLM follows.
- There's no good way to defend against it!