LLMs don't distinguish between passive context and active instructions.

· Bits and Bobs 2/18/25
  • LLMs don't distinguish between passive context and active instructions.
    • An example of an instruction: "distill this context into 5 funny examples".
    • There's no way to delineate between the two.
    • Code is inert unless executed by a parser and executer tuned for it.
    • An input stream is only dangerous if it turns out to be executable and you execute it or are tricked into executing it now or downstream.
    • You can structurally break any unexpected code that's in the path of execution since there are strict grammars it needs to fit in.
      • You can spoil any possibly malicious code very easily.
      • There are inert regions in strings, e.g inside of quotes, so you can make sure any malicious bits are included in non-executable strings, for example.
    • Parsing is the gate.
    • Execution is the danger.
    • You can mangle data so even if it's malicious it won't parse or won't execute.
      • Make it so if it's dangerous it will be mangled enough to jam the machine before it successfully executes.
    • But English is always the same in either situation, so these techniques don't work.
    • It's not possible to structurally mangle to make sure it won't be "executable".
    • That means that any text that you want to be inert parts of your "context" might accidentally include "executable" instructions that the LLM follows.
    • There's no good way to defend against it!