LLMs didn't train on images, but on pictures.

An image could be any random bit of noise expressed as a 2D array of pixels.

A picture, in contrast, is something a human decided to capture.

An intentional act of curation, an assertion that "this image is of something useful."

Similarly, LLMs didn't train on all plausible text, on text that could have been uttered; they trained on text that was uttered.

Text that some human, at some point, decided was useful to utter.

How common a picture or utterance was in the training set was roughly proportional to how useful humans, collectively, found it in the past.

LLMs generate text in response to whatever inane thing you ask them to do.

The human still decided to ask the LLM to generate the text, implying the output is at least plausibly useful.

But what the LLM produces is always more "average" than what a real human would have said.

A small but consistent asymmetry.

That means that as LLMs generate more text, the usefulness signal carried by the text that exists in the world erodes just a little bit.

Fast forward many, many years, and you get the heat death of the information universe.
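To make the erosion concrete, here is a toy simulation, a minimal sketch rather than anything from the episode. It assumes the "usefulness" of each piece of text can be summarized as a single number, and that LLM-generated text replacing human text is pulled toward the current average; every constant in it (LLM_SHARE, PULL_TO_MEAN, and so on) is an assumption chosen only for illustration.

```python
import random
import statistics

# Toy sketch (not from the original post): model the "usefulness signal" of
# the text that exists in the world as a spread of numbers. Each generation,
# an assumed fraction of that text is replaced by LLM-generated text, which
# is assumed to sit closer to the current average: the small but consistent
# asymmetry described above. All parameter values are made up.

GENERATIONS = 50      # assumed number of generate-and-replace cycles
CORPUS_SIZE = 10_000  # assumed number of texts in the world
LLM_SHARE = 0.3       # assumed fraction replaced by LLM output per cycle
PULL_TO_MEAN = 0.5    # assumed strength of the pull toward the average

random.seed(0)
# Start from a wide spread of human-written "usefulness" values.
corpus = [random.gauss(0.0, 1.0) for _ in range(CORPUS_SIZE)]

for gen in range(GENERATIONS + 1):
    if gen % 10 == 0:
        spread = statistics.stdev(corpus)
        print(f"generation {gen:3d}: spread of usefulness signal = {spread:.4f}")
    mean = statistics.fmean(corpus)
    corpus = [
        # LLM-generated replacement: pulled toward the current average;
        # otherwise the existing human-written value is kept as-is.
        v + PULL_TO_MEAN * (mean - v) if random.random() < LLM_SHARE else v
        for v in corpus
    ]
```

Run it and the spread of the usefulness signal shrinks generation after generation; the per-cycle change is small, but because each cycle starts from the already-averaged corpus, the effect compounds instead of washing out.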
