A search engine's quality is determined by its inputs.

· Bits and Bobs 7/15/24

Those inputs are transformed by an algorithm into the output: the Search Engine Results Page (SERP).

The inputs for a search engine are:

The public, crawlable internet.

The querystream / clickstream of users on the search engine itself: what people search for, and which results they click on.

The former is visible to every competitor; anyone can presumably build a similar index (including the PageRank calculation).

There's lots of data in the link graph, but it's data anyone could recreate.
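To make the "anyone could recreate it" point concrete, here is a minimal sketch of PageRank via power iteration, the kind of link-graph computation any competitor with a crawl can reproduce. The graph, function name, and parameters are illustrative assumptions, not anyone's actual implementation.

```python
# Minimal PageRank via power iteration -- a sketch, not a production ranker.
# The toy graph and parameter values are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a baseline of (1 - damping) / n ...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # ... plus a damped share of rank from each inbound link.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank evenly to all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy three-page web: "c" is linked to by both "a" and "b",
# so it ends up with the highest rank; ranks sum to ~1.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy)
```

The ranking signal here depends only on the public link graph, which is exactly why it confers no durable moat on its own.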

But the differentiated inputs are the querystream and clickstream.

Those are two extremely high-potency signals.

Proprietary access to those signals gives the leading search engine a very strong data network effect.

AI has a few similar structural dynamics.

Common Crawl is (presumably) used by almost every LLM.

There's also a proprietary "querystream" for each of the models.

When the models are used via the API, most providers contractually agree not to use the queries for training.

For the direct consumer applications created by the providers, some (like OpenAI) reserve the right to train on the querystream.

Interestingly, if I understand correctly, Anthropic explicitly says they won't use the querystream to train their models.
