A mental model: data is "radioactive" if it could be tied to someone's identity.

If that data touches other data, or is shown or shared in the wrong context, there could be an explosion.

But it's possible to "denature" datastreams: to transform them so that individual users can no longer be identified from them (below some differential privacy epsilon).
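
To make the denaturing step concrete, here's a minimal sketch in Python using randomized response, a classic local differential privacy mechanism (the post doesn't name a mechanism; this choice is purely illustrative):

```python
import math
import random

def randomized_response(true_bit: bool, epsilon: float) -> bool:
    """Report a boolean signal with epsilon-local differential privacy.

    With probability e^eps / (e^eps + 1), report the truth; otherwise
    report the flipped bit. Smaller epsilon means more noise, and less
    ability to tie any single report back to the user who sent it.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if random.random() < p_truth else not true_bit
```

If each user's device flips the coin before anything leaves the device, the raw signal never enters the pipeline at all.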

If you do this early in the data pipeline, then the data is much less dangerous and can more easily be used for things like improving system quality via wisdom-of-the-crowds style approaches.
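
Continuing the illustrative sketch above: no individual report is trustworthy, but because the noise was applied in a known way, the population-level statistic can be recovered, which is exactly the wisdom-of-the-crowds property.

```python
import math

def estimate_true_rate(reports: list[bool], epsilon: float) -> float:
    """Estimate the population rate behind a pile of denatured reports.

    If p is each reporter's truth-telling probability, then
    observed_rate = true_rate * (2p - 1) + (1 - p), which we invert.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

With epsilon = 1.0 and ten thousand reports, the estimate lands within about a percentage point of the true rate, even though any single report is barely better than a coin flip.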

So why doesn't everyone do this by default?

Because if it turns out you need some facet of the data that was removed in the denaturing process, you're screwed.

In that case, all of the data you have sitting around is useless.

You need deep know-how and expertise in a given quality domain to know which aspects of the data are important to preserve.

And in novel domains, it's not easy to know what facets of the data will be most useful. It requires experimentation.

Denaturing data basically removes its option value… it limits not just the harm the data can do, but also the good.

Denaturing data early in the pipeline makes it very hard to do open-ended quality experimentation.

Finally, it's very hard to prove to users that you're doing this.

This is typically an internal implementation detail, with no external visibility or contracts (and that lack of transparency is what lets you change it later).

But if you denature data and don't tell users, then beyond reducing the catastrophic tail risk of a data leak, you don't get much benefit from doing it.

No visible benefit and all cost means that companies don't do it very often.

In practice, companies say "screw it, we'll keep it all… just in case. We're not like those other idiots who will have a data breach."

But then of course in the future there is a data breach, and users lose out.

Users learn to be generally wary of any new entity collecting their data, in a fuzzy, hard-to-articulate way.

Companies erroneously conclude, "users don't really care about privacy anyway; look not at what they say but at what they do."

But users never really had much of a choice.
