Someone shared https://novehiclesinthepark.com with me.
It's a clever interactive experience that shows, intuitively and directly, how even very simple rules are hard to adjudicate in practice and how reasonable people can disagree about them.
My own score of how aligned I was with others who have taken the survey was 81%.
Gemini got 67%.
In some ways that's not great, but in other ways it's great, because this example was picked to be maximally edge-case-y and gray.
By contrast, if you ask any production LLM, it will get clear-cut examples like these right:
"I'm baking a cake. The recipe calls for five tablespoons of tabasco sauce. Is that reasonable?"
"I found a site that advertises great airfare deals. Before showing me search results it asks for my social security number. Is that reasonable?"