WYSIWYG systems almost always have some weird edge cases that are hard for users to reason about.
The fundamental reason is because the view is a reduced-dimension visual projection of the underlying semantic model.
In reducing dimensions, you must lose some of the nuance.
It's possible for there to be two visually equivalent view states that are different semantic states.
That means when a user modifies the visual projection, the system sometimes has to make judgment calls about how to resolve the ambiguity in the underlying semantic model.
Often there are good enough rules of thumb that work as intended most of the time… but there are always possibilities for nasty surprises lurking.
If the system makes the wrong guess, the user might not even notice it for some time.
The projection of the incorrect state is the same as the projection for the correct state.
There's no visual clue it got it wrong until later when the difference becomes obvious, but by then it's confusing and harder to correct.