In your equal-level picture, one mic blocks off a significant amount of sound intended for its partner mic.
There's also a small issue, though, with the arrival times: your intensity-based stereo image would contradict the image provided by the arrival-time differences. When spacing is used, the R-pointing mic needs to be closer to the R-stack than the L-pointing mic. That way any signal from the R-stack is both stronger _and_ earlier in the R-pointing mic, so the intensity and arrival-time cues agree rather than clash. I expect you knew all that, but I thought it worthwhile restating.
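To put rough numbers on the arrival-time point, here is a minimal Python sketch. The layout (a 30 cm spacing, an R-stack 3 m to the right and 5 m ahead) and the 343 m/s speed of sound are my own assumptions for illustration, not figures from your setup. It only models distance, so the level difference it reports is the small inverse-distance part; the mics' polar patterns would add to it.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 °C


def arrival_and_level(source, mic):
    """Return (arrival time in ms, level in dB relative to 1 m).

    Distance-only model of a point source: no polar pattern, no reflections.
    """
    d = math.dist(source, mic)
    return 1000.0 * d / SPEED_OF_SOUND, -20.0 * math.log10(d)


# Hypothetical positions in metres (x to the right, y towards the stage).
r_stack = (3.0, 5.0)
mic_l = (-0.15, 0.0)  # L-pointing mic, further from the R-stack
mic_r = (0.15, 0.0)   # R-pointing mic, closer to the R-stack

t_l, level_l = arrival_and_level(r_stack, mic_l)
t_r, level_r = arrival_and_level(r_stack, mic_r)

delay_ms = t_l - t_r          # positive: R-pointing mic hears the R-stack first
level_db = level_r - level_l  # positive: R-pointing mic also gets it louder

print(f"R mic leads by {delay_ms:.2f} ms and {level_db:.2f} dB (distance only)")
```

Swap the two mic positions and both numbers go negative, which is the contradiction described above: the R-pointing mic would still favour the R-stack through its pickup pattern while receiving it later.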
And why is it such a hassle to get XY? Can't you just use one or two thread adaptors as spacers?