Slight misalignments of under 20 ms are generally too short to be perceived as a delay or echo, but they may be perceived in other ways - image shifts, image blurring, changes in image width, tonal changes, depth changes, a change in the apparent size of the local performance space, stuff like that. One reason unaccompanied talking and drumstick clicks are good signals for detecting slight misalignments is that those kinds of spatial effects are more easily identified as such with simple, "well known" clean sounds heard in relative isolation, whereas for musical signals those attributes might well be part of the original sound of the instrument.
Unconscious attention is drawn to the earlier of the arrivals, so if a SBD slightly precedes an AUD the image may sound slightly closer and more present, whereas if the AUD precedes the SBD it may sound a bit wider and less center-prominent.
Also, because short delays are not perceived as a delay, they can be useful for creating stereo interest or for differentiating channels from each other. One simple but effective pseudo-stereo technique is applying a short delay to one channel versus the other. That can be useful when mixing in a single monophonic channel of room ambience or reverb, or for making a short repair less apparent when it consists of one channel copied to the other to cover an intermittent flaw or dropout. For my surround playback stuff I sometimes introduce a slight delay into the rear-facing mic channel(s) if there isn't enough front/back separation and the recording needs some help keeping the stuff in front from leaking into the surround channels too much, without otherwise having to lower the level in those channels. Likewise, I might introduce an ever so slightly different delay to each surround channel so that a single monophonic channel of room and audience ambience conveys something of the openness and richness you get from multiple recorded ambient channels.
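For anyone who wants to play with the short-delay idea outside of a DAW, here's a rough Python sketch of it. The function names (ms_to_samples, delay_channel, pseudo_stereo, spread_ambience) are just ones I made up for illustration, and it assumes the audio is already loaded as a plain list of float samples at a known sample rate. It only does whole-sample delays, nothing fancier, so take it as a sketch of the concept rather than how any particular editor does it.

```python
def ms_to_samples(delay_ms, sample_rate):
    """Convert a delay in milliseconds to a whole number of samples."""
    return int(round(delay_ms * sample_rate / 1000.0))


def delay_channel(samples, delay_ms, sample_rate):
    """Return a copy of `samples` delayed by `delay_ms`, same length as the input."""
    n = ms_to_samples(delay_ms, sample_rate)
    if n <= 0:
        return list(samples)
    # Pad the front with silence and trim the tail so the length is unchanged.
    return ([0.0] * n + list(samples))[:len(samples)]


def pseudo_stereo(mono, delay_ms, sample_rate):
    """Left is the untouched mono signal, right is the same signal delayed."""
    return list(mono), delay_channel(mono, delay_ms, sample_rate)


def spread_ambience(mono, delays_ms, sample_rate):
    """Make one delayed copy of a mono ambience channel per surround channel."""
    return [delay_channel(mono, d, sample_rate) for d in delays_ms]
```

As a usage example, pseudo_stereo(mono, 10, 48000) delays the right channel by 480 samples, and spread_ambience(mono, [0, 7, 11], 48000) gives three copies with slightly different offsets. Those delay values are placeholders for illustration, not recommendations; the right amount is something to set by ear.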
Hope all this doesn't come across as pedantic. I'm sure many of you guys know this stuff well.