Just a quick comment.
A brute-force conversion from 24 bits down to 16 bits, done by just chopping off the 8 lowest bits, should not cause your problem: the 16 most significant bits remain the same.
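To make that concrete, here is a minimal sketch of plain truncation, assuming signed 24-bit samples held in ordinary Python integers. The function name is made up for illustration; it is not taken from your program.

```python
def truncate_24_to_16(sample24: int) -> int:
    """Drop the 8 least significant bits of a signed 24-bit sample.

    Hypothetical helper for illustration only. Python's >> is an
    arithmetic shift, so the sign of negative samples is preserved.
    """
    return sample24 >> 8


# A full scale 24-bit sample maps to a full scale 16-bit sample.
print(hex(truncate_24_to_16(0x7FFFFF)))   # positive full scale
print(truncate_24_to_16(-0x800000))       # negative full scale
```

Nothing is added to the sample here, so truncation alone cannot push a value past 16-bit full scale.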
But if the program you are using applies dither during the truncation process, it could attempt
to add a little bit of "noise" to a sample that is already at 16-bit full scale. Half the time that "noise"
is positive and the full scale sample "overflows".
I think, with some reservations due to a hasty read :-), that the problem here is how your
program deals with adding a small value to a full scale sample. Ideally it should remain full scale
(no change). It sounds as if your program has a different policy on what to do with overflows.
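As a rough illustration of the two overflow policies, here is a small sketch. I am only guessing that your program wraps around; the function names and the noise value of 1 are assumptions made for the example, not anything I know about the actual software.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def wrap16(value: int) -> int:
    """Two's-complement wraparound, as unchecked 16-bit arithmetic behaves."""
    return (value + 0x8000) % 0x10000 - 0x8000


def saturate16(value: int) -> int:
    """Clamp to the 16-bit range, so a full scale sample stays full scale."""
    return max(INT16_MIN, min(INT16_MAX, value))


full_scale = INT16_MAX
noise = 1  # hypothetical positive dither value

print(wrap16(full_scale + noise))      # jumps to the most negative value
print(saturate16(full_scale + noise))  # stays at positive full scale
```

With wraparound, a positive full scale sample suddenly becomes negative full scale, which would be heard as a loud click. With saturation the sample is left unchanged.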
Any chance we could see a display of the samples?