In my YouTube demo the music sample was chosen for a good range of frequencies, together with transients amid held tones, to reveal any compression. I thought it sounded fine, but younger ears might be more critical! I can imagine that the device is optimised for voice, but nobody wants thin-sounding or muffled voices.
[Edited to add: I just noticed that YouTube have added a link to the music I used, in the description of my video. This makes it very easy to compare the original track against my recording of that track being played on my Tannoy bookshelf speakers, sourced from YT Music running on my TV. Of course the speakers, the room, etc. will degrade the sound, but for all that it's better than I would have expected.]