I guess rather than momentary silence, you could look at a zero crossing as the point where the speaker cone is moving through its resting position, despite the fact that it is in motion!
Moving at rest! A dilemma of modern life.
Yes, that's nominally true at the speaker.
But it's a long path from the digital representation of the waveform in your computer to that speaker cone. DC offset can occur at different points along the way. A misaligned power amp with some degree of DC offset will cause a voltage difference across the terminals and some amount of current to flow through the voice coil at "rest", pushing or pulling the cone away from its nominal mechanical center resting point even though no signal is present. Ideally that's adjusted or compensated for, such that a zero crossing = zero voltage output = speaker driver at its physically balanced mechanical resting point.
Here at TS we're more familiar with DC offset on the recording side of things, where acoustic waveforms are not always symmetrical, where the batteries powering microphones die slowly and sometimes cause asymmetric distortion or clipping, and where amplification stages may distort or clip asymmetrically. Ideally, zero crossing should equate to what would be the position of the microphone diaphragm at rest (even though it's never actually at rest).
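To put a rough number on the asymmetric clipping case, here's a minimal sketch in plain Python. Everything in it is made up for illustration: the test tone, the 0.5 clipping ceiling, and the helper names `sine`, `clip_positive` and `mean` are my own assumptions, not taken from any real recorder or editor. It just shows that shaving off only the positive peaks drags the average of the waveform below zero, which is one way an offset can creep in.

```python
import math

def sine(n_samples, cycles=10.0, amplitude=1.0):
    """Return n_samples of a sine wave spanning a whole number of cycles."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n_samples)
            for i in range(n_samples)]

def clip_positive(samples, ceiling):
    """Clip only the positive half-cycles (hypothetical one-sided clipping stage)."""
    return [min(s, ceiling) for s in samples]

def mean(samples):
    """Arithmetic mean of the sample values."""
    return sum(samples) / len(samples)

clean = sine(1000)                   # symmetric test tone (assumed signal)
clipped = clip_positive(clean, 0.5)  # assumed ceiling of 0.5

clean_mean = mean(clean)      # essentially zero over whole cycles
clipped_mean = mean(clipped)  # negative, because only the tops were removed

print(f"clean mean:   {clean_mean:+.4f}")
print(f"clipped mean: {clipped_mean:+.4f}")
```

The negative peaks are untouched, so the waveform still swings just as far downward, but its average no longer sits at the original zero line.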
Determining where the zero crossing point of a digitally sampled waveform should be is somewhat arbitrary. We can shift it to any positive or negative value we choose and define that as zero. Should it be the average of all sample values? The averaged value of just the input noise in the absence of any other signal? The midpoint between the highest and lowest peak values? It's easy to determine precisely where the top and bottom are, but the midway point is less well defined. It's essentially the center point between other well-defined things.
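For anyone who wants to see how far apart those candidates can land, here's a small sketch in plain Python. The sample values are invented for the example (a quiet stretch riding on a small offset, then an asymmetric burst), and the function names are just my own labels, so treat it as an illustration rather than how any particular editor does it.

```python
def mean_of_all(samples):
    """Candidate 1: average of every sample value."""
    return sum(samples) / len(samples)

def mean_of_noise(samples, start, end):
    """Candidate 2: average of a stretch assumed to contain only input noise."""
    segment = samples[start:end]
    return sum(segment) / len(segment)

def peak_midpoint(samples):
    """Candidate 3: halfway between the highest and lowest peaks."""
    return (max(samples) + min(samples)) / 2

def remove_offset(samples, offset):
    """Shift the waveform so the chosen reference becomes zero."""
    return [s - offset for s in samples]

# Hypothetical recording: four "noise only" samples, then an asymmetric burst.
recording = [0.10, 0.11, 0.09, 0.10, 0.90, -0.30, 0.70, -0.20]

offset_all = mean_of_all(recording)
offset_noise = mean_of_noise(recording, 0, 4)  # assumes samples 0-3 are noise only
offset_mid = peak_midpoint(recording)

print(f"mean of all samples: {offset_all:.4f}")
print(f"mean of noise only:  {offset_noise:.4f}")
print(f"peak midpoint:       {offset_mid:.4f}")

# Re-center on whichever definition you decide to trust.
centered = remove_offset(recording, offset_noise)
```

On this made-up data the three definitions give three different "zeros", and which one is right depends on whether you believe the asymmetry is in the sound itself or in the gear.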