iZotope RX or Ultimate Vocal Remover?
If not.. or possibly in combination..
What are the good stereo qualities of the stack tape you would like to retain? Does it convey actual stereo image placement of sound sources? Or is the stereo aspect of it mostly a more open, non-monophonic portrayal of audience, reverberance, and room sound? If it's more the second than the first (which is typical of a stack tape), and the clean channel sounds good on its own in isolation except for being less involving and somewhat dimensionless in a stereo sense, you might use just that channel and apply some flavor of pseudo-stereo processing to it.
Advantages of that approach are that the work will mostly involve dialing in the most appropriate pseudo-stereo sound and then applying that to the entire recording, rather than doing lots of cross-fading (less post work), and it avoids transitions back and forth that could be noticeable and distracting.
There are a number of pseudo-stereo techniques you might use, which have been covered in other TS threads. Most of those will use just the "good" channel and discard the other one. I like using Mid/Side processing for stuff like this, partly because it can retain both original channels. If you can EQ the channel that has the extraneous singing in it such that the voice is significantly attenuated (deeply scoop out the mids, leaving the lows and highs, and don't worry that it sounds incorrect on its own), you can use that as the Side channel and the clean channel as the Mid channel. The channel with the attenuated singing then becomes the "stereo effect channel". After Mid/Side processing, its content will be heard out to either side, with the clean channel dominating the center of the playback image. Stereo involvement will manifest mostly in the low and high frequency regions, while the midrange will be focused and centered. Any attenuated extraneous singing that may still be audible will tend to be diffuse and out to the sides.
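If it helps to see the signal flow, here is a rough Python sketch of the Mid/Side idea described above. It is only an illustration under some assumptions of mine: the function names, the 200 Hz and 4000 Hz scoop corners, and the 0.5 Side gain are all hypothetical placeholders, and the "EQ" is a pair of very gentle one-pole filters rather than the deep scoop you would actually dial in by ear with a real EQ in your editor. The Mid/Side decode itself is the standard L = M + S, R = M - S.

```python
import math

def one_pole_lowpass(x, cutoff_hz, sample_rate):
    """Basic one-pole low-pass (gentle 6 dB/octave slope)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y, out = 0.0, []
    for s in x:
        y = (1.0 - a) * s + a * y
        out.append(y)
    return out

def scoop_mids(x, low_hz, high_hz, sample_rate):
    """Crude mid scoop: keep lows below low_hz plus highs above high_hz.

    Stand-in for a proper EQ; the corner frequencies are assumptions.
    """
    lows = one_pole_lowpass(x, low_hz, sample_rate)
    below_high = one_pole_lowpass(x, high_hz, sample_rate)
    highs = [s - b for s, b in zip(x, below_high)]  # input minus low-pass = high-pass
    return [lo + hi for lo, hi in zip(lows, highs)]

def mid_side_decode(mid, side, side_gain=1.0):
    """Standard Mid/Side decode: L = M + g*S, R = M - g*S."""
    left = [m + side_gain * s for m, s in zip(mid, side)]
    right = [m - side_gain * s for m, s in zip(mid, side)]
    return left, right

def pseudo_stereo(clean, vocal_channel, sample_rate,
                  low_hz=200.0, high_hz=4000.0, side_gain=0.5):
    """Clean channel as Mid, mid-scooped other channel as Side."""
    side = scoop_mids(vocal_channel, low_hz, high_hz, sample_rate)
    return mid_side_decode(clean, side, side_gain)
```

Here clean and vocal_channel are just lists of float samples from the two original channels. Raising side_gain widens the image and also brings up whatever singing survives the scoop, so that is the control to balance by ear.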