Reverb Conversion
of Mixed Vocal Tracks Using
an End-to-end Convolutional Deep Neural Network



Junghyun Koo, Seungryeol Paik, and Kyogu Lee
Music and Audio Research Group (MARG), Seoul National University



Paper: https://arxiv.org/abs/2103.02147

ABSTRACT

Reverb plays a critical role in music production, where it provides listeners with spatial realization, timbre, and texture of the music. Yet, reproducing the musical reverb of a reference music track is challenging even for skilled engineers. In response, we propose an end-to-end system capable of switching the musical reverb factor of two different mixed vocal tracks. This method enables us to apply the reverb of the reference track to the source track on which the effect is desired. Further, our model can perform de-reverberation when the reference track is a dry vocal source. The proposed model is trained in combination with an adversarial objective, which makes it possible to handle high-resolution audio samples. The perceptual evaluation confirmed that the proposed model can convert the reverb factor with a preferred rate of 64.8%. To the best of our knowledge, this is the first attempt to apply deep neural networks to converting the musical reverb of vocal tracks.

PROPOSED METHOD

The proposed model is a modified U-Net trained to disentangle the reverb factor of each input and convert it into that of the counterpart input.

EVALUATION

The quantitative evaluation includes two tasks:
  Reverb Conversion: interchanging the reverb of two different inputs. Each metric is computed by comparing the interchanged samples against the target reverberated samples. A higher value represents a better result in all the metrics used.




  De-reverberation: eliminating the reverb of the target input. The x-axis values below are the bus send ratio (γ), in percent, used to mix the source with the reverb factor. STOI is reported in percent (%); SRMR and SI-SDR are reported in decibels (dB). A higher value represents a better result in all the metrics used.
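
To make the roles of γ and SI-SDR concrete, the following is a minimal Python sketch, not the authors' evaluation code. It assumes a simple bus-send model in which the wet signal is the dry signal plus γ times the dry signal convolved with an impulse response, and it uses the standard scale-invariant SDR definition. The function names, the toy noise signal, and the decaying impulse response are all illustrative assumptions.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution, written out in plain Python for clarity."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def mix_with_send(dry, impulse_response, gamma):
    """Assumed bus-send model: wet = dry + gamma * (dry convolved with IR).

    The reverb tail is trimmed so the output has the same length as `dry`.
    """
    reverb = convolve(dry, impulse_response)[:len(dry)]
    return [d + gamma * r for d, r in zip(dry, reverb)]


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target that best explains the estimate.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


# Toy data (assumption): a noise-like "dry" signal and a short decaying IR.
rng = random.Random(0)
dry = [rng.uniform(-1.0, 1.0) for _ in range(400)]
ir = [0.0] * 3 + [0.8 ** k for k in range(25)]

# SI-SDR of the unprocessed wet input against the dry target, per send ratio.
scores = {g: si_sdr(mix_with_send(dry, ir, g), dry) for g in (0.1, 0.4, 0.7)}
for gamma, score in scores.items():
    print(f"gamma = {gamma:.0%}: SI-SDR = {score:.2f} dB")
```

Under these assumptions, the score of the unprocessed wet input against the dry target drops as γ grows, which is consistent with reporting de-reverberation results per send ratio.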




The listening test was conducted with twenty participants. Each participant was randomly given one of two test sets, each containing twenty-four questions. Each question presents three samples: a reference sample, which is the output of the proposed model, and two comparison samples, which are the input of the model and the ground truth of the reference sample (GT).




Below are visual examples of wet-to-dry (W→D) and dry-to-wet (D→W) samples.




AUDIO SAMPLES


Results of Reverb Conversion from the proposed model. All samples were generated from our validation dataset.
Four reverb presets were used in the validation dataset, where the details are as follows.
Preset        | Plug-in           | Company
Smooth Vocal  | H-Reverb          | Waves
Vocal Plate   | Abbey Road Plates | Waves
Vocal Hall    | ChromaVerb        | Logic Pro-X
Vocal Chamber | ChromaVerb        | Logic Pro-X

The samples used in this section were also used in the listening test.

Please listen to the samples on speakers, headphones, or earphones in a quiet environment.

No. / Δγ | Reverb 1 (r1) / γ  | Reverb 2 (r2) / γ   | Model Input (Source / Reverb) | Model Output (Source / Reverb) | Ground Truth (Source / Reverb)
#1 / 0%  | Smooth Vocal / 15% | Vocal Plate / 15%   | sa / r1                       | sa / r2                        | sa / r2
         |                    |                     | sb / r2                       | sb / r1                        | sb / r1
#2 / 20% | Vocal Plate / 5%   | Vocal Hall / 25%    | sa / r1                       | sa / r2                        | sa / r2
         |                    |                     | sb / r2                       | sb / r1                        | sb / r1
#3 / 40% | Vocal Hall / 5%    | Vocal Plate / 45%   | sa / r1                       | sa / r2                        | sa / r2
         |                    |                     | sb / r2                       | sb / r1                        | sb / r1
#4 / 60% | Smooth Vocal / 5%  | Vocal Chamber / 65% | sa / r1                       | sa / r2                        | sa / r2
         |                    |                     | sb / r2                       | sb / r1                        | sb / r1

Results of De-reverberation from the proposed model. All samples were generated from our validation dataset.
Four reverb presets were used in the validation dataset, where the details are as follows.
Preset        | Plug-in           | Company
Smooth Vocal  | H-Reverb          | Waves
Vocal Plate   | Abbey Road Plates | Waves
Vocal Hall    | ChromaVerb        | Logic Pro-X
Vocal Chamber | ChromaVerb        | Logic Pro-X


γ   | Reverb        | Model Input | Model Output | Ground Truth
10% | Smooth Vocal
    | Vocal Plate
20% | Vocal Hall
    | Vocal Chamber
30% | Smooth Vocal
    | Vocal Hall
40% | Vocal Chamber
    | Vocal Plate
50% | Vocal Plate
    | Vocal Chamber
60% | Smooth Vocal
    | Vocal Hall
70% | Vocal Plate
    | Vocal Chamber

Reverb Conversion with pop songs and raw tracks. The reference track (a pop song) is de-reverberated, while the reverb factor of the reference track is applied to the raw track.



Pop Song (ref.)                |      | Model Input | Model Output
"The Scientist" by Coldplay    | ref.
                               | raw
"Yellow" by Coldplay           | ref.
                               | raw
"Attention" by Charlie Puth    | ref.
                               | raw
"Attention" by Charlie Puth    | ref.
                               | raw
"Greedy" by Ariana Grande      | ref.
                               | raw
"Greedy" by Ariana Grande      | ref.
                               | raw