End-to-end Music Remastering System Using Self-supervised and Adversarial Training



Junghyun Koo, Seungryeol Paik, and Kyogu Lee
Music and Audio Research Group (MARG), Seoul National University



Paper: https://arxiv.org/abs/2202.08520

Code: https://github.com/jhtonyKoo/e2e_music_remastering_system

ABSTRACT

Mastering is an essential step in music production, but it is also a challenging task that requires experienced audio engineers, who adjust the tone, space, and volume of a song. Remastering follows the same technical process, with the aim of mastering a song to suit the times. Because these tasks have high entry barriers, we aim to lower them by proposing an end-to-end music remastering system that transforms the mastering style of the input audio to that of the target. The system is trained in a self-supervised manner on released pop songs. We also apply a pre-trained encoder and a projection discriminator, expecting the model to generate realistic audio that reflects the reference's mastering style. We validate our results with quantitative metrics and a subjective listening test, and show that the model generates samples whose mastering style is similar to that of the target.

Audio Samples


Results of Music Remastering from the proposed model.
The samples in this section were generated using our test dataset and were also used for the listening test.

Please listen through speakers, headphones, or earphones in a quiet environment to evaluate the audio samples.

[Audio sample table, samples #1 to #8. Columns: Sample Index, Reference Track, Network Input, Network Output, Random Manipulation, Ground Truth]

Music Remastering application. The source track is remastered with the reference's mastering style.

[Audio sample table. Columns: Original Record (Network Input), Reference Track, Remastered Track (Network Output)]

PROPOSED METHOD

The Mastering Cloner aims to convert the mastering effects of the input track A1 to those of the reference track B2. It is conditioned on the encoded feature of the reference track, which is extracted by the pre-trained Music Effects Encoder. To build a training example, tracks A and B are processed with the same mastering manipulation to produce the ground truth A2 and the reference track B2, while A1, a differently manipulated version of A, serves as the input to the Mastering Cloner. The discriminator is applied with the expectation that the generated audio sounds realistic and carries mastering effects similar to those of the projected track.
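The pairing of A1, A2, and B2 described above can be sketched as follows. This is a minimal illustration, not the authors' code: the "mastering manipulation" is reduced to a single random gain (`random_gain_db` and `apply_gain` are hypothetical stand-ins), and tracks are plain lists of samples.

```python
import random

def random_gain_db(rng):
    # Hypothetical stand-in for a full mastering style: one gain value in dB.
    return rng.uniform(-12.0, 0.0)

def apply_gain(track, gain_db):
    # Convert dB to a linear factor and scale every sample.
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in track]

def make_training_triplet(track_a, track_b, rng):
    """Return (input A1, reference B2, ground truth A2)."""
    style_1 = random_gain_db(rng)  # style applied only to the network input
    style_2 = random_gain_db(rng)  # style shared by reference and ground truth
    a1 = apply_gain(track_a, style_1)
    b2 = apply_gain(track_b, style_2)
    a2 = apply_gain(track_a, style_2)
    return a1, b2, a2

rng = random.Random(0)
track_a = [0.5, -0.25, 0.125]  # toy samples, not real audio
track_b = [0.1, 0.2, -0.3]
a1, b2, a2 = make_training_triplet(track_a, track_b, rng)
```

Because A2 and B2 share one style while containing different music, the model can only reach the ground truth by reading the style from the reference, which is what makes the setup self-supervised.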

EVALUATION



The quantitative evaluation metrics are the RMS difference of the stereo channels, the RMS difference of the side channels (RMS-side), the frequency-weighted segmental Signal-to-Noise Ratio (fw-SNR), and short-time objective intelligibility (STOI), which measure performance in terms of volume, stereo width, tone and timbre, and perceptual quality, respectively. We evaluate the difference between the target mastered track and our model outputs.
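As a rough illustration of the two RMS-based metrics, the sketch below compares a target and an output given as (left, right) pairs of sample lists. The exact definitions used in the paper (for example, whether values are in dB or how channels are aggregated) are not stated on this page, so the linear-scale absolute differences and the (L - R) / 2 side signal here are assumptions.

```python
import math

def rms(samples):
    # Root mean square of a list of samples.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def side_channel(left, right):
    # Side signal of a mid/side decomposition: half the left-right difference.
    return [(l - r) / 2.0 for l, r in zip(left, right)]

def rms_difference(target, output):
    """Absolute RMS difference, averaged over the left and right channels."""
    diffs = [abs(rms(t) - rms(o)) for t, o in zip(target, output)]
    return sum(diffs) / len(diffs)

def rms_side_difference(target, output):
    """Absolute RMS difference between the two side signals."""
    return abs(rms(side_channel(*target)) - rms(side_channel(*output)))

# Toy stereo signals as (left, right); not real audio.
target = ([0.5, -0.5, 0.5, -0.5], [0.5, -0.5, -0.5, 0.5])
output = ([0.25, -0.25, 0.25, -0.25], [0.25, -0.25, 0.25, -0.25])

volume_error = rms_difference(target, output)
width_error = rms_side_difference(target, output)
```

In this toy case the output is both quieter and fully mono, so both errors are non-zero: the first reflects the level mismatch and the second the missing stereo width.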




The listening test was conducted with 17 participants who were familiar with the concept of mastering. The participants were given a total of 15 questions and were instructed to rate each sample according to the similarity of its mastering style to the reference sample on a scale from 0 to 1.
We conducted multiple post-hoc paired t-tests with Bonferroni correction, comparing each anchor with our generated samples, and found a significant difference with p < 0.001.
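For readers unfamiliar with the procedure, the sketch below shows the two ingredients of such a test: the paired t statistic and the Bonferroni adjustment of p-values. The ratings and p-values are made-up illustration values, not data from the listening test. Turning a t statistic into a p-value requires the t-distribution, which the Python standard library does not provide; a library routine such as `scipy.stats.ttest_rel` could be used for that step.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

def bonferroni(p_values):
    # Multiply each p-value by the number of comparisons, capped at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical ratings from five listeners (illustration only).
ours = [0.8, 0.7, 0.9, 0.6, 0.8]
anchor = [0.4, 0.5, 0.5, 0.3, 0.6]
t, dof = paired_t_statistic(ours, anchor)

# Hypothetical raw p-values for three anchor comparisons.
adjusted = bonferroni([0.0002, 0.01, 0.04])
```

The paired form is appropriate because every participant rates both the anchor and the generated sample, so the test operates on per-participant differences.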

SUPPLEMENTARY MATERIALS


Quantitative Results According to Reference Track's Duration

The Music Effects Encoder is capable of encoding variable-length inputs. The results below show the quantitative performance of the Mastering Cloner when the reference track is cut to different fixed durations (in seconds).
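One common way for an encoder to accept variable-length input is to average frame-level features over time, which always yields an embedding of the same size. The sketch below illustrates that idea with a toy encoder and a helper that cuts the reference to a fixed duration. Both `toy_encode` and `crop_to_duration` are hypothetical; this page does not describe how the Music Effects Encoder pools over time or which part of the reference was kept.

```python
def crop_to_duration(samples, sample_rate, seconds):
    """Keep the first `seconds` of a mono track given as a list of samples."""
    n_samples = int(round(seconds * sample_rate))
    return samples[:n_samples]

def toy_encode(samples, frame_size=4):
    """Toy stand-in for an encoder: per-frame features averaged over time."""
    n_frames = len(samples) // frame_size
    if n_frames == 0:
        raise ValueError("input is shorter than one frame")
    mean_abs_sum = 0.0
    peak_sum = 0.0
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        magnitudes = [abs(s) for s in frame]
        mean_abs_sum += sum(magnitudes) / frame_size
        peak_sum += max(magnitudes)
    # Averaging over frames gives a fixed-size embedding for any input length.
    return (mean_abs_sum / n_frames, peak_sum / n_frames)

SAMPLE_RATE = 8  # toy value so that one "second" is 8 samples
reference = [((i % 5) - 2) / 4.0 for i in range(40)]  # 5 toy "seconds"

embeddings = {
    seconds: toy_encode(crop_to_duration(reference, SAMPLE_RATE, seconds))
    for seconds in (1, 2, 5)
}
```

Every entry of `embeddings` has the same size regardless of the reference duration, so the same downstream model can be evaluated across durations without any change.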






Mastering Effects Manipulator

The Mastering Effects Manipulator mimics the procedure of the Mastering Chain. The input track is manipulated in the order of Pre-gain → Equalizer → Stereo Imager → Maximizer and transformed into a different mastering style. (For Pre-gain, we fix the gain value to -8.0 dB.)
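The fixed processing order can be expressed as a simple chain of stages. In the sketch below only the pre-gain stage is implemented, using the -8.0 dB value stated above; the equalizer, stereo imager, and maximizer are represented by a hypothetical `passthrough` placeholder, since their implementations are not described on this page.

```python
PRE_GAIN_DB = -8.0  # fixed pre-gain value stated in the text

def pre_gain(track, gain_db=PRE_GAIN_DB):
    # Convert dB to a linear factor and scale every sample.
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in track]

def passthrough(track):
    # Placeholder for stages that are not implemented in this sketch.
    return list(track)

# Stages in the order Pre-gain -> Equalizer -> Stereo Imager -> Maximizer.
MASTERING_CHAIN = [
    ("pre_gain", pre_gain),
    ("equalizer", passthrough),
    ("stereo_imager", passthrough),
    ("maximizer", passthrough),
]

def apply_chain(track, chain=MASTERING_CHAIN):
    """Run the track through every stage in order."""
    for _name, stage in chain:
        track = stage(track)
    return track

manipulated = apply_chain([1.0, -0.5])
```

A gain of -8 dB corresponds to a linear factor of about 0.398, which leaves headroom for the later stages before the maximizer raises the level again.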

[Audio samples: Original, Manipulated #1, Manipulated #2]


The details of the parameters with their random value range are shown below.

Equalizer (5-band)
Parameter          Min.   Max.    Units  Type
low_shelf_gain     -15    10      dB     int
low_shelf_freq     20     1000    Hz     int
first_band_gain    -15    10      dB     int
first_band_freq    200    5000    Hz     int
first_band_q       5      30      -      int
second_band_gain   -15    10      dB     int
second_band_freq   500    6000    Hz     int
second_band_q      5      30      -      int
third_band_gain    -15    10      dB     int
third_band_freq    2000   10000   Hz     int
third_band_q       5      30      -      int
high_shelf_gain    -15    10      dB     int
high_shelf_freq    8000   20000   Hz     int

Stereo Imager (4-band)
Parameter          Min.   Max.    Units  Type
freq_1             20     1500    Hz     int
freq_2             1500   7000    Hz     int
freq_3             7000   20000   Hz     int
side_balance_1     0.0    3.0     -      float
side_balance_2     0.0    3.0     -      float
side_balance_3     0.0    3.0     -      float
bal_4              0.0    3.0     -      float

Maximizer (Compressor)
Parameter          Min.   Max.    Units  Type
threshold          -10    -6      dB     int
attack_time        0.1    30.0    ms     float
release_time       50     100     ms     int
ratio              5      10      -      int
makeup_gain        4      12      dB     int
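Drawing one random mastering style from these ranges could look like the sketch below. The ranges and types are copied from the table above (including the name `bal_4` as it appears there); the uniform sampling and the `sample_parameters` helper are assumptions, since the page does not state which distribution is used.

```python
import random

# (min, max, type) per parameter, copied from the table above.
PARAM_RANGES = {
    "equalizer": {
        "low_shelf_gain": (-15, 10, "int"),
        "low_shelf_freq": (20, 1000, "int"),
        "first_band_gain": (-15, 10, "int"),
        "first_band_freq": (200, 5000, "int"),
        "first_band_q": (5, 30, "int"),
        "second_band_gain": (-15, 10, "int"),
        "second_band_freq": (500, 6000, "int"),
        "second_band_q": (5, 30, "int"),
        "third_band_gain": (-15, 10, "int"),
        "third_band_freq": (2000, 10000, "int"),
        "third_band_q": (5, 30, "int"),
        "high_shelf_gain": (-15, 10, "int"),
        "high_shelf_freq": (8000, 20000, "int"),
    },
    "stereo_imager": {
        "freq_1": (20, 1500, "int"),
        "freq_2": (1500, 7000, "int"),
        "freq_3": (7000, 20000, "int"),
        "side_balance_1": (0.0, 3.0, "float"),
        "side_balance_2": (0.0, 3.0, "float"),
        "side_balance_3": (0.0, 3.0, "float"),
        "bal_4": (0.0, 3.0, "float"),
    },
    "maximizer": {
        "threshold": (-10, -6, "int"),
        "attack_time": (0.1, 30.0, "float"),
        "release_time": (50, 100, "int"),
        "ratio": (5, 10, "int"),
        "makeup_gain": (4, 12, "int"),
    },
}

def sample_parameters(rng):
    """Draw one value per parameter, assuming uniform sampling (an assumption)."""
    params = {}
    for effect, ranges in PARAM_RANGES.items():
        params[effect] = {}
        for name, (low, high, kind) in ranges.items():
            if kind == "int":
                params[effect][name] = rng.randint(low, high)  # inclusive bounds
            else:
                params[effect][name] = rng.uniform(low, high)
    return params

params = sample_parameters(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is useful when the same style has to be applied to two different tracks.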