END-TO-END MUSIC REMASTERING SYSTEM

ABSTRACT

Mastering is an essential step in music production, but it is also a challenging task that requires experienced audio engineers, who adjust the tone, space, and volume of a song. Remastering follows the same technical process, with the goal of mastering a song to suit the sound of the current era. Because these tasks have high entry barriers, we aim to lower them by proposing an end-to-end music remastering system that transforms the mastering style of the input audio to that of a target. The system is trained in a self-supervised manner on released pop songs. By applying a pre-trained encoder and a projection discriminator, we expect the model to generate realistic audio that reflects the reference's mastering style. We validate our results with quantitative metrics and a subjective listening test and show that the model generates samples whose mastering style is similar to the target's.

Audio Samples


Results of Music Remastering from the proposed model.
The samples in this section were generated using our test dataset and were also used for the listening test.

Please listen to the samples through speakers, headphones, or earphones in a quiet environment.

Sample Index
#1	+	→
#2	+	→
#3	+	→
#4	+	→
#5	+	→
#6	+	→
#7	+	→
#8	+	→

Music Remastering application. The source track is remastered with the reference's mastering style.

Original Record (Network Input)	Reference Track	Remastered Track (Network Output)

PROPOSED METHOD

The Mastering Cloner aims to convert the mastering effects of the input track A1 into those of the reference track B2 by conditioning on the encoded feature of the reference track 2' extracted from the pre-trained Music Effects Encoder. For this procedure, tracks A and B, processed with the same mastering manipulation, serve as the ground truth A2 and the reference track B2, while A1, another manipulated sample, is used as the input of the Mastering Cloner. The discriminator is applied with the expectation that the model generates realistic sounds whose mastering effects are similar to those of the projected track.
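The pairing of tracks described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `make_gain_manipulator` and `build_training_example` are hypothetical, a plain gain change stands in for a full mastering manipulation, and A1 is assumed to be derived from track A.

```python
from typing import Callable, List, Tuple

Track = List[float]  # mono samples, for illustration only
Manipulator = Callable[[Track], Track]


def make_gain_manipulator(gain_db: float) -> Manipulator:
    """Hypothetical stand-in for a mastering manipulation: a plain gain change."""
    scale = 10.0 ** (gain_db / 20.0)
    return lambda track: [scale * x for x in track]


def build_training_example(
    track_a: Track,
    track_b: Track,
    style_1: Manipulator,
    style_2: Manipulator,
) -> Tuple[Track, Track, Track]:
    """Return (input A1, reference B2, ground truth A2).

    A2 and B2 share the same manipulation (style_2); A1 uses a different one
    (style_1), so the reference carries the style the output should acquire.
    """
    a1 = style_1(track_a)  # network input (assumed to come from track A)
    a2 = style_2(track_a)  # ground truth: track A in the target style
    b2 = style_2(track_b)  # reference: track B in the same target style
    return a1, b2, a2


track_a = [0.5, -0.25, 0.125]
track_b = [0.2, 0.4, -0.6]
a1, b2, a2 = build_training_example(
    track_a, track_b, make_gain_manipulator(-6.0), make_gain_manipulator(-12.0)
)
```

Because A2 and B2 are produced by the same manipulation, the reference can supply the target style without ever exposing the content of A2 to the model.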

EVALUATION

The quantitative evaluation uses four metrics: the RMS difference of the stereo channels, the RMS difference of the side channels (RMS-side), the frequency-weighted segmental signal-to-noise ratio (fw-SNR), and short-time objective intelligibility (STOI). They measure performance in terms of volume, stereo width, tone and timbre, and perceptual quality, respectively. Each metric is computed between the target mastered track and our model output.
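As a rough illustration of the two RMS-based metrics, the sketch below computes them on short stereo signals. The exact definitions used in the evaluation are not given here, so the choices in this code are assumptions: RMS is taken on a linear scale over the whole signal, the side channel is defined as (L - R) / 2, and the stereo difference is averaged over the left and right channels.

```python
import math
from typing import List, Sequence, Tuple

Stereo = Tuple[Sequence[float], Sequence[float]]  # (left, right)


def rms(x: Sequence[float]) -> float:
    """Root mean square of one channel."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def side_channel(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Side signal of a stereo pair, assumed here to be (L - R) / 2."""
    return [(l - r) / 2.0 for l, r in zip(left, right)]


def rms_difference(target: Stereo, output: Stereo) -> float:
    """Absolute RMS difference, averaged over the left and right channels."""
    diffs = [abs(rms(t) - rms(o)) for t, o in zip(target, output)]
    return sum(diffs) / len(diffs)


def rms_side_difference(target: Stereo, output: Stereo) -> float:
    """Absolute RMS difference between the side channels of two stereo tracks."""
    return abs(rms(side_channel(*target)) - rms(side_channel(*output)))


# A mono-compatible target (L == R) against an output with inverted right channel.
target = ([1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0])
output = ([1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0])
volume_error = rms_difference(target, output)
width_error = rms_side_difference(target, output)
```

In this toy case the per-channel loudness is identical, so only the side-channel metric registers the change, which is why the two metrics are reported separately for volume and stereo width.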

The listening test was conducted with 17 participants who are familiar with the concept of mastering. The participants were given a total of 15 questions and were instructed to rate each sample according to the similarity of its mastering style to that of the reference sample on a scale from 0 to 1.
Multiple post-hoc paired t-tests with Bonferroni correction, comparing each anchor against our generated samples, showed a significant difference (p < 0.001).
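For readers less familiar with the procedure, the standard form of a paired t-test with Bonferroni correction is recalled below. The notation is an assumption introduced here: d_i is the difference between the rating of a generated sample and the rating of one anchor for the i-th paired observation, n is the number of pairs, m is the number of anchor comparisons, and alpha is the family-wise significance level.

```latex
% Paired t-statistic for one anchor comparison (n - 1 degrees of freedom)
\[
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
\]
% Bonferroni correction over m comparisons at family-wise level \alpha
\[
\text{reject } H_0^{(k)} \text{ if } p_k < \frac{\alpha}{m}, \qquad k = 1, \dots, m
\]
```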

SUPPLEMENTARY MATERIALS

Quantitative Results According to Reference Track's Duration

The Music Effects Encoder is capable of encoding variable-length inputs. The results below are the quantitative results of the Mastering Cloner using different fixed durations (in seconds) of the reference track.

Mastering Effects Manipulator

The Mastering Effects Manipulator mimics the procedure of the Mastering Chain. The input track is manipulated in the order of Pre-gain → Equalizer → Stereo Imager → Maximizer and transformed into a different mastering style. (For Pre-gain, we fix the gain value to -8.0 dB.)
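The ordering of the chain can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the manipulator itself: only the pre-gain and a broadband side-gain stage are implemented (the actual Stereo Imager is 4-band, and the Equalizer and Maximizer are omitted), and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Stereo = Tuple[List[float], List[float]]  # (left, right)
Stage = Callable[[Stereo], Stereo]


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def pre_gain(stereo: Stereo, gain_db: float = -8.0) -> Stereo:
    """Scale both channels; -8.0 dB is the fixed value stated in the text."""
    g = db_to_linear(gain_db)
    left, right = stereo
    return [g * l for l in left], [g * r for r in right]


def side_balance(stereo: Stereo, balance: float) -> Stereo:
    """Broadband stand-in for a stereo imager: scale the side signal.

    Assumes balance is a side-gain multiplier: 0.0 collapses to mono,
    1.0 leaves the image unchanged, larger values widen it.
    """
    left, right = stereo
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    new_left = [m + balance * s for m, s in zip(mid, side)]
    new_right = [m - balance * s for m, s in zip(mid, side)]
    return new_left, new_right


def apply_chain(stereo: Stereo, stages: Sequence[Stage]) -> Stereo:
    """Apply the stages in order, feeding each output into the next stage."""
    for stage in stages:
        stereo = stage(stereo)
    return stereo


source = ([1.0, 0.5], [0.0, 0.5])
narrowed = apply_chain(source, [pre_gain, lambda s: side_balance(s, 0.0)])
```

Keeping each stage as a function from stereo audio to stereo audio makes the order explicit, so an equalizer or maximizer stage could be slotted into the list at the position given in the text.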

Original	Manipulated #1	Manipulated #2

The parameters and the ranges from which their values are randomly drawn are shown below.

Equalizer (5-band)
Parameter	Min.	Max.	Units	Type
low_shelf_gain	-15	10	dB	int
low_shelf_freq	20	1000	Hz	int
first_band_gain	-15	10	dB	int
first_band_freq	200	5000	Hz	int
first_band_q	5	30	-	int
second_band_gain	-15	10	dB	int
second_band_freq	500	6000	Hz	int
second_band_q	5	30	-	int
third_band_gain	-15	10	dB	int
third_band_freq	2000	10000	Hz	int
third_band_q	5	30	-	int
high_shelf_gain	-15	10	dB	int
high_shelf_freq	8000	20000	Hz	int

Stereo Imager (4-band)
Parameter	Min.	Max.	Units	Type
freq_1	20	1500	Hz	int
freq_2	1500	7000	Hz	int
freq_3	7000	20000	Hz	int
side_balance_1	0.0	3.0	-	float
side_balance_2	0.0	3.0	-	float
side_balance_3	0.0	3.0	-	float
bal_4	0.0	3.0	-	float

Maximizer (Compressor)
Parameter	Min.	Max.	Units	Type
threshold	-10	-6	dB	int
attack_time	0.1	30.0	ms	float
release_time	50	100	ms	int
ratio	5	10	-	int
makeup_gain	4	12	dB	int
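Drawing one random mastering style from the ranges listed above could look like the following sketch. The ranges and parameter names are copied from the listing; the uniform distribution, the inclusive bounds, and the function names `sample_params` and `sample_mastering_style` are assumptions made for illustration.

```python
import random
from typing import Dict, Optional, Tuple, Union

Number = Union[int, float]
Ranges = Dict[str, Tuple[Number, Number, type]]  # name -> (min, max, type)

EQUALIZER: Ranges = {
    "low_shelf_gain": (-15, 10, int), "low_shelf_freq": (20, 1000, int),
    "first_band_gain": (-15, 10, int), "first_band_freq": (200, 5000, int),
    "first_band_q": (5, 30, int),
    "second_band_gain": (-15, 10, int), "second_band_freq": (500, 6000, int),
    "second_band_q": (5, 30, int),
    "third_band_gain": (-15, 10, int), "third_band_freq": (2000, 10000, int),
    "third_band_q": (5, 30, int),
    "high_shelf_gain": (-15, 10, int), "high_shelf_freq": (8000, 20000, int),
}
STEREO_IMAGER: Ranges = {
    "freq_1": (20, 1500, int), "freq_2": (1500, 7000, int),
    "freq_3": (7000, 20000, int),
    "side_balance_1": (0.0, 3.0, float), "side_balance_2": (0.0, 3.0, float),
    "side_balance_3": (0.0, 3.0, float), "bal_4": (0.0, 3.0, float),
}
MAXIMIZER: Ranges = {
    "threshold": (-10, -6, int), "attack_time": (0.1, 30.0, float),
    "release_time": (50, 100, int), "ratio": (5, 10, int),
    "makeup_gain": (4, 12, int),
}


def sample_params(ranges: Ranges, rng: random.Random) -> Dict[str, Number]:
    """Draw each parameter uniformly from its range (uniformity is assumed)."""
    return {
        name: rng.randint(lo, hi) if kind is int else rng.uniform(lo, hi)
        for name, (lo, hi, kind) in ranges.items()
    }


def sample_mastering_style(seed: Optional[int] = None) -> Dict[str, object]:
    """Sample one full set of manipulator parameters; pre-gain stays fixed."""
    rng = random.Random(seed)
    return {
        "pre_gain_db": -8.0,  # fixed value stated in the text
        "equalizer": sample_params(EQUALIZER, rng),
        "stereo_imager": sample_params(STEREO_IMAGER, rng),
        "maximizer": sample_params(MAXIMIZER, rng),
    }


style = sample_mastering_style(seed=0)
```

Passing a seed makes a sampled style reproducible, which is convenient when the same manipulation has to be applied to two different tracks.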