Transfer Learning using Musical/Non-Musical Mixtures for Multi-Instrument Recognition

Conference: Speech Communication - 15th ITG Conference
20.09.2023 - 22.09.2023 in Aachen

doi:10.30420/456164009

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Bradl, Hannes (Joanneum Research Forschungsgesellschaft mbH, Austria)
Huber, Markus (sonible GmbH, Graz, Austria)
Pernkopf, Franz (Christian Doppler Laboratory for Dependable Intelligent Systems in Harsh Environments, Signal Processing and Speech Communication Lab., Graz University of Technology, Austria)

Abstract:
Datasets for most music information retrieval (MIR) tasks tend to be relatively small, yet in deep learning, insufficient training data often leads to poor performance. This problem is typically approached with transfer learning (TL) and data augmentation. In this work, we compare several such methods for the task of multi-instrument recognition. A convolutional neural network (CNN) learns to identify eight instrument families and seven specific instruments in polyphonic music recordings. Training is conducted in two phases: after pre-training on a music tagging dataset, the CNN is retrained using multi-track data. Experiments with different TL methods suggest that training the final fully-connected layers from scratch while fine-tuning the convolutional backbone yields the best performance. Two different mixing strategies, musical and non-musical mixing, are investigated. A blend of both strategies turns out to work best for multi-instrument recognition.
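As a concrete illustration of the two-phase training scheme, the following is a minimal PyTorch-style sketch, not the authors' implementation: the toy architecture, the checkpoint path, the joint 15-dimensional multi-label output (8 families + 7 specific instruments), and the learning rates are all illustrative assumptions.

import torch
import torch.nn as nn

class MusicTaggingCNN(nn.Module):
    """Toy stand-in for a CNN pre-trained on a music tagging dataset."""
    def __init__(self, num_tags: int = 50):
        super().__init__()
        self.backbone = nn.Sequential(   # convolutional feature extractor
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(       # fully-connected classifier
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_tags),
        )

    def forward(self, x):
        return self.head(self.backbone(x))

# Phase 1 result: load tagging weights (the file name is hypothetical).
model = MusicTaggingCNN()
model.load_state_dict(torch.load("tagging_pretrained.pt"))

# Phase 2: replace the head so the final fully-connected layers are trained
# from scratch, assuming a joint multi-label output covering 8 instrument
# families + 7 specific instruments = 15 targets.
model.head = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 15),
)

# Fine-tune the pre-trained backbone with a smaller learning rate than the
# freshly initialized head (the specific values are assumptions).
optimizer = torch.optim.Adam([
    {"params": model.backbone.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])
criterion = nn.BCEWithLogitsLoss()  # multi-label instrument recognition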
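The two mixing strategies can be sketched in the same spirit. The snippet below assumes the common reading of the terms, namely that musical mixing sums time-aligned stems from the same multi-track song while non-musical mixing sums stems drawn from different songs; the data layout (NumPy stems with per-stem multi-hot label vectors) and all function names are illustrative assumptions, not the authors' code.

import random
import numpy as np

def musical_mix(song_stems, song_labels, k=3):
    """Mix k stems from the SAME song: a musically coherent mixture."""
    idx = random.sample(range(len(song_stems)), k)
    mix = sum(song_stems[i] for i in idx)
    labels = np.clip(sum(song_labels[i] for i in idx), 0, 1)
    return mix, labels

def non_musical_mix(all_songs, all_labels, k=3):
    """Mix stems drawn from DIFFERENT songs: no musical relationship."""
    chosen = random.sample(range(len(all_songs)), k)
    stems, labels = [], []
    for s in chosen:
        i = random.randrange(len(all_songs[s]))
        stems.append(all_songs[s][i])
        labels.append(all_labels[s][i])
    n = min(len(x) for x in stems)       # crop to a common length
    mix = sum(x[:n] for x in stems)
    return mix, np.clip(sum(labels), 0, 1)

A blended strategy, as favored by the results above, could then simply draw training examples from both generators at some fixed ratio.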