Adapting the Frechet Audio Distance as an Objective Metric for Text-to-Speech Quality Evaluation

Conference: Speech Communication - 16th ITG Conference
24-26 September 2025 in Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5 | Language: English | Type: PDF

Authors:
Zavistanavicius, Laurynas; Zalkow, Frank; Dittmar, Christian; Stevenson, Robert L.

Abstract:
Within research on text-to-speech (TTS) synthesis, evaluation metrics are necessary for assessing the quality of synthesized speech. While listening tests are the gold standard, they are often costly and time-consuming, motivating the search for objective alternatives. Traditional signal processing metrics, such as Mel Cepstral Distortion (MCD), do not necessarily align with human perception of speech quality, and machine learning approaches typically require training on listening test data. Although the Frechet Distance has been applied in various domains, its specific correlation with human evaluation of speech quality remains unexplored. Our investigation reveals that, in certain embedding spaces, the Frechet Audio Distance (FAD) correlates strongly with human evaluations of speech quality, without relying on listening test scores to compute this metric. Therefore, we propose FAD as a predictor of listening test scores, offering a promising objective metric for TTS quality assessment.
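
For orientation, the Fréchet distance between two sets of embeddings is commonly computed by fitting a Gaussian to each set and evaluating ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^(1/2)). The following is a minimal sketch of that standard formula, not the authors' implementation: the function name `frechet_distance` is hypothetical, and the abstract does not state which embedding model produces the input arrays, so they are assumed to be precomputed arrays of shape (num_frames, embedding_dim).

```python
import numpy as np


def frechet_distance(ref_emb: np.ndarray, syn_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs are assumed to have shape (num_frames, embedding_dim) and to
    come from the same (unspecified) embedding model.
    """
    # Fit a Gaussian (mean and covariance) to each embedding set.
    mu_r, mu_s = ref_emb.mean(axis=0), syn_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(ref_emb, rowvar=False))
    cov_s = np.atleast_2d(np.cov(syn_emb, rowvar=False))

    # Tr((cov_r cov_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s. These are real and non-negative in
    # theory; clipping guards against small negative numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * trace_sqrt)
```

In this sketch, a lower value means the synthesized-speech embeddings are statistically closer to the reference embeddings. No listening test scores enter the computation, which matches the property highlighted in the abstract.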