DC Element | Value | Language
dc.contributor.advisor | Gerkmann, Timo | -
dc.contributor.advisor | Wermter, Stefan | -
dc.contributor.author | Fang, Huajian | -
dc.date.accessioned | 2025-07-03T12:33:05Z | -
dc.date.available | 2025-07-03T12:33:05Z | -
dc.date.issued | 2025 | -
dc.identifier.uri | https://ediss.sub.uni-hamburg.de/handle/ediss/11746 | -
dc.description.abstract |

Technological advancements profoundly impact numerous aspects of life, including how we communicate and interact. For instance, hearing aids enable hearing-impaired or elderly people to participate comfortably in daily conversations; telecommunications equipment lifts distance constraints, enabling people to communicate remotely; and smart machines are developed to interact with humans by understanding and responding to their instructions. These applications involve speech-based interaction not only between humans but also between humans and machines. However, the microphones mounted on such devices capture both the target speech and interfering sounds, which undermines the reliability of speech communication in noisy environments. For example, distorted speech signals may reduce communication fluency during teleconferencing, and noise interference can degrade the speech recognition and understanding modules of a voice-controlled machine. This calls for speech enhancement algorithms that extract clean speech and suppress undesired interfering signals, improving the overall quality and intelligibility of speech.

Traditional speech enhancement algorithms often rely on simplifying assumptions, such as slowly changing noise, to estimate the parameters required by clean speech estimators, which may lead to unsatisfactory results in acoustically challenging scenarios. In recent years, the field has made great strides through deep learning-based algorithms, whose success stems largely from the universal function approximation capability of deep neural networks (DNNs) and their scalability to large datasets. In particular, deep predictive approaches have received widespread attention due to their remarkable flexibility in incorporating key features of the target speech into various stages of the speech enhancement framework, including input feature processing, network architecture design, training objective formulation, and optimization strategy development. Essentially, deep predictive methods aim to learn a mapping between noisy mixtures and clean speech by training DNNs on a large number of paired noisy-clean speech samples. However, the performance of these algorithms depends heavily on the quantity and diversity of the training data, so performance often degrades under a data mismatch between training and testing, known as the generalization problem. Moreover, predictive approaches are typically framed as problems with a single output, which may yield erroneous estimates for complex and unseen samples without any indication of uncertainty. Indeed, due to the black-box nature of DNNs, deep learning-based algorithms produce clean speech estimates in a non-transparent manner, making them difficult to interpret.

In this thesis, we aim to incorporate statistical models into DNN-based speech enhancement to improve its robustness and interpretability. The first part of the thesis explores these ideas from the perspective of uncertainty. We augment predictive speech enhancement with an uncertainty estimation task, such that the network provides not only clean speech estimates but also their associated predictive uncertainty. Furthermore, since generic Bayesian methods for uncertainty modeling in deep learning usually involve costly sampling processes, this thesis leverages statistical knowledge from the speech processing domain to estimate uncertainty efficiently, with minimal computational overhead. We experimentally demonstrate that the proposed uncertainty-augmented framework effectively identifies when predictions deviate significantly from the true data by producing large uncertainty estimates, which allows us to assess the model's confidence in its predictions when clean speech ground truth is unavailable. Additionally, we show that uncertainty-augmented methods grounded in statistical modeling improve speech enhancement performance compared to methods that predict a single filter mask only. Next, we explore the direct use of uncertainty estimates in speech enhancement tasks. This includes unsupervised domain adaptation, where uncertainty-based filtering selects high-quality pseudo-targets to alleviate generalization issues. In another application, we model the uncertainty originating from distorted video signals, alongside audio inputs, in an audio-visual phoneme classification task and demonstrate how modality-wise uncertainty can be exploited for more effective and robust multimodal fusion.

In the second part of the thesis, we investigate interpretability and robustness by focusing on deep generative approaches. In contrast to predictive approaches, which learn a deterministic mapping between noisy and clean speech, deep generative approaches learn prior distributions of the given data and reuse this knowledge to perform speech enhancement during inference. We consider a specific group of methods that use a variational autoencoder (VAE) to learn a prior distribution of clean speech and combine it with an untrained non-negative matrix factorization (NMF)-based noise model to estimate a filter mask for speech enhancement. The statistically interpretable VAE-NMF framework generalizes better to unseen acoustic conditions than predictive methods. However, training the VAE solely on clean speech makes it susceptible to noise interference during testing, especially for inputs with low signal-to-noise ratios. In this part, we augment the speech and noise models separately with noise information to improve overall robustness in difficult acoustic conditions. The resulting noise-aware speech and noise models retain the high interpretability provided by statistical modeling while exhibiting improved speech enhancement performance in acoustically challenging environments.

| en
dc.language.iso | en | de_DE
dc.publisher | Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky | de
dc.rights | http://purl.org/coar/access_right/c_abf2 | de_DE
dc.subject | Speech Enhancement | en
dc.subject | Model-based Approaches | en
dc.subject | Uncertainty Estimation | en
dc.subject | Predictive and Generative Modeling | en
dc.subject.ddc | 004: Informatik | de_DE
dc.title | Model-Based Deep Speech Enhancement for Improved Interpretability and Robustness | en
dc.type | doctoralThesis | en
dcterms.dateAccepted | 2025-06-02 | -
dc.rights.cc | https://creativecommons.org/licenses/by/4.0/ | de_DE
dc.rights.rs | http://rightsstatements.org/vocab/InC/1.0/ | -
dc.subject.gnd | Sprachverarbeitung | de_DE
dc.subject.gnd | Maschinelles Lernen | de_DE
dc.type.casrai | Dissertation | -
dc.type.dini | doctoralThesis | -
dc.type.driver | doctoralThesis | -
dc.type.status | info:eu-repo/semantics/publishedVersion | de_DE
dc.type.thesis | doctoralThesis | de_DE
tuhh.type.opus | Dissertation | -
thesis.grantor.department | Informatik | de_DE
thesis.grantor.place | Hamburg | -
thesis.grantor.universityOrInstitution | Universität Hamburg | de_DE
dcterms.DCMIType | Text | -
dc.identifier.urn | urn:nbn:de:gbv:18-ediss-129133 | -
item.fulltext | With Fulltext | -
item.creatorOrcid | Fang, Huajian | -
item.advisorGND | Gerkmann, Timo | -
item.advisorGND | Wermter, Stefan | -
item.languageiso639-1 | other | -
item.creatorGND | Fang, Huajian | -
item.grantfulltext | open | -
Contained in collections: Elektronische Dissertationen und Habilitationen
Files in this item:
File | Checksum | Size | Format
Dissertation_Fang_FINALVERSION_signed.pdf | 2af31767bfd2832958911881efcf220c | 5.35 MB | Adobe PDF
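
The uncertainty-augmented predictive approach outlined in the abstract above can be made concrete with a minimal sketch: a mask-predicting network gets a second output head for a per-time-frequency-bin log-variance and is trained with a Gaussian negative log-likelihood, so that confidently wrong estimates are penalized and large predicted variance flags unreliable regions. The architecture, names, and loss below are illustrative assumptions, not the thesis's actual model.

    # Illustrative sketch (not the thesis implementation): a mask-predicting
    # network with an extra log-variance head, trained with a Gaussian
    # negative log-likelihood so that large predicted variance can flag
    # unreliable clean-speech estimates.
    import torch
    import torch.nn as nn

    class UncertainMaskNet(nn.Module):
        def __init__(self, n_freq=257, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
            self.mask_head = nn.Linear(hidden, n_freq)    # filter mask in [0, 1]
            self.logvar_head = nn.Linear(hidden, n_freq)  # per-bin log-variance

        def forward(self, noisy_mag):
            h, _ = self.rnn(noisy_mag)
            return torch.sigmoid(self.mask_head(h)), self.logvar_head(h)

    def gaussian_nll(est, target, logvar):
        # Negative log-likelihood of a Gaussian with mean `est` and variance
        # exp(logvar); being confidently wrong is penalized most heavily.
        return 0.5 * (logvar + (target - est) ** 2 / logvar.exp()).mean()

    # Toy usage with random tensors standing in for STFT magnitudes.
    net = UncertainMaskNet()
    noisy_mag = torch.rand(4, 100, 257)    # (batch, frames, freq bins)
    clean_mag = torch.rand(4, 100, 257)
    mask, logvar = net(noisy_mag)
    enhanced = mask * noisy_mag            # point estimate of clean speech
    gaussian_nll(enhanced, clean_mag, logvar).backward()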
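The abstract's uncertainty-based filtering for unsupervised domain adaptation admits a similarly small sketch: pseudo-targets generated on unlabeled target-domain data are kept only where the model's predictive uncertainty is low. The per-utterance score and quantile threshold below are hypothetical choices for illustration.

    # Illustrative sketch (assumed selection rule): keep only pseudo-targets
    # whose mean predictive uncertainty falls below a quantile threshold,
    # then reuse them as training targets in the unlabeled domain.
    import torch

    def select_pseudo_targets(enhanced, logvar, keep_ratio=0.5):
        """Return the enhanced utterances with the lowest mean uncertainty."""
        score = logvar.exp().mean(dim=(1, 2))             # one score per utterance
        keep = score <= torch.quantile(score, keep_ratio)
        return enhanced[keep], keep

    # Toy usage: 8 enhanced utterances with matching log-variances.
    enhanced = torch.rand(8, 100, 257)
    logvar = torch.randn(8, 100, 257)
    pseudo_targets, keep = select_pseudo_targets(enhanced, logvar)
    print(f"kept {int(keep.sum())} of {keep.numel()} utterances as pseudo-targets")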
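For the audio-visual phoneme classification application, one standard way to exploit modality-wise uncertainty is inverse-variance (precision) weighting, so that a heavily distorted video stream contributes less to the fused estimate; this particular fusion rule is assumed here for illustration and may differ from the thesis's exact formulation.

    # Illustrative sketch (assumed fusion rule): weight each modality's
    # phoneme evidence by its inverse predicted variance, so the corrupted
    # modality contributes less to the fused estimate.
    import torch

    def precision_weighted_fusion(mu_a, var_a, mu_v, var_v, eps=1e-8):
        w_a, w_v = 1.0 / (var_a + eps), 1.0 / (var_v + eps)
        return (w_a * mu_a + w_v * mu_v) / (w_a + w_v)

    # Toy usage: per-class scores and variances for 40 phoneme classes;
    # the distorted video modality carries high uncertainty.
    mu_a, var_a = torch.randn(40), torch.rand(40) + 0.1
    mu_v, var_v = torch.randn(40), torch.rand(40) + 5.0
    fused = precision_weighted_fusion(mu_a, var_a, mu_v, var_v)
    predicted_phoneme = int(fused.argmax())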
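Finally, the VAE-NMF combination from the second part of the thesis can be illustrated at inference time: the VAE decoder supplies a clean-speech variance, the NMF factors supply a noise variance, and a Wiener-type mask combines the two. This is a deliberately simplified sketch; an actual VAE-NMF system alternates between inferring the VAE latent variables and updating the NMF parameters.

    # Illustrative, simplified sketch of the VAE-NMF idea: only the final
    # mask computation is shown, with random stand-ins for the VAE output
    # and the (normally iteratively updated) NMF factors.
    import torch

    def wiener_mask(speech_var, noise_var, eps=1e-8):
        # Ratio of speech variance to total variance per time-frequency bin.
        return speech_var / (speech_var + noise_var + eps)

    F_bins, T_frames, rank = 257, 100, 8
    speech_var = torch.rand(F_bins, T_frames) + 0.1   # VAE decoder output
    W = torch.rand(F_bins, rank)                      # NMF spectral bases
    H = torch.rand(rank, T_frames)                    # NMF activations
    noise_var = W @ H                                 # NMF noise variance

    mask = wiener_mask(speech_var, noise_var)
    noisy_power = torch.rand(F_bins, T_frames)        # |noisy STFT|^2 stand-in
    clean_power_est = mask * noisy_power              # enhanced power spectrum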