DC Element | Value | Language
dc.contributor.advisor | Gerkmann, Timo | -
dc.contributor.advisor | Wermter, Stefan | -
dc.contributor.author | Fang, Huajian | -
dc.date.accessioned | 2025-07-03T12:33:05Z | -
dc.date.available | 2025-07-03T12:33:05Z | -
dc.date.issued | 2025 | -
dc.identifier.uri | https://ediss.sub.uni-hamburg.de/handle/ediss/11746 | -
dc.description.abstract |

Technological advancements profoundly impact numerous aspects of life, including how we communicate and interact. For instance, hearing aids enable hearing-impaired or elderly people to participate comfortably in daily conversations; telecommunications equipment lifts distance constraints, enabling people to communicate remotely; and smart machines are developed to interact with humans by understanding and responding to their instructions. These applications involve speech-based interaction not only between humans but also between humans and machines. However, the microphones mounted on such devices capture both the target speech and interfering sounds, which undermines the reliability of speech communication in noisy environments. For example, distorted speech signals may reduce communication fluency during teleconferencing, and noise interference can degrade the speech recognition and understanding modules of a voice-controlled machine. This calls for speech enhancement algorithms that extract clean speech and suppress undesired interfering signals, improving the overall quality and intelligibility of speech.

Traditional speech enhancement algorithms often rely on simplifying assumptions, such as slowly changing noise, to estimate the parameters required by clean speech estimators, which may lead to unsatisfactory results in acoustically challenging scenarios. In recent years, the field has made great strides through deep learning-based algorithms, whose success stems largely from the universal function approximation capability of deep neural networks (DNNs) and their scalability to large datasets. In particular, deep predictive approaches have received widespread attention due to their remarkable flexibility in incorporating key features of the target speech into various stages of the speech enhancement framework, including input feature processing, network architecture design, training objective formulation, and optimization strategy development. Essentially, deep predictive methods aim to learn a mapping between noisy mixtures and clean speech by training DNNs on a large number of paired noisy-clean speech samples. However, the performance of these algorithms depends heavily on the quantity and diversity of the training data, so performance often degrades under a data mismatch between training and testing, known as the generalization problem. Moreover, predictive approaches are typically framed as problems with a single output, which may yield erroneous estimates for complex and unseen samples without any indication of uncertainty. Indeed, due to the black-box nature of DNNs, deep learning-based algorithms produce clean speech estimates in a non-transparent manner, making them difficult to interpret.

In this thesis, we aim to incorporate statistical models into DNN-based speech enhancement to improve its robustness and interpretability. The first part of the thesis explores these ideas from the perspective of uncertainty. We augment predictive speech enhancement with an uncertainty estimation task, such that the network provides not only clean speech estimates but also their associated predictive uncertainty. Furthermore, since generic Bayesian methods for uncertainty modeling in deep learning usually involve costly sampling processes, this thesis leverages statistical knowledge from the speech processing domain to estimate uncertainty efficiently, with minimal computational overhead. We experimentally demonstrate that the proposed uncertainty-augmented framework effectively identifies when predictions deviate significantly from the true data by producing large uncertainty estimates, which allows us to assess the model's confidence in its predictions when clean speech ground truth is unavailable. Additionally, we show that uncertainty-augmented methods grounded in statistical modeling improve speech enhancement performance compared to methods that predict a single filter mask only. Next, we explore the direct use of uncertainty estimates in speech enhancement tasks. This includes unsupervised domain adaptation, where uncertainty-based filtering selects high-quality pseudo-targets to alleviate generalization issues. In another application, we model the uncertainty originating from distorted video signals, alongside audio inputs, in an audio-visual phoneme classification task and demonstrate how modality-wise uncertainty can be exploited for more effective and robust multimodal fusion.

In the second part of the thesis, we investigate interpretability and robustness by focusing on deep generative approaches. In contrast to predictive approaches, which learn a deterministic mapping between noisy and clean speech, deep generative approaches learn prior distributions of the given data and reuse this knowledge to perform speech enhancement during inference. We consider a specific group of methods that use a variational autoencoder (VAE) to learn a prior distribution of clean speech and combine it with an untrained non-negative matrix factorization (NMF)-based noise model to estimate a filter mask for speech enhancement. The statistically interpretable VAE-NMF framework generalizes better to unseen acoustic conditions than predictive methods. However, training the VAE solely on clean speech makes it susceptible to noise interference during testing, especially for inputs with low signal-to-noise ratios. In this part, we augment the speech and noise models separately with noise information to improve overall robustness in difficult acoustic conditions. The resulting noise-aware speech and noise models retain the high interpretability provided by statistical modeling while exhibiting improved speech enhancement performance in acoustically challenging environments.

| en
dc.language.iso | en | de_DE
dc.publisher | Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky | de
dc.rights | http://purl.org/coar/access_right/c_abf2 | de_DE
dc.subject | Speech Enhancement | en
dc.subject | Model-based Approaches | en
dc.subject | Uncertainty Estimation | en
dc.subject | Predictive and Generative Modeling | en
dc.subject.ddc | 004: Informatik | de_DE
dc.title | Model-Based Deep Speech Enhancement for Improved Interpretability and Robustness | en
dc.type | doctoralThesis | en
dcterms.dateAccepted | 2025-06-02 | -
dc.rights.cc | https://creativecommons.org/licenses/by/4.0/ | de_DE
dc.rights.rs | http://rightsstatements.org/vocab/InC/1.0/ | -
dc.subject.gnd | Sprachverarbeitung | de_DE
dc.subject.gnd | Maschinelles Lernen | de_DE
dc.type.casrai | Dissertation | -
dc.type.dini | doctoralThesis | -
dc.type.driver | doctoralThesis | -
dc.type.status | info:eu-repo/semantics/publishedVersion | de_DE
dc.type.thesis | doctoralThesis | de_DE
tuhh.type.opus | Dissertation | -
thesis.grantor.department | Informatik | de_DE
thesis.grantor.place | Hamburg | -
thesis.grantor.universityOrInstitution | Universität Hamburg | de_DE
dcterms.DCMIType | Text | -
dc.identifier.urn | urn:nbn:de:gbv:18-ediss-129133 | -
item.fulltext | With Fulltext | -
item.creatorOrcid | Fang, Huajian | -
item.advisorGND | Gerkmann, Timo | -
item.advisorGND | Wermter, Stefan | -
item.languageiso639-1 | other | -
item.creatorGND | Fang, Huajian | -
item.grantfulltext | open | -
Contained in collections: Elektronische Dissertationen und Habilitationen
Files in this item:
File | Checksum | Size | Format
Dissertation_Fang_FINALVERSION_signed.pdf | 2af31767bfd2832958911881efcf220c | 5.35 MB | Adobe PDF
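
The uncertainty-augmented predictive approach outlined in the abstract above can be made concrete with a minimal sketch: a mask-predicting network gets a second output head for a per-time-frequency-bin log-variance and is trained with a Gaussian negative log-likelihood, so that confidently wrong estimates are penalized and large predicted variance flags unreliable regions. The architecture, names, and loss below are illustrative assumptions, not the thesis's actual model.

    # Illustrative sketch (not the thesis implementation): a mask-predicting
    # network with an extra log-variance head, trained with a Gaussian
    # negative log-likelihood so that large predicted variance can flag
    # unreliable clean-speech estimates.
    import torch
    import torch.nn as nn

    class UncertainMaskNet(nn.Module):
        def __init__(self, n_freq=257, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
            self.mask_head = nn.Linear(hidden, n_freq)    # filter mask in [0, 1]
            self.logvar_head = nn.Linear(hidden, n_freq)  # per-bin log-variance

        def forward(self, noisy_mag):
            h, _ = self.rnn(noisy_mag)
            return torch.sigmoid(self.mask_head(h)), self.logvar_head(h)

    def gaussian_nll(est, target, logvar):
        # Negative log-likelihood of a Gaussian with mean `est` and variance
        # exp(logvar); being confidently wrong is penalized most heavily.
        return 0.5 * (logvar + (target - est) ** 2 / logvar.exp()).mean()

    # Toy usage with random tensors standing in for STFT magnitudes.
    net = UncertainMaskNet()
    noisy_mag = torch.rand(4, 100, 257)    # (batch, frames, freq bins)
    clean_mag = torch.rand(4, 100, 257)
    mask, logvar = net(noisy_mag)
    enhanced = mask * noisy_mag            # point estimate of clean speech
    gaussian_nll(enhanced, clean_mag, logvar).backward()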
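The abstract's uncertainty-based filtering for unsupervised domain adaptation admits a similarly small sketch: pseudo-targets generated on unlabeled target-domain data are kept only where the model's predictive uncertainty is low. The per-utterance score and quantile threshold below are hypothetical choices for illustration.

    # Illustrative sketch (assumed selection rule): keep only pseudo-targets
    # whose mean predictive uncertainty falls below a quantile threshold,
    # then reuse them as training targets in the unlabeled domain.
    import torch

    def select_pseudo_targets(enhanced, logvar, keep_ratio=0.5):
        """Return the enhanced utterances with the lowest mean uncertainty."""
        score = logvar.exp().mean(dim=(1, 2))             # one score per utterance
        keep = score <= torch.quantile(score, keep_ratio)
        return enhanced[keep], keep

    # Toy usage: 8 enhanced utterances with matching log-variances.
    enhanced = torch.rand(8, 100, 257)
    logvar = torch.randn(8, 100, 257)
    pseudo_targets, keep = select_pseudo_targets(enhanced, logvar)
    print(f"kept {int(keep.sum())} of {keep.numel()} utterances as pseudo-targets")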
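For the audio-visual phoneme classification application, one standard way to exploit modality-wise uncertainty is inverse-variance (precision) weighting, so that a heavily distorted video stream contributes less to the fused estimate; this particular fusion rule is assumed here for illustration and may differ from the thesis's exact formulation.

    # Illustrative sketch (assumed fusion rule): weight each modality's
    # phoneme evidence by its inverse predicted variance, so the corrupted
    # modality contributes less to the fused estimate.
    import torch

    def precision_weighted_fusion(mu_a, var_a, mu_v, var_v, eps=1e-8):
        w_a, w_v = 1.0 / (var_a + eps), 1.0 / (var_v + eps)
        return (w_a * mu_a + w_v * mu_v) / (w_a + w_v)

    # Toy usage: per-class scores and variances for 40 phoneme classes;
    # the distorted video modality carries high uncertainty.
    mu_a, var_a = torch.randn(40), torch.rand(40) + 0.1
    mu_v, var_v = torch.randn(40), torch.rand(40) + 5.0
    fused = precision_weighted_fusion(mu_a, var_a, mu_v, var_v)
    predicted_phoneme = int(fused.argmax())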
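Finally, the VAE-NMF combination from the second part of the thesis can be illustrated at inference time: the VAE decoder supplies a clean-speech variance, the NMF factors supply a noise variance, and a Wiener-type mask combines the two. This is a deliberately simplified sketch; an actual VAE-NMF system alternates between inferring the VAE latent variables and updating the NMF parameters.

    # Illustrative, simplified sketch of the VAE-NMF idea: only the final
    # mask computation is shown, with random stand-ins for the VAE output
    # and the (normally iteratively updated) NMF factors.
    import torch

    def wiener_mask(speech_var, noise_var, eps=1e-8):
        # Ratio of speech variance to total variance per time-frequency bin.
        return speech_var / (speech_var + noise_var + eps)

    F_bins, T_frames, rank = 257, 100, 8
    speech_var = torch.rand(F_bins, T_frames) + 0.1   # VAE decoder output
    W = torch.rand(F_bins, rank)                      # NMF spectral bases
    H = torch.rand(rank, T_frames)                    # NMF activations
    noise_var = W @ H                                 # NMF noise variance

    mask = wiener_mask(speech_var, noise_var)
    noisy_power = torch.rand(F_bins, T_frames)        # |noisy STFT|^2 stand-in
    clean_power_est = mask * noisy_power              # enhanced power spectrum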