Title: Domain Defining Context: On Domain-Dependent Corpus Expansion and Contextualized Semantic Structuring
Language: English
Author: Remus, Steffen
Keywords: contextualized; embedding; sense; domain; knowledge
GND subject headings: Context-aware system (Kontextbezogenes System); Spider <program> (Spider <Programm>); Natural language (Natürliche Sprache); Artificial intelligence (Künstliche Intelligenz)
Publication date: 2023-04
Date of oral examination: 2023-06-19
Natural language processing (NLP) is the task of processing potentially very large text collections and structuring the contained information so that it can be presented to the human user in a structured and condensed form. In linguistics and philosophy, it has been stated that communication between humans can be misleading and ambiguous and that the situational context is important to resolve the intended and expected meaning of utterances. As intelligent beings, however, humans can use the situational context, such as body language, gestures, tone, and surroundings, to resolve ambiguities. But using computers to achieve a goal based on natural language input or output is always prone to ambiguity on various levels because the situational context cannot be easily supplied. A computer task that involves NLP, e.g., finding documents given a query or classifying documents into pre-defined classes such as spam or ham (not spam), often rests on assumptions by the human user that the computer cannot easily disambiguate without proper context.

This dissertation presents several studies that incorporate context at various levels. First, we use the definition of a domain to set the context on a vocabulary basis, i.e., by using datasets that contain mainly content words from the desired domain, we narrow down the sample space of ambiguities. For instance, consider searching for the word 'virus' in a search engine: without any domain restriction, the best a computer can return is the so-called major sense or mixed results based on the processed data, i.e., documents about organisms, computers, or anything else will be returned if the underlying corpus is too generic. By restricting the underlying dataset to the biology domain, computers will reply mainly, or in the ideal case exclusively, with documents from the biology domain. To collect domain corpora that can be used as a basis for NLP algorithms, we propose a simple yet effective technique for focused web crawling using statistical language models. Focused web crawling is resource-friendly since it ideally downloads only documents that are relevant to the domain of interest. The domain of interest is defined by a statistical language model created from a rather small corpus, e.g., we show that a single Wikipedia article combined with a simple Kneser-Ney three-gram model is sufficient to guide the web-crawling process efficiently.
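As a rough illustration of how a language model can guide a focused crawler, the following Python sketch trains a trigram model on a small seed text and ranks candidate pages by perplexity, so that pages resembling the seed domain are visited first. It is a minimal sketch under stated assumptions, not the dissertation's implementation: add-k smoothing stands in for Kneser-Ney smoothing to keep the example short, and the seed text, the candidate pages, the example.org URLs, and the names `TrigramModel` and `rank_candidates` are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


class TrigramModel:
    """Trigram model with add-k smoothing (a stand-in for Kneser-Ney)."""

    def __init__(self, seed_text, k=0.1):
        tokens = ["<s>", "<s>"] + tokenize(seed_text) + ["</s>"]
        self.k = k
        self.vocab = set(tokens) | {"<unk>"}
        self.trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        # Context counts: how often each two-word history precedes any word.
        self.contexts = Counter()
        for (a, b, _), count in self.trigrams.items():
            self.contexts[(a, b)] += count

    def perplexity(self, text):
        """Per-token perplexity of a text; lower means closer to the seed."""
        words = [t if t in self.vocab else "<unk>" for t in tokenize(text)]
        tokens = ["<s>", "<s>"] + words + ["</s>"]
        log_prob, n = 0.0, 0
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            num = self.trigrams[(a, b, c)] + self.k
            den = self.contexts[(a, b)] + self.k * len(self.vocab)
            log_prob += math.log(num / den)
            n += 1
        return math.exp(-log_prob / n)


def rank_candidates(model, pages):
    """Return URLs sorted by ascending perplexity (most in-domain first)."""
    return sorted(pages, key=lambda url: model.perplexity(pages[url]))


# Hypothetical seed text standing in for a single encyclopedia article.
seed = ("a virus is an infectious agent that replicates inside the living "
        "cells of an organism . viruses infect animals plants and bacteria")
model = TrigramModel(seed)

# Hypothetical candidate pages keyed by placeholder URLs.
pages = {
    "https://example.org/it": "install antivirus software to protect your computer",
    "https://example.org/bio": "a virus replicates inside the living cells",
}
ranking = rank_candidates(model, pages)
```

In a real crawler, the ranking would feed a priority queue of outgoing links, and a perplexity threshold would decide which downloaded documents are kept for the domain corpus.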

Another option to supply context is the computational representation of words: in NLP, which nowadays relies mainly on machine learning techniques, words, documents, or, more generally, samples are represented by mathematical vectors, a.k.a. embeddings. Estimating such embeddings is an active area of research. In this dissertation, we retrofit so-called static word embeddings, which assign the same vector to every occurrence of a word, to so-called sense embeddings, in which each embedding in the vector space represents a single, distinct sense of a word. Another popular representation is contextualized word embeddings, which are estimated by deep neural networks. Here, the so-called attention module usually provides a flow of information from one word of a sequence to another, i.e., when a sequence is supplied, the word of interest is implicitly disambiguated, and its embedding shifts location according to the sequential context. We investigate the ability of contextualized word embeddings to model senses, analyze their performance in an information retrieval setup, and test their suitability to induce semantic relations from text. We show that contextualized embeddings are very suitable for modeling senses, retrieving similar sentences regarding a certain objective, and inducing semantic relations using unsupervised clustering methodologies. Finally, we can confirm that context certainly matters.
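The idea that contextualized embeddings of the same word separate by sense can be sketched with a small clustering example in Python. This is an illustrative sketch only: the three-dimensional vectors are made-up stand-ins for embeddings that a neural model would produce, the greedy cosine-threshold clustering is one simple unsupervised choice and not necessarily the method used in the dissertation, and the names `induce_senses` and `occurrences` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def induce_senses(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar cluster or open a new one.

    Returns a list of clusters, each a list of indices into `embeddings`.
    """
    clusters = []
    for i, vec in enumerate(embeddings):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, centroid([embeddings[j] for j in cluster]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])  # no cluster is similar enough: new sense
        else:
            best.append(i)
    return clusters


# Made-up vectors for the word 'virus' in four different sentences.
occurrences = [
    ("the virus infects host cells", [0.9, 0.1, 0.0]),
    ("a virus spread through the population", [0.8, 0.2, 0.1]),
    ("the virus encrypted files on the laptop", [0.1, 0.9, 0.2]),
    ("antivirus software removed the virus", [0.0, 0.8, 0.3]),
]
clusters = induce_senses([vec for _, vec in occurrences])
```

With these toy vectors, the two biological uses and the two computing uses of 'virus' end up in separate clusters; averaging the vectors of each cluster would give one sense embedding per cluster.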
URL: https://ediss.sub.uni-hamburg.de/handle/ediss/10394
URN: urn:nbn:de:gbv:18-ediss-110991
Document type: Dissertation
Supervisor: Biemann, Chris
Appears in collections: Electronic Dissertations and Habilitations

Files in this resource:
File: Steffen_Remus__Dissertation__Domain-Defining-Context_On-Domain-Dependent-Corpus-Expansion-and-Contextualized-Semantic-Structuring.pdf
Checksum: 932de81f764bffa131c3e5ca09e4ccc0
Size: 11.13 MB
Format: Adobe PDF