Multimodal Social Cue Integration for Attention Modeling and Robot Gaze Control

Abawi, Fares

DC Element	Value	Language
dc.contributor.advisor	Wermter, Stefan	-
dc.contributor.author	Abawi, Fares	-
dc.date.accessioned	2025-01-28T15:42:07Z	-
dc.date.available	2025-01-28T15:42:07Z	-
dc.date.issued	2024	-
dc.identifier.uri	https://ediss.sub.uni-hamburg.de/handle/ediss/11427	-
dc.description.abstract	Cognitive modeling is the creation of models of human behavior that can also be used to inform the development of intelligent robots. A common cognitive modeling task is saliency prediction. Saliency models predict regions in an image or video where a group of observers are most likely to gaze. Existing work on saliency models formulates the task as an end-to-end problem, predicting attention as a function of the stimuli. In this thesis, we identify the importance of social cues in directing attention, and therefore, introduce priors representing social cues into our models. A model augmented with social cues is defined as a social attention model. We show that the explicit representation of social cues improves the performance of existing saliency models. In contrast to saliency models, scanpath models predict the gaze trajectories of individual observers. We extend saliency models with a fixation history module, transforming them into scanpath prediction models. This transformation is necessary for deploying attention models on robots, especially in Human-Robot Interaction (HRI) settings, as it allows robots to exhibit humanlike gaze patterns rather than infer gaze transitions based on the aggregated attention of a group of observers. Additionally, it allows for the personalization of scanpaths using a single unified model, which in turn reduces the training time significantly, as well as the number of models required to achieve the same objective. Toward achieving our objective, we begin by evaluating the impact of non-verbal social cues on audiovisual saliency models. We design deep-learning models that integrate these social cues with existing saliency models, thereby improving saliency prediction in social settings. Saliency and social cues are represented as spatiotemporal maps and integrated through neural attention and gating mechanisms. 
A major advantage of our map representation approach is the ability to replace these maps at inference time without having to retrain or fine-tune the social attention model. We propose two architectures for integrating these maps. The first, which we term late integration, combines features from multiple modality streams using convolutional Attentive Long Short-Term Memory (ALSTM) units. The resulting feature maps are then propagated to a Gated Multimodal Unit (GMU) model. The second integration architecture, which we term early fusion, lets one modality influence another via the GMU, which precedes the ALSTM, while maintaining separate streams for each modality. This allows us to weigh and quantify the impact of each social cue on task performance. Given that the saliency representation maps closely resemble our social attention model output, there is a risk of shortcut learning. This means that the model might become overly dependent on the most reliable cue, ignoring all others. Thus, to mitigate shortcut learning, we develop a neural attention inversion module, which we term the Directed Attention Module (DAM), based on the Squeeze-and-Excitation network. The DAM predicts the inverse of the social cue and saliency representation maps, thereby uniformly distributing the attention weights among social cue modalities. Therefore, it allows our social attention model to rely on all modality representations rather than only those of the most salient modality. Furthermore, to investigate the performance of our models under real-world conditions, we develop a software framework called Wrapyfi, which allows us to deploy and distribute the models on multiple machines and robots. Wrapyfi facilitates the distribution of models by introducing a common interface for different message-oriented and robotics middleware. 
This framework reduces the boilerplate code necessary for conducting robotic experiments by abstracting communication protocols and providing plugins that enable the exchange of many data types, including those defined by deep-learning frameworks. This allows us to focus on the design of our experimental pipelines, rather than the communication protocols between robots and software components. We utilize Wrapyfi to conduct HRI studies exploring the influence of robot social cues, namely gaze direction and facial expressions, on human behavior, collaboration, and perception. Moreover, Wrapyfi is used to manage the communication exchanges for our cognitive robotic simulations. These simulations rely on the embodiment of our social attention models into a physical robotic platform, demonstrating their resilience to sensor noise and their applicability in HRI. We introduce paradigms for quantitatively evaluating these cognitive simulations, allowing us to scale up the assessment of our models’ performance on robots, without requiring human feedback. Realizing the impact of sound on attention and gaze, we extend an existing audiovisual saliency prediction model with an additional auditory stream, effectively transforming it into a binaural model. This enables the model to localize sound in videos, thus expanding the capabilities of social attention models relying on its representation maps. Additionally, given that the attention patterns of individual observers are distinct from those of a group of observers, we extend our social attention saliency model into a scanpath predictor by integrating a fixation history module. Finally, the model is validated in a cognitive robotic simulation setup, allowing us to compare the robot’s performance to that of humans.	en
dc.description.abstract	Cognitive modeling serves to create models of human behavior and to support the development of intelligent robots. A common cognitive modeling task is saliency prediction. Saliency models predict which regions of an image or video observers, as a group, are most likely to look at. Existing research on saliency models formulates saliency prediction as an end-to-end problem in which the attention of the observers is predicted as a function of the input stimuli. In this dissertation, we investigate the importance of social cues in directing attention and therefore introduce priors, represented by social cues, into our models. We define a model extended with social cues as a social attention model. We show that incorporating social cues improves the performance of existing saliency models. In contrast to saliency models, scanpath models predict the gaze trajectories of individual observers. We extend saliency models with a module that incorporates the fixation history, turning them into scanpath models. This transformation is necessary for deploying attention models on robots, particularly in the context of Human-Robot Interaction (HRI), because it enables robots to generate individual, humanlike gaze patterns rather than deriving gaze transitions from aggregated group attention. Moreover, it facilitates the personalization of scanpaths with a single unified model, which in turn considerably reduces the training time and the number of dedicated models that would be required to achieve the same goal. To reach our goal, we begin by assessing the influence of non-verbal social cues on audiovisual saliency models. 
We develop deep-learning approaches that integrate these social cues into existing saliency models and thereby improve their performance in social settings. Saliency and social cues are represented as spatiotemporal maps and integrated through neural attention and gating mechanisms. A major advantage of our map representation is the ability to swap these maps at inference time without having to retrain or fine-tune the social attention model. We propose two architectures for integrating these maps. The first, which we call late integration, combines features from multiple modalities using the convolutional Attentive LSTM (ALSTM) model. The resulting feature maps are then passed to a Gated Multimodal Unit (GMU) model. In the second integration architecture, which we call early fusion, one modality modulates another because the GMU precedes the ALSTM units, while the modalities remain separate. This makes it possible to weight each social cue and to determine its influence on model performance. Because the saliency representation maps resemble the outputs of our social attention model, shortcut learning can occur as a drawback. That is, the model relies only on the most reliable cue and ignores all other cues. To reduce shortcut learning, we therefore develop a neural attention inversion module, which we call the Directed Attention Module (DAM), based on the Squeeze-and-Excitation network. The DAM predicts the inverse of the social cue and saliency representation maps and thus distributes the attention weights uniformly across the social cues. It thereby enables our social attention model to rely on all modality representations rather than on those of a single modality. 
Furthermore, to investigate the performance of our models under real-world conditions, we develop a software framework called Wrapyfi, which allows us to deploy and distribute the models on multiple computers and robots. Wrapyfi facilitates the distribution of models by providing a common interface for different message-oriented and robotics middleware. This framework reduces the boilerplate code needed to conduct robotic experiments by abstracting communication protocols and providing plugins that enable the exchange of many data types, including those defined by deep-learning frameworks. This lets us concentrate on the design of our experimental pipelines rather than on the communication protocols between robots and software components. We use Wrapyfi to conduct HRI studies that examine the influence of robot social cues, namely gaze behavior and facial expressions, on human behavior, collaboration, and perception. In addition, Wrapyfi is used to manage the communication exchange for our cognitive robotic simulations. These simulations are based on embedding our social attention models in a physical robotic platform and demonstrate their robustness to sensor noise and their applicability in HRI. We introduce paradigms for evaluating these cognitive simulations quantitatively, which makes it possible to scale up the performance assessment of our models on robots without requiring human feedback. Considering the influence of sound on attention and gaze direction, we extend an existing audiovisual saliency model with an additional auditory stream, effectively turning it into a binaural model. 
This enables the model to localize sounds in videos and thus expands the capabilities of social attention models that rely on its representation maps. Because the attention patterns of individual observers differ from those of a group of observers, we extend our social attention saliency model into a scanpath model by means of a fixation history module. Finally, the model is validated in experiments with human participants and with the robot in a cognitive simulation.	de
dc.language.iso	en	de_DE
dc.publisher	Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky	de
dc.rights	http://purl.org/coar/access_right/c_abf2	de_DE
dc.subject	Robot Gaze Control	en
dc.subject	Multimodal Social Cue Integration	en
dc.subject	Human-Robot Interaction	en
dc.subject	Attention Modeling	en
dc.subject	Early Fusion and Late Integration	en
dc.subject	Middleware	en
dc.subject.ddc	004: Informatik	de_DE
dc.title	Multimodal Social Cue Integration for Attention Modeling and Robot Gaze Control	en
dc.type	doctoralThesis	en
dcterms.dateAccepted	2024-12-16	-
dc.rights.cc	https://creativecommons.org/licenses/by/4.0/	de_DE
dc.rights.rs	http://rightsstatements.org/vocab/InC/1.0/	-
dc.subject.bcl	54.72: Künstliche Intelligenz	de_DE
dc.subject.gnd	Convolutional Neural Network	de_DE
dc.subject.gnd	Neuronales Netz	de_DE
dc.subject.gnd	Humanoider Roboter	de_DE
dc.subject.gnd	Salienz	de_DE
dc.subject.gnd	Visuelle Aufmerksamkeit	de_DE
dc.subject.gnd	Verteilte künstliche Intelligenz	de_DE
dc.type.casrai	Dissertation	-
dc.type.dini	doctoralThesis	-
dc.type.driver	doctoralThesis	-
dc.type.status	info:eu-repo/semantics/publishedVersion	de_DE
dc.type.thesis	doctoralThesis	de_DE
tuhh.type.opus	Dissertation	-
thesis.grantor.department	Informatik	de_DE
thesis.grantor.place	Hamburg	-
thesis.grantor.universityOrInstitution	Universität Hamburg	de_DE
dcterms.DCMIType	Text	-
datacite.relation.IsSupplementedBy	doi:10.24963/ijcai.2021/81	de_DE
datacite.relation.IsSupplementedBy	doi:10.1007/s12369-023-00993-3	de_DE
datacite.relation.IsSupplementedBy	doi:10.1109/RO-MAN57019.2023.10309334	de_DE
datacite.relation.IsSupplementedBy	doi:10.1145/3610977.3637471	de_DE
datacite.relation.IsSupplementedBy	doi:10.1145/3610978.3640580	de_DE
datacite.relation.IsSupplementedBy	doi:10.48550/arXiv.2405.02929	de_DE
dc.identifier.urn	urn:nbn:de:gbv:18-ediss-117303	-
item.languageiso639-1	other	-
item.fulltext	With Fulltext	-
item.advisorGND	Wermter, Stefan	-
item.grantfulltext	open	-
item.creatorOrcid	Abawi, Fares	-
item.creatorGND	Abawi, Fares	-
Appears in collections:	Elektronische Dissertationen und Habilitationen
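
The abstract describes gating between modality streams with a Gated Multimodal Unit (GMU). As a rough illustration only, the following is a minimal sketch of the commonly cited two-input GMU equations on plain feature vectors, not the convolutional, map-based architecture of the thesis. All names (`gmu`, `x_saliency`, `x_gaze`, the weight matrices) and the random weights are hypothetical and chosen for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gmu(x_a, x_b, W_a, W_b, W_z):
    """Two-input gated fusion (illustrative, untrained).

    h_a = tanh(W_a x_a), h_b = tanh(W_b x_b)
    z   = sigmoid(W_z [x_a; x_b])
    h   = z * h_a + (1 - z) * h_b
    """
    h_a = [math.tanh(v) for v in matvec(W_a, x_a)]
    h_b = [math.tanh(v) for v in matvec(W_b, x_b)]
    # The gate sees both inputs and decides, per hidden unit,
    # how much each modality contributes to the fused feature.
    z = [sigmoid(v) for v in matvec(W_z, x_a + x_b)]
    h = [zi * ha + (1.0 - zi) * hb for zi, ha, hb in zip(z, h_a, h_b)]
    return h, z


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy features standing in for two modality streams.
rng = random.Random(0)
x_saliency = [0.9, 0.1, 0.4]
x_gaze = [0.2, 0.8, 0.5]
hidden = 2
W_a = random_matrix(hidden, 3, rng)
W_b = random_matrix(hidden, 3, rng)
W_z = random_matrix(hidden, 6, rng)

fused, gate = gmu(x_saliency, x_gaze, W_a, W_b, W_z)
print("gate:", gate)
print("fused:", fused)
```

Each gate value lies between 0 and 1, so inspecting `gate` gives a per-unit reading of how strongly each input stream contributes, which is in the spirit of the abstract's remark about weighing the impact of each cue.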

Files in this item:

File	Description	Checksum	Size	Format
DoctoralDissertation_FaresAbawi_2024.pdf		b169281bd8d381e1926d7da14e229332	29.28 MB	Adobe PDF
