DISSERTATION DEFENSE
Author: Xiangyu Hu
Advisor: Dr. Yan Tong
Date: Oct 24th, 2025
Time: 12:30 pm
Place: Teams
Link: https://teams.microsoft.com/l/meetup-join/19%3ameeting_Zjk5ZGM3NzctMzZm…
Abstract
With the rapid progress of deep learning, Facial Expression Recognition (FER) has seen substantial improvements in performance, particularly "in the wild," i.e., under real-world conditions. Despite these advances, most existing methods rely on features extracted from facial images as the sole emotional cue, which limits the model's ability to capture the full complexity of human emotional expressions.
In reality, facial expressions convey diverse, multi-perspective information, including appearance-based cues and geometric structural deformations caused by the activation of facial muscles. Relying exclusively on one type of representation may fail to exploit the complementary nature of these cues, an issue that becomes especially pronounced under real-world conditions involving poor image quality, occlusion, varying head poses, and diverse personal attributes.
To overcome this limitation, we investigate how multiple perspectives of information, such as multi-level semantic patterns, facial geometry captured by facial landmarks, and multimodal representations of facial expressions, can be effectively extracted from the same facial image and integrated to enrich expression-discriminative feature representations. This multi-perspective feature learning strategy not only provides a holistic interpretation of facial expressions but also encourages the learning of robust, multi-level representations that enhance generalization.
Motivated by this, we introduce three novel models designed to extract and fuse complementary features across different representations from facial images, thereby improving both the accuracy and robustness of FER systems.
First, we propose a Cascaded Feature Fusion Network (CFFN) that leverages low-level semantic features to refine predictions typically dominated by high-level semantic information. CFFN adopts a multi-branch architecture with Semantic Feature Fusion Blocks (SFFB) that enable effective communication between neighboring branches. Additionally, Multi-Branch Fusion Blocks (MBFB) integrate multi-scale semantic features, enabling predictions from multi-level features. Experimental results demonstrate that the proposed model achieves state-of-the-art performance, with further cross-dataset evaluations highlighting its generalization capability.
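To make the cascaded fusion idea concrete, the following is a minimal PyTorch-style sketch of how a lower-level branch might feed a neighboring higher-level branch and how multi-branch features could be pooled for prediction. The block names, 1x1-convolution projection, bilinear resizing, and channel sizes are illustrative assumptions, not the dissertation's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFeatureFusionBlock(nn.Module):
    """Sketch of an SFFB-like block: fuses a lower-level feature map into a
    neighboring higher-level branch (design details are assumptions)."""
    def __init__(self, low_channels, high_channels):
        super().__init__()
        self.project = nn.Conv2d(low_channels, high_channels, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Project low-level channels, resize to the high-level resolution, then add.
        low = F.interpolate(self.project(low_feat), size=high_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        return high_feat + low

class MultiBranchFusionHead(nn.Module):
    """Sketch of an MBFB-like head: pools each branch, concatenates the pooled
    vectors, and predicts from the combined multi-scale representation."""
    def __init__(self, branch_channels, num_classes):
        super().__init__()
        self.classifier = nn.Linear(sum(branch_channels), num_classes)

    def forward(self, branch_feats):
        pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in branch_feats]
        return self.classifier(torch.cat(pooled, dim=1))
```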
Second, we propose a Context-Aware Multi-cue Model (CAMM) to enhance FER by jointly leveraging appearance, geometric, and semantic information. The framework utilizes two coordinated CNN backbones to extract complementary facial appearance and geometry features, while a pretrained vision–language model generates descriptive captions that are encoded into semantic embeddings. These embeddings are incorporated into both visual branches through a Text Fusion Block (TFB) built upon Adaptive Instance Normalization, enabling adaptive modulation of visual representations guided by global semantic context. In addition, a Weighted Dilated Block (WDB) is introduced to aggregate multi-scale spatial information with learnable attention weights, thereby enhancing contextual perception. By aligning high-level semantics with spatial structure and visual appearance, CAMM produces robust and discriminative representations, achieving state-of-the-art performance under real-world conditions.
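As an illustration of how Adaptive Instance Normalization can modulate visual features with a caption embedding, below is a minimal sketch of a TFB-like module. The linear mapping from the text embedding to per-channel scale and shift, and all dimensions, are assumptions for exposition only.

```python
import torch
import torch.nn as nn

class TextFusionBlock(nn.Module):
    """AdaIN-style modulation: a caption embedding predicts per-channel scale
    and shift applied to instance-normalized visual features (illustrative)."""
    def __init__(self, num_channels, text_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_scale_shift = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, visual_feat, text_emb):
        # visual_feat: (B, C, H, W); text_emb: (B, text_dim)
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return (1 + scale) * self.norm(visual_feat) + shift

# Usage with random tensors (shapes are hypothetical):
tfb = TextFusionBlock(num_channels=256, text_dim=512)
feat = torch.randn(4, 256, 14, 14)   # visual feature map
caption_emb = torch.randn(4, 512)    # encoded caption
out = tfb(feat, caption_emb)         # modulated features, same shape as feat
```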
Third, we introduce a Semantic-Consensus Multi-Modal Learning (SC-MML) framework to address the challenge of noisy labels in in-the-wild FER datasets. SC-MML incorporates high-level textual descriptions generated by a pretrained vision–language model as an auxiliary modality, providing robust semantic cues that capture nuanced facial attributes and contextual emotion. The framework comprises two coordinated components: a Consensus Branch that constructs noise-robust soft labels by aggregating mutual nearest neighbors across visual and textual embedding spaces, and a Discriminative Branch equipped with a Query-Guided Gated Fusion (QGGF) module. The QGGF adaptively fuses semantic and visual representations through a gating mechanism that highlights consistent and informative cues while suppressing noisy or redundant information. By grounding supervision in cross-modal semantic consensus rather than potentially corrupted categorical annotations, SC-MML effectively decouples learning from noisy labels and enhances representation reliability. This consensus-driven design strengthens robustness to annotation noise and improves generalization in complex real-world scenarios. Extensive evaluations on multiple benchmark FER datasets demonstrate that SC-MML surpasses existing noise-robust methods, offering a principled and efficient paradigm for multimodal learning under noisy supervision.
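To convey the flavor of consensus-based soft labels, here is a minimal sketch that builds soft targets from samples that are mutual k-nearest neighbors in both the visual and textual embedding spaces. The k-NN formulation, the fallback to the given label, and the 50/50 blending weight are all assumptions made for illustration; they are not the dissertation's actual Consensus Branch.

```python
import torch
import torch.nn.functional as F

def consensus_soft_labels(visual_emb, text_emb, given_labels, num_classes, k=5):
    """Illustrative consensus targets: aggregate the labels of samples that are
    mutual k-NN in BOTH the visual and textual embedding spaces."""
    v = F.normalize(visual_emb, dim=1)   # (N, Dv)
    t = F.normalize(text_emb, dim=1)     # (N, Dt)

    def knn_mask(emb):
        sim = emb @ emb.t()
        sim.fill_diagonal_(-float("inf"))        # exclude self-similarity
        idx = sim.topk(k, dim=1).indices
        mask = torch.zeros_like(sim, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        return mask

    # Keep only neighbor pairs that agree in both directions and both modalities.
    mv, mt = knn_mask(v), knn_mask(t)
    mutual = (mv & mv.t()) & (mt & mt.t())

    one_hot = F.one_hot(given_labels, num_classes).float()
    counts = mutual.float().sum(dim=1, keepdim=True)
    # Average the neighbors' labels; fall back to the given label if no consensus.
    neighbor_avg = torch.where(counts > 0,
                               (mutual.float() @ one_hot) / counts.clamp(min=1),
                               one_hot)
    return 0.5 * one_hot + 0.5 * neighbor_avg    # blend with the original annotation
```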