Improving Speech-related Facial Action Unit Recognition by Audiovisual Information Fusion
Tuesday, February 27, 2018 - 08:00 am
Meeting room 2267, Innovation Center
DISSERTATION DEFENSE
Zibo Meng
Advisor: Dr. Yan Tong
Abstract
In spite of the great progress achieved on posed facial displays and controlled image acquisition, the performance of facial action unit (AU) recognition degrades significantly for spontaneous facial displays. Recognizing AUs that accompany speech is even more challenging: they are generally activated at low intensity, with subtle appearance and geometric changes, and, more importantly, they often introduce ambiguity in detecting other co-occurring AUs, e.g., by producing non-additive appearance changes. Current AU recognition systems use information extracted only from the visual channel. However, sound is highly correlated with the visual channel in human communication. We therefore propose to exploit both audio and visual information for AU recognition.
First, a feature-level fusion method combining audio and visual features is introduced: features are independently extracted from the visual and audio channels, temporally aligned to handle the difference in time scales and the time shift between the two signals, and then integrated via feature-level fusion for AU recognition. Second, a novel approach is developed that recognizes speech-related AUs exclusively from audio signals, based on the fact that facial activities are highly correlated with the voice during speech: dynamic and physiological relationships between AUs and phonemes are modeled through a continuous time Bayesian network (CTBN), and AU recognition is performed by probabilistic inference over the CTBN model. Third, a novel audiovisual fusion framework is developed that aims to make the best use of both visual and acoustic cues in recognizing speech-related facial AUs. In particular, a dynamic Bayesian network (DBN) is employed to explicitly model the semantic and dynamic physiological relationships between AUs and phonemes, as well as measurement uncertainty. AU recognition is then conducted by probabilistic inference over the DBN model.
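As an illustration of the feature-level fusion step only (a minimal sketch, not the CTBN or DBN models described above), the Python snippet below resamples an audio feature sequence onto the visual frame times, compensates an assumed fixed time shift, and concatenates the two streams for a per-frame AU detector. The frame rates, lag value, feature dimensions, and choice of classifier are all hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def align_and_fuse(visual_feats, audio_feats, visual_fps=30.0,
                   audio_fps=100.0, audio_lag_s=0.0):
    """Resample audio features onto the visual frame timestamps and
    concatenate per frame (hypothetical rates and lag)."""
    t_visual = np.arange(visual_feats.shape[0]) / visual_fps
    t_audio = np.arange(audio_feats.shape[0]) / audio_fps + audio_lag_s
    # Linearly interpolate each audio feature dimension at the visual timestamps.
    audio_on_visual = np.column_stack([
        np.interp(t_visual, t_audio, audio_feats[:, d])
        for d in range(audio_feats.shape[1])
    ])
    return np.hstack([visual_feats, audio_on_visual])

# Toy usage with random features standing in for visual (e.g., appearance)
# and audio (e.g., spectral) descriptors -- purely synthetic data.
rng = np.random.default_rng(0)
visual = rng.normal(size=(300, 59))    # 10 s of video at 30 fps
audio = rng.normal(size=(1000, 13))    # 10 s of audio frames at 100 fps
labels = rng.integers(0, 2, size=300)  # per-frame activation of a single AU

fused = align_and_fuse(visual, audio)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)  # one binary AU detector
```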
To evaluate the proposed approaches, a pilot AU-coded audiovisual database was collected. Experiments on this dataset demonstrate that the proposed frameworks yield significant improvements in recognizing speech-related AUs compared to state-of-the-art vision-based methods. The improvement is even more pronounced for AUs whose visual observations are impaired during speech.

Abstract
A wide range of modern software-intensive systems (e.g., autonomous systems, big data analytics, robotics, deep neural architectures) are built to be configurable. These systems offer a rich space for adaptation to different domains and tasks. Developers and users often need to reason about the performance of such systems, making tradeoffs among specific quality attributes or detecting performance anomalies. For instance, developers of image recognition mobile apps are interested not only in learning which deep neural architectures are accurate enough to classify their images correctly, but also in which architectures consume the least power on the mobile devices on which they are deployed. Recent research has focused on models built from performance measurements obtained by instrumenting the system. The fundamental problem, however, is that the learning techniques for building a reliable performance model do not scale well, simply because the configuration space is exponentially large and impossible to explore exhaustively. For example, it would take over 60 years to explore the whole configuration space of a system with 25 binary options.
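To make the scale concrete: 25 independent binary options yield 2^25 ≈ 33.6 million configurations, so at roughly one minute per benchmark run (an assumed measurement cost, not stated in the abstract) an exhaustive sweep would indeed take on the order of 60 years. A back-of-the-envelope check in Python:

```python
# Back-of-the-envelope estimate; one minute per measurement is an assumption.
n_options = 25
configs = 2 ** n_options                 # 33,554,432 configurations
minutes_per_measurement = 1
total_minutes = configs * minutes_per_measurement
years = total_minutes / (60 * 24 * 365)  # ~63.8 years
print(f"{configs:,} configurations -> ~{years:.1f} years of measurement")
```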
In this talk, I will start by motivating the configuration space explosion problem based on my previous experience with large-scale big data systems in industry. I will then present my transfer learning solution to tackle the scalability challenge: instead of taking measurements from the real system, we learn the performance model using samples from cheap sources, such as simulators that approximate the performance of the real system with fair fidelity and at low cost. Results show that despite the high cost of measurement on the real system, learning performance models can become surprisingly cheap as long as certain properties are reused across environments. In the second half of the talk, I will present empirical evidence, which lays a foundation for a theory explaining why and when transfer learning works, by showing the similarities of performance behavior across environments. I will present observations of the impacts of environmental changes (such as changes to hardware, workload, and software versions) for a selected set of configurable systems from different domains to identify the key elements that can be exploited for transfer learning. These observations demonstrate a promising path toward building efficient, reliable, and dependable software systems. Finally, I will share my research vision for the next five years and outline my immediate plans to further explore the opportunities of transfer learning.
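As a purely illustrative sketch of the transfer learning idea (not the specific models used in this work), one can train a regression model on plentiful, cheap simulator measurements and then learn a simple correction from a handful of measurements on the real system. All function names, sample sizes, the synthetic response surfaces, and the linear-shift assumption below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def measure_simulator(cfgs):
    """Stand-in for cheap measurements on a simulator (synthetic response)."""
    return cfgs @ np.array([5.0, 2.0, -3.0, 1.0, 0.5]) + rng.normal(0, 0.1, len(cfgs))

def measure_real_system(cfgs):
    """Stand-in for costly real-system measurements, assumed here to be a
    roughly linear shift of the simulator's response."""
    return 1.8 * measure_simulator(cfgs) + 4.0 + rng.normal(0, 0.2, len(cfgs))

# Many cheap source samples, very few expensive target samples.
source_cfgs = rng.integers(0, 2, size=(1000, 5)).astype(float)
target_cfgs = rng.integers(0, 2, size=(10, 5)).astype(float)

source_model = RandomForestRegressor(random_state=0).fit(
    source_cfgs, measure_simulator(source_cfgs))

# Learn how source predictions map to target behavior from 10 real measurements.
shift = LinearRegression().fit(
    source_model.predict(target_cfgs).reshape(-1, 1),
    measure_real_system(target_cfgs))

def predict_target(cfgs):
    """Predict real-system performance for unseen configurations."""
    return shift.predict(source_model.predict(cfgs).reshape(-1, 1))
```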
Pooyan Jamshidi is a postdoctoral researcher at Carnegie Mellon University, where he works on transfer learning for building performance models to enable dynamic adaptation of mobile robotics software as part of BRASS, a DARPA-sponsored project. Prior to his current position, he was a research associate at Imperial College London, where he worked on Bayesian optimization for automated performance tuning of big data systems. He holds a Ph.D. from Dublin City University, where he worked on self-learning fuzzy control for auto-scaling in the cloud. He has spent 7 years in industry as a developer and a software architect. His research interests are at the intersection of software engineering, systems, and machine learning, and his focus lies predominantly in the areas of highly configurable and self-adaptive systems (more details:
Abstract:
The recent proliferation of acoustic devices, ranging from voice assistants to wearable health monitors, is leading to a sensing ecosystem around us -- referred to as the Internet of Acoustic Things, or IoAT. My research focuses on developing hardware-software building blocks that enable new capabilities for this emerging future. In this talk, I will sample some of my projects. For instance, (1) I will demonstrate carefully designed sounds that are completely inaudible to humans but recordable by all microphones. (2) I will discuss our work with physical vibrations from mobile devices, and how they conduct through finger bones to enable new modalities of short-range, human-centric communication. (3) Finally, I will draw attention to various acoustic leakages and threats that arise in sensor-rich environments. I will conclude this talk with a glimpse of my ongoing and future projects targeting a stronger convergence of sensing, computing, and communications in tomorrow's IoT, cyber-physical systems, and healthcare technologies.
Bio:

