Author : Xinyi Wu
Advisor : Dr. Song Wang
Date : June 7, 2022
Time : 9:00 am
Place : Virtual (Zoom link below)
Semantic segmentation is a long-standing computer vision task that remains extensively researched because of its importance to visual understanding and analysis. Its goal is to assign each pixel of an image to one of a set of pre-defined classes. In the era of deep learning, convolutional neural networks have largely improved the accuracy and efficiency of semantic segmentation. However, this success comes with two limitations: 1) a large-scale labeled dataset is required for training, while the labeling process for this task is labor-intensive and tedious; 2) the trained deep networks can achieve promising results when tested on the same domain (i.e., intra-domain test) but might suffer a large performance drop when tested on different domains (i.e., cross-domain test). Therefore, developing algorithms that can transfer knowledge from labeled source domains to unlabeled target domains is highly desirable to address these two limitations.
In this research, we explore three settings of cross-domain semantic segmentation, distinguished by the training data available in the target domain: 1) a single unlabeled target image, 2) multiple unlabeled target images, and 3) unlabeled target videos.
In the first part, we tackle the problem of one-shot unsupervised domain adaptation (OSUDA) for semantic segmentation, where the segmentor uses only one unlabeled target image during training. In this case, traditional unsupervised domain adaptation models usually fail to adapt to the target domain because they over-fit to the one (or few) unlabeled target samples. To address this problem, existing OSUDA methods usually integrate a style-transfer module to perform domain randomization based on the unlabeled target sample, with which multiple domains around the target sample can be explored during training. However, such a style-transfer module relies on an additional set of images as style references for pre-training and also increases the memory demand of domain adaptation. Here we propose a new OSUDA method that effectively relieves this computational burden by making full use of the sole target image in two ways: (1) implicitly stylizing the source domain at both the image and feature levels; (2) softly selecting the source training pixels. Experimental results on two commonly used synthetic-to-real scenarios demonstrate the effectiveness and efficiency of the proposed method.
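To make the idea of stylizing source data with a single target image more concrete, the sketch below matches per-channel mean and standard deviation of a source image to those of the target image. This is only one common way to transfer style statistics and is an assumption for illustration; the abstract does not specify how the proposed method performs its implicit stylization. The function names, the flat per-channel list layout, and the `eps` parameter are all hypothetical.

```python
from statistics import fmean, pstdev


def match_channel_stats(source, target, eps=1e-6):
    """Shift and scale `source` values so their mean/std match `target`.

    Hypothetical helper: `source` and `target` are flat lists holding the
    pixel (or feature) values of one channel.
    """
    s_mean, s_std = fmean(source), pstdev(source)
    t_mean, t_std = fmean(target), pstdev(target)
    # Normalize with source statistics, then re-scale with target statistics.
    return [(v - s_mean) / (s_std + eps) * t_std + t_mean for v in source]


def stylize(source_channels, target_channels):
    """Apply statistic matching independently to every channel."""
    return [
        match_channel_stats(s, t)
        for s, t in zip(source_channels, target_channels)
    ]


# Toy two-channel "images", each channel flattened to four values.
source_image = [[0.0, 2.0, 4.0, 6.0], [1.0, 1.0, 3.0, 3.0]]
target_image = [[10.0, 10.0, 14.0, 14.0], [5.0, 7.0, 5.0, 7.0]]

stylized = stylize(source_image, target_image)
```

After this step each stylized source channel carries approximately the target channel's mean and spread while keeping the source's spatial layout, so the source labels remain valid for training.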
Secondly, we work on the problem of nighttime semantic segmentation, which is as important as daytime segmentation in autonomous driving but is much more challenging and less studied because of poor illumination and the difficulty of human annotation. Our proposed solution employs adversarial training with a labeled daytime dataset and an unlabeled dataset that contains coarsely aligned day-night image pairs. The unlabeled daytime images from the target dataset serve as an intermediate domain to mitigate the difficulty of day-to-night adaptation, since they share similarities with the source in illumination pattern and contain the same static-category objects as their nighttime counterparts. Extensive experiments on the Dark Zurich and Nighttime Driving datasets show that our method achieves state-of-the-art performance for nighttime semantic segmentation.
Finally, we propose a domain adaptation method for video semantic segmentation, i.e., the target is in video format. Prior work pursued this goal by transferring knowledge from a source domain of self-labeled simulated videos to a target domain of unlabeled real-world videos. We argue that it is not necessary to use a labeled video dataset as the source, since the temporal continuity of video segmentation in the target domain can be estimated and enforced without reference to videos in the source domain. This motivates a new framework of Image-to-Video Domain Adaptive Semantic Segmentation (I2VDA), where the source domain is a set of images without temporal information. Under this setting, we bridge the domain gap via adversarial training based only on spatial knowledge, and develop a novel temporal augmentation strategy through which the temporal consistency in the target domain is exploited and learned. In addition, we introduce a new training scheme that leverages a proxy network to produce pseudo-labels on the fly, which is very effective in improving the stability of adversarial training. Experimental results on two synthetic-to-real scenarios show that the proposed I2VDA method can achieve even better performance on video semantic segmentation than existing state-of-the-art video-to-video domain adaptation approaches.
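As a rough illustration of producing pseudo-labels from a proxy network's output, the sketch below takes per-pixel class probabilities, keeps the most likely class where the network is confident, and marks the remaining pixels to be ignored. The confidence threshold, the ignore value of 255, and the function name are assumptions chosen for illustration; the abstract does not describe how the proposed training scheme filters or uses its pseudo-labels.

```python
IGNORE_LABEL = 255  # assumed marker for pixels excluded from the loss


def pseudo_labels(prob_map, threshold=0.9):
    """Turn per-pixel class probabilities into pseudo-labels.

    `prob_map` is a hypothetical H x W grid where each entry is a list of
    class probabilities predicted by the proxy network for one pixel.
    """
    labels = []
    for row in prob_map:
        out_row = []
        for probs in row:
            # Index of the most probable class for this pixel.
            best = max(range(len(probs)), key=probs.__getitem__)
            # Keep the label only when the prediction is confident enough.
            out_row.append(best if probs[best] >= threshold else IGNORE_LABEL)
        labels.append(out_row)
    return labels


# Toy 2 x 2 frame with three classes per pixel.
proxy_probs = [
    [[0.95, 0.03, 0.02], [0.40, 0.35, 0.25]],
    [[0.05, 0.92, 0.03], [0.10, 0.10, 0.80]],
]

labels = pseudo_labels(proxy_probs)
```

In this toy frame only the two pixels with a top probability of at least 0.9 receive a class label; the other two are set to the ignore value so that uncertain predictions do not contribute to training.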