Improving Object Detection in Adverse Weather Conditions
Object detection in adverse weather conditions remains a crucial challenge for autonomous vehicles. Existing works focus on CNN-based architectures for detecting objects in harsh weather, while attention-based methods have seldom been explored. In this paper, we propose a novel conditional transformer-based architecture that leverages an offline-computed memory bank of rich feature embeddings (extracted from the Segment Anything Model) to refine adverse-weather feature embeddings (extracted from a Feature Pyramid Network (FPN)). This conditional block can be integrated with any feature extractor for any vision task. Within the scope of this paper, we show that our proposed architecture outperforms Faster R-CNN by 4.423 mAP. We conduct thorough experiments and ablations on the BDD100K dataset to analyze the performance and significance of each component of our architecture.
Proposed Method:
- Memory Bank: For every image, we iterate over its objects and generate an embedding for each one by masking out everything but that object. We store these embeddings in a temporary memory bank, then use K-means clustering to reduce its size, representing each object class in the dataset by M representative feature embeddings. This procedure is applied only to the clear images in the BDD100K dataset.
- Conditional Transformer: The conditional module processes adverse-weather feature embeddings extracted from the FPN backbone. These embeddings are reshaped and then refined using the top-k representative embeddings selected from the memory bank by cosine similarity. Retrieving k embeddings rather than a single one accounts for the presence of multiple object classes within the embeddings. The module uses transformer self-attention and cross-attention to integrate and enhance the features. The outputs are structured so that the embeddings from self-attention align with those refined by cross-attention, yielding a representation better suited to downstream tasks.
- Integration with Faster R-CNN: Faster R-CNN consists of an FPN backbone, a Region Proposal Network (RPN) head, and a Region of Interest (ROI) head. In this work, the conditional module refines intermediate feature embeddings from the FPN before they are passed to the RPN and ROI heads for classification and regression. The intermediate embeddings to refine were selected based on empirical analysis.
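The memory-bank construction and the retrieval step described above can be illustrated with a minimal, dependency-free Python sketch. It is not the implementation used in this work. The toy 3-dimensional vectors stand in for SAM and FPN embeddings, the K-means variant (Lloyd's algorithm with squared Euclidean distance) is an assumption, and the cross-attention step is a single head without the learned projections a real transformer layer would have. All function names (`build_memory_bank`, `top_k`, `cross_attend`) are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, m, iters=20, seed=0):
    """Lloyd's k-means (squared Euclidean distance); returns m centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, m)]
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for v in vectors:
            nearest = min(
                range(m),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(v, centroids[j])),
            )
            groups[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
    return centroids


def build_memory_bank(embeddings_by_class, m):
    """Compress each class's clear-image embeddings to at most m representatives."""
    bank = []
    for label, vecs in embeddings_by_class.items():
        for centroid in kmeans(vecs, min(m, len(vecs))):
            bank.append((label, centroid))
    return bank


def top_k(query, bank, k):
    """Return the k bank entries most cosine-similar to the query embedding."""
    return sorted(bank, key=lambda entry: cosine(query, entry[1]), reverse=True)[:k]


def cross_attend(query, retrieved):
    """Single-head scaled dot-product attention over the retrieved embeddings.

    Assumption: no learned query/key/value projections, for illustration only.
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(d) for _, vec in retrieved
    ]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * vec[i] for w, (_, vec) in zip(weights, retrieved)) for i in range(d)]


# Toy stand-ins for per-object embeddings from clear images (hypothetical values).
clear_embeddings = {
    "car": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.1, 0.2, 0.0], [0.8, 0.1, 0.1]],
    "person": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.1, 0.2], [0.1, 0.8, 0.1]],
}
bank = build_memory_bank(clear_embeddings, m=2)

# A degraded embedding that still leans toward the "car" direction.
adverse = [0.7, 0.2, 0.1]
retrieved = top_k(adverse, bank, k=2)
refined = cross_attend(adverse, retrieved)
```

In this toy setup the two retrieved entries both come from the "car" class, and the refined vector is a weighted average of them. In the actual architecture the refinement is performed by learned self-attention and cross-attention layers on FPN feature maps, so the sketch only conveys the data flow: cluster offline, retrieve by cosine similarity, then attend.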