H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving (AAAI 2025)
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges.
In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving.
However, videos of such dynamic scenes often contain complex spatio-temporal movements,
which restricts the generalization capacity of existing MLLMs in this field.
To bridge this gap,
we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework for multi-modal video understanding in autonomous driving.
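To make the idea of hierarchical temporal adaptation more concrete, the snippet below is a minimal, illustrative sketch only, not the released H-MBA implementation: it models video frame features at several temporal resolutions and fuses them into one context feature per frame. A plain GRU stands in for the Mamba blocks, and all names, dimensions, and strides are assumptions made for illustration.

```python
# Illustrative sketch only: hierarchical temporal modelling of frame features
# at multiple temporal resolutions, then fusion back to one feature per frame.
# A GRU stands in for Mamba blocks; this is NOT the official H-MBA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalTemporalAdapter(nn.Module):
    def __init__(self, dim: int = 768, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        # One temporal branch per resolution (Mamba SSM blocks in the paper).
        self.branches = nn.ModuleList([nn.GRU(dim, dim, batch_first=True) for _ in strides])
        self.fuse = nn.Linear(dim * len(strides), dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) features from a frozen visual encoder.
        outs = []
        for stride, branch in zip(self.strides, self.branches):
            sub = frame_feats[:, ::stride]            # coarser temporal resolution
            ctx, _ = branch(sub)                      # temporal context at this scale
            # Upsample back to the full frame count before fusion.
            ctx = F.interpolate(ctx.transpose(1, 2), size=frame_feats.size(1),
                                mode="linear", align_corners=False).transpose(1, 2)
            outs.append(ctx)
        return self.fuse(torch.cat(outs, dim=-1))     # (batch, num_frames, dim)


if __name__ == "__main__":
    feats = torch.randn(2, 16, 768)                   # 2 clips, 16 frames each
    adapter = HierarchicalTemporalAdapter()
    print(adapter(feats).shape)                       # torch.Size([2, 16, 768])
```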
Traditional models are often limited to predefined questions, which restricts their application in open-world situations. Benefiting from the powerful reasoning capabilities of LLMs, our method exhibits good generalization ability and can be directly applied to real-world scenarios for simple question-answering conversations in a unified paradigm. For example, it can provide driving-risk warning alerts, as shown in the image below. However, the model's responses remain highly dependent on the examples in the training data. In real-world applications, we often encounter long-tail cases that are not covered by the training set, such as cargo suddenly dropped from a vehicle ahead, road obstacles, or animals unexpectedly crossing the path. In such situations, the model often fails to make correct judgments.
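To show what this open-ended, video-grounded question answering might look like in practice, here is a hypothetical usage sketch. The module name `hmba`, the function `load_pretrained`, the checkpoint id, and the `chat` method are placeholder names assumed for illustration; the actual interface will be defined by the released code.

```python
# Hypothetical usage sketch: all names below (hmba, load_pretrained, chat,
# "h-mba-base") are placeholders for illustration, not the released interface.
from hmba import load_pretrained  # hypothetical module name

model = load_pretrained("h-mba-base")  # hypothetical checkpoint id

# Open-ended question answering over a driving video clip.
answer = model.chat(
    video="demo/front_camera.mp4",  # example clip path
    question="Is there any risk for the ego vehicle in the next few seconds?",
)
print(answer)  # e.g. a warning about a pedestrian crossing ahead
```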
We explore solutions to this problem in our next paper. The model and code will be available soon.