
A Resource Efficient Fusion Network for Object Detection in Bird’s-Eye View using Camera and Raw Radar Data

tue-mps/refnet

REFNET: Resource Efficient Fusion Network for Object Detection in Bird’s-Eye View

News

  • (2024/07/10) Accepted to IEEE ITSC 2024!
  • (2024/11/20) Paper

Overview

The image processing pipeline first transforms the camera image into Bird’s-Eye View (BEV). Subsequently, the resultant BEV undergoes conversion into Polar representation, directly mapping to the Range-Azimuth (RA) image. Object detection is performed on RA image features fused with radar features from the radar decoder. The predictions obtained in the RA view are shown in the camera images with ground-truth bounding boxes in green and predictions in blue.

Abstract

Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems as they can withstand adverse weather conditions unlike cameras. However, radar point clouds are sparser with low azimuth and elevation resolution that lack semantic and structural information of scenes, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, at first, we transform the camera images to Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, recovered from RD spectrum input from the radar decoder to perform object detection. We evaluate our fusion strategy with other existing methods not only in terms of accuracy, but also on computational complexity.

Fusion Architecture

The camera-only and radar-only encoders each contain a pre-encoder block and four ResNet-50-like blocks. The features from these five blocks are named x0, x1, x2, x3, and x4. The thick blue curved arrow carries the encoder's output to the decoder's input, where the feature maps are expanded to higher resolutions. The dotted lines represent the skip connections used to preserve spatial information. The features from the camera-only decoder and the radar-only decoder are then fused and passed to the detection head, which predicts the objects in the Bird's-Eye RA Polar view.

The models are trained and tested on the RADIal dataset. The dataset can be downloaded here. Under resources/gen_polarimage, we provide functions that help in converting front-facing camera images to BEV Cartesian and subsequently to BEV Polar.
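As a rough illustration of the BEV Cartesian to BEV Polar step, the sketch below resamples a Cartesian grid into range-azimuth bins with nearest-neighbour lookup. It is not the code under resources/gen_polarimage: the function name `bev_cartesian_to_polar`, its parameters, and the grid convention (ego vehicle at the bottom centre of the image, forward pointing up) are all assumptions made for this example.

```python
import math


def bev_cartesian_to_polar(bev, cell_m, num_range_bins, num_azimuth_bins,
                           max_range_m, fov_deg, fill=0):
    """Resample a Cartesian BEV grid (list of rows) into a range-azimuth grid.

    Assumed convention (hypothetical, not taken from the repository):
    the ego vehicle sits at the bottom centre of `bev`, rows run forward
    (upwards in the image), columns run laterally, and each cell covers
    `cell_m` metres.
    """
    height = len(bev)
    width = len(bev[0])
    polar = [[fill] * num_azimuth_bins for _ in range(num_range_bins)]
    for r_idx in range(num_range_bins):
        # Range at the centre of this bin, in metres.
        rng = (r_idx + 0.5) * max_range_m / num_range_bins
        for a_idx in range(num_azimuth_bins):
            # Azimuth at the centre of this bin, spanning [-fov/2, +fov/2].
            az = math.radians(((a_idx + 0.5) / num_azimuth_bins - 0.5) * fov_deg)
            x = rng * math.sin(az)  # lateral offset, positive to the right
            y = rng * math.cos(az)  # forward distance
            col = int(math.floor(x / cell_m + width / 2))
            row = height - 1 - int(math.floor(y / cell_m))
            # Cells that fall outside the Cartesian grid keep the fill value.
            if 0 <= row < height and 0 <= col < width:
                polar[r_idx][a_idx] = bev[row][col]
    return polar
```

Each row of the returned grid corresponds to one range bin and each column to one azimuth bin, which is the layout of an RA image. A real pipeline would likely use interpolation rather than nearest-neighbour lookup.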

Results

Vehicle detection performance on the RADIal dataset test split. RD, ADC, RPC, RT, and C denote Range-Doppler, Analog-to-Digital Converter signal, Radar Point Cloud, Range-Time signal, and Camera data, respectively. Best values are in bold and second-best values are underlined. †: reimplemented with only the detection head, as the original models are multi-task. Missing values are indicated by "-", either because the code is unavailable or because the value is not reported in the respective work.

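| Model | AP (%) | AR (%) | F1 (%) | RE (m) | AE (°) | # (M) | FPS | σ | Size (MB) | GPU cost (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| FFTRadNet | 93.45 | 83.35 | 88.11 | 0.12 | 0.15 | 3.23 | 68.46 | 2.19 | 39.2 | 2.01 |
| TFFTRadNet | 90.80 | 88.31 | 89.54 | 0.15 | 0.13 | 9.08 | 54.37 | 4.28 | 109.5 | 2.04 |
| ADCNet | 95 | 89 | 91.9 | 0.13 | 0.11 | - | - | - | - | - |
| CMS | 96.9 | 83.49 | 89.69 | 0.45 | - | 7.7 | - | - | - | - |
| ROFusion | 91.13 | 95.29 | 93.16 | 0.13 | 0.21 | 3.33 | 56.11 | 1.55 | 87.2 | 2.87 |
| EchoFusion | 96.95 | 93.43 | 95.16 | 0.12 | 0.18 | 25.61 | - | - | 102.5 | - |
| Ours | 95.75 | 91.35 | 93.49 | 0.11 | 0.09 | 6.58 | 58.91 | 1.28 | 79.8 | 2.06 |
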
  • AP: Average Precision (%)
  • AR: Average Recall (%)
  • F1: F1 Score (%)
  • RE: Range Error (meters)
  • AE: Angle Error (degrees)
  • #: Number of Parameters (in Millions)
  • FPS: Frames Per Second. For a given model, an FPS value is computed for each frame in the test set, and the values are averaged.
  • σ: Standard deviation of the per-frame FPS values.
  • Size: Size of the model in MB.
  • GPU cost: GPU memory consumption during inference, in GB.
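
The F1 values in the table appear consistent, to within rounding, with the harmonic mean of the AP and AR columns. The sketch below checks this on two rows. The function name `f1_score` is chosen for this example, and the formula is an assumption inferred from the numbers rather than something taken from the evaluation script.

```python
def f1_score(precision_pct, recall_pct):
    """Harmonic mean of precision and recall, both given in percent."""
    total = precision_pct + recall_pct
    if total == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / total


# AP and AR values copied from the FFTRadNet and ROFusion rows of the table.
print(round(f1_score(93.45, 83.35), 2))  # 88.11
print(round(f1_score(91.13, 95.29), 2))  # 93.16
```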

Setting up the virtual environment

Requirements

All code was tested in the following environment:

  • Linux (tested on Ubuntu 22.04)
  • Python 3.9

Installation

  1. Clone the repo and set up the conda environment:
$ git clone "this repo"
$ conda create --prefix "your_path" python=3.9 -y
$ conda update -n base -c defaults conda
$ source activate "your_path"
  2. Install the following packages:
$ conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
$ pip install -U pip
$ pip3 install pkbar
$ pip3 install tensorboard
$ pip3 install pandas
$ pip3 install shapely
$ pip3 install opencv-python
$ pip3 install einops
$ pip3 install timm
$ pip3 install scipy
$ pip3 install scikit-learn
$ pip3 install polarTransform
$ pip3 install matplotlib
$ pip3 install numpy==1.23
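
After installation, a short script such as the following can report which of the packages above are visible to the active Python environment. It is a convenience sketch and not part of the repository. The helper name `installed_versions` is made up for this example, and the distribution names in `PACKAGES` (for instance `torch` for the conda `pytorch` package) are assumptions that may need adjusting.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to match the install commands above.
PACKAGES = [
    "torch", "torchvision", "torchaudio", "pkbar", "tensorboard",
    "pandas", "shapely", "opencv-python", "einops", "timm", "scipy",
    "scikit-learn", "polarTransform", "matplotlib", "numpy",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver if ver is not None else 'NOT FOUND'}")
```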

Running the code

Training

Please run the following to train the model. Camera-Radar fusion in BEV for object detection will be chosen by default.

$ python 1-Train.py

Evaluation

To evaluate the model performance, please load the trained model and run:

$ python 2-Evaluation.py

Testing

To obtain qualitative results, please load the trained model and run:

$ python 3-Test.py

Computational complexity

To compute Frames Per Second (FPS), please load the trained model and run:

$ python 4-FPS.py
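
The per-frame FPS averaging described in the Results section can be sketched as follows. This is not the content of 4-FPS.py: `fps_stats`, `measure_fps`, and the `infer` callable are hypothetical names, and the use of the population standard deviation for σ is an assumption.

```python
import statistics
import time


def fps_stats(latencies_s):
    """Return (mean FPS, standard deviation of FPS) from per-frame latencies in seconds."""
    fps_values = [1.0 / latency for latency in latencies_s]
    # Population standard deviation; whether the reported sigma uses this
    # or the sample standard deviation is an assumption.
    return statistics.mean(fps_values), statistics.pstdev(fps_values)


def measure_fps(infer, frames):
    """Time `infer(frame)` for every frame and summarise the per-frame FPS values."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        # On a GPU, the device should be synchronised before reading the clock
        # (for example with torch.cuda.synchronize()), otherwise only the
        # kernel launch time is measured.
        latencies.append(time.perf_counter() - start)
    return fps_stats(latencies)
```

For example, `measure_fps(lambda frame: time.sleep(0.01), range(20))` returns a mean a little below 100 FPS. Warm-up iterations are usually discarded before timing a real model.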

Further research

  • Although we focus only on object detection, the model could be extended with a segmentation head. We leave this open to the community and may pursue it ourselves.
  • We have also proposed an early fusion architecture that takes camera images and point cloud data from imaging radar and lidar sensors in perspective view. Our code can be extended for further analysis. All parameters can be found in our configuration file, which is available here: config/config_allmodality.json.
  • We plan to further this research using the K-Radar dataset.

Acknowledgments

  • Thanks to Elektrobit Automotive GmbH and the Mobile Perception Systems Lab at Eindhoven University of Technology for their continuous support.
  • This project is inspired by the RADIal codebase.

License

The repo is released under the BSD 3-Clause License.
