Kenneth Blomqvist1, Lionel Ott1, Jen Jen Chung1, 2, Roland Siegwart1
1Autonomous Systems Lab, ETH Zurich, Switzerland
2School of ITEE, University of Queensland, Australia
A key limitation of NeRF-based semantic segmentation approaches is that all optimization is done on a per-scene basis, making no use of information learned outside of that scene. In this project, we aimed to improve the label-efficiency of NeRF-based semantic segmentation by leveraging image features learned on large and diverse datasets.
We use a feature extractor, such as DINO, to extract semantically meaningful features from images and bake these into the NeRF scene representation by rendering features as an additional output modality and supervising on the extracted features.
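To illustrate the idea of rendering features as an additional output modality, here is a minimal NumPy sketch of alpha-compositing per-sample feature vectors along a ray with standard NeRF volume-rendering weights, then supervising the rendered feature against a target feature. All names, shapes, and the random stand-in values are illustrative assumptions, not our actual implementation.

```python
import numpy as np

def render_along_ray(weights, per_sample_values):
    """Alpha-composite per-sample values (color or features) along one ray.

    weights: (num_samples,) volume-rendering weights, summing to <= 1
    per_sample_values: (num_samples, dim) values predicted at each sample
    """
    return weights @ per_sample_values  # (dim,)

rng = np.random.default_rng(0)
num_samples, feat_dim = 64, 384  # e.g. DINO ViT-S features are 384-d

# Hypothetical scene-network outputs at samples along one ray.
densities = rng.uniform(0.0, 2.0, num_samples)
features = rng.normal(size=(num_samples, feat_dim))

# Standard NeRF compositing weights from densities and segment lengths.
deltas = np.full(num_samples, 0.01)
alphas = 1.0 - np.exp(-densities * deltas)
transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
weights = alphas * transmittance

rendered_feature = render_along_ray(weights, features)

# Supervision: squared-error loss against the feature the 2D extractor
# produced at the pixel this ray corresponds to.
target_feature = rng.normal(size=feat_dim)
feature_loss = np.mean((rendered_feature - target_feature) ** 2)
```

In a real pipeline the same compositing weights would come from the density branch of the NeRF and be shared between the color, semantics, and feature heads, so the feature loss also shapes the underlying geometry.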
Here is an example of what such extracted DINO features look like when mapped to RGB colors using PCA.
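The PCA-to-RGB visualization used for such figures can be sketched as follows: project each pixel's feature vector onto the top three principal components of the feature map, then rescale those components to [0, 1] as color channels. The function name and the random stand-in features are assumptions for illustration.

```python
import numpy as np

def features_to_rgb(feature_map):
    """Project a (H, W, D) feature map to (H, W, 3) via PCA for display."""
    h, w, d = feature_map.shape
    flat = feature_map.reshape(-1, d)
    flat = flat - flat.mean(axis=0)
    # Top-3 principal directions via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T  # (H*W, 3)
    # Normalize each channel to [0, 1] so it can be shown as RGB.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-8)
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
fake_features = rng.normal(size=(32, 32, 384))  # stand-in for DINO output
rgb = features_to_rgb(fake_features)
```

Pixels with similar features receive similar colors, which is why semantically coherent regions show up as uniformly colored blobs in the visualization.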
Our results show that using such features significantly improves both semantic label-efficiency and overall accuracy, even when abundant semantic labels are available.
Here are some example outputs from our method on real-world scenes. In the following videos, the top left shows the original RGB input, the top right shows inferred depth, the bottom left shows the produced semantic segmentation maps, and the bottom right visualizes the reconstructed DINO features.
Code and instructions on how to run it are available here.