Neural Implicit Vision-Language Feature Fields are an approach to dense scene understanding: a neural implicit scene representation augmented with vision-language feature outputs that can be correlated with text prompts, enabling zero-shot open-vocabulary scene segmentation. In short, you type in the thing you are looking for, and the relevant parts of the scene are highlighted.
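
To make the text-prompt correlation concrete, here is a minimal sketch (not the paper's implementation) of how open-vocabulary querying can work: an implicit field maps 3D points to vision-language features, which are scored against an encoded text prompt by cosine similarity. The `FeatureField` MLP, `segment_by_prompt` helper, feature dimension, and threshold are all illustrative assumptions; the text embedding would come from a CLIP-style text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 512  # assumed to match the text encoder's embedding size


class FeatureField(nn.Module):
    """Toy stand-in for a neural implicit field: 3D point -> vision-language feature."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, FEATURE_DIM),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # Normalize so features live on the unit sphere, like CLIP embeddings.
        return F.normalize(self.mlp(xyz), dim=-1)


def segment_by_prompt(field: FeatureField,
                      points: torch.Tensor,
                      text_embedding: torch.Tensor,
                      threshold: float = 0.25) -> torch.Tensor:
    """Return a boolean mask over `points` whose features match the text prompt."""
    with torch.no_grad():
        feats = field(points)                       # (N, FEATURE_DIM)
        text = F.normalize(text_embedding, dim=-1)  # (FEATURE_DIM,)
        scores = feats @ text                       # cosine similarity per point
    return scores > threshold


# Usage with random stand-ins for real data:
field = FeatureField()
points = torch.rand(10_000, 3)        # 3D points sampled from the scene
text_emb = torch.randn(FEATURE_DIM)   # would come from a CLIP-style text encoder
mask = segment_by_prompt(field, points, text_emb)
print(f"{mask.sum().item()} of {len(points)} points match the prompt")
```

Because the per-point features are only compared against the text embedding at query time, the same trained field can answer arbitrary prompts without retraining, and the dot-product scoring batches cheaply over large point sets.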

The scene representation can be built up incrementally, runs in real time on real hardware, handles millions of semantic point queries per second, and works with arbitrary text queries. See here for the research paper.