Visualizing features with PCA.
TL;DR: Splat and Distill is a semi-self-supervised framework designed to instill 3D awareness into Vision Foundation Models (VFMs) like DINOv2 by enforcing geometric consistency.
Our pipeline initializes a student and a teacher model from a pretrained VFM. We use feed-forward Gaussian Splatting to reconstruct a 3D scene from a set of context views, embedding it with 2D features extracted from the teacher. By "lifting" these features into a 3D representation and splatting them onto novel target views, we generate a 3D-aware supervisory signal. The student is then trained to predict features that match these projections. As in other Self-Supervised Learning (SSL) paradigms, the teacher is adaptively updated via an Exponential Moving Average (EMA) of the student's weights. We demonstrate that these fine-tuned features significantly improve performance on downstream tasks, such as semantic segmentation and depth estimation, across diverse datasets.
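At a high level, one training step looks roughly like the PyTorch-style sketch below. Here `lift_and_splat` is a hypothetical stand-in for the reconstruct-lift-render path described above, and the cosine loss is an assumption, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def train_step(student, teacher, optimizer, batch, lift_and_splat):
    # 1) Lift teacher features from the context views into 3D Gaussians and
    #    splat them to the target viewpoint (hypothetical helper, no gradients).
    with torch.no_grad():
        target_feats = lift_and_splat(teacher, batch["context_views"], batch["target_pose"])
    # 2) The student predicts features directly from the target image.
    student_feats = student(batch["target_image"])          # (B, C, H, W)
    # 3) Distill: match the student's features to the splatted, 3D-aware ones.
    loss = 1 - F.cosine_similarity(student_feats, target_feats, dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 4) The teacher follows the student via EMA, as in other SSL pipelines.
    ema_update(teacher, student)
    return loss.item()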
Abstract
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To address this, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing the slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts and creates a dynamic learning process in which the teacher's consistency improves alongside that of the student. Our method significantly outperforms prior works, achieving substantial gains in 3D awareness and enhancing the underlying semantic richness of 2D features.
Method Overview
Method Overview. Starting from the LHS, two context views $I_j^{\text{ctx}}$ are passed through a teacher network, producing two low-resolution 2D feature maps $F_j^{\text{ctx}}$. Using corresponding semantic masks, mask-aware upscaling (Sec. 3.1) produces 2D features $F_j^{\text{high}}$ at the input resolution.
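The exact upscaling procedure is defined in Sec. 3.1; as one illustrative (assumed) variant, the low-resolution features can be bilinearly upsampled and then pooled per semantic segment, so that high-resolution features respect mask boundaries:

import torch
import torch.nn.functional as F

def mask_aware_upscale(feats_low, seg_mask):
    """feats_low: (C, h, w) teacher features; seg_mask: (H, W) int64 segment ids."""
    H, W = seg_mask.shape
    feats_up = F.interpolate(feats_low[None], size=(H, W), mode="bilinear",
                             align_corners=False)[0]          # (C, H, W)
    C = feats_up.shape[0]
    flat = feats_up.reshape(C, -1)                            # (C, H*W)
    ids = seg_mask.reshape(-1)                                # (H*W,)
    n_seg = int(ids.max()) + 1
    # Per-segment mean feature via index_add (a scatter-mean).
    sums = flat.new_zeros(C, n_seg).index_add_(1, ids, flat)
    counts = flat.new_zeros(n_seg).index_add_(0, ids, torch.ones_like(flat[0]))
    means = sums / counts.clamp(min=1)                        # (C, n_seg)
    # Broadcast each segment's mean back to its pixels.
    return means[:, ids].reshape(C, H, W)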
In parallel, a pretrained feed-forward 3D reconstruction model predicts 3D Gaussian primitives $\{\mu_j, \Sigma_j, \alpha_j\}$ from the same context views $I_j^{\text{ctx}}$ (Sec. 3.2). The upscaled 2D feature maps $F_j^{\text{high}}$ are then lifted onto these 3D Gaussian primitives using 2D-3D correspondences, yielding a feature-augmented GS scene $\mathcal{G}_j \leftarrow \{\mu_j, \Sigma_j, \alpha_j\} \cup \{f_j\}$ (Sec. 3.1).
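If the reconstructor is pixel-aligned (one Gaussian predicted per context-view pixel, as in typical feed-forward splatting models), the 2D-3D correspondence is simply the pixel index, and lifting reduces to attaching each pixel's feature to its Gaussian. A sketch under that assumption:

import torch

def lift_features(gaussians_per_view, feats_high_per_view):
    """gaussians_per_view: list of dicts with 'mu' (N,3), 'sigma' (N,3,3), 'alpha' (N,);
    feats_high_per_view: list of (C, H, W) upscaled feature maps, with N == H*W each."""
    mus, sigmas, alphas, feats = [], [], [], []
    for g, f in zip(gaussians_per_view, feats_high_per_view):
        C, H, W = f.shape
        mus.append(g["mu"])
        sigmas.append(g["sigma"])
        alphas.append(g["alpha"])
        # Each Gaussian inherits the feature of the pixel it was predicted from.
        feats.append(f.reshape(C, H * W).T)                   # (H*W, C)
    # Union of all per-view primitives plus their lifted features.
    return {"mu": torch.cat(mus), "sigma": torch.cat(sigmas),
            "alpha": torch.cat(alphas), "f": torch.cat(feats)}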
Next, the scene is splatted to a target viewpoint, producing a 2D feature map, which is then blended with the semantic mask of the target view, resulting in 2D features $F_{\text{blend}}^{\text{tgt}}$ (Sec. 3.3). Concurrently, as shown on the RHS, the target image $I^{\text{tgt}}$ (corresponding to the rendered viewpoint) is passed through the student network to obtain its feature map $F_s^{\text{tgt}}$. $F_{\text{blend}}^{\text{tgt}}$ is then downscaled (bilinearly), producing a lower-resolution 2D feature map that is compared to $F_s^{\text{tgt}}$ to supervise the student via a distillation loss (Sec. 3.4). The teacher's weights are updated as an EMA of the student's weights. Note that SnD is fine-tuned on ScanNet++.
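A sketch of this supervision path, assuming an external feature rasterizer render_features(scene, pose) -> (C, H, W) (e.g. an alpha-blending Gaussian splatter extended to C-dimensional features, not shown here) and reusing per-segment mean pooling as a stand-in for the blending of Sec. 3.3:

import torch
import torch.nn.functional as F

def segment_mean(feats, seg_mask):
    # Per-segment mean pooling (same trick as in the mask-aware upscaling sketch).
    C = feats.shape[0]
    flat, ids = feats.reshape(C, -1), seg_mask.reshape(-1)
    n = int(ids.max()) + 1
    sums = flat.new_zeros(C, n).index_add_(1, ids, flat)
    cnt = flat.new_zeros(n).index_add_(0, ids, torch.ones_like(flat[0]))
    return (sums / cnt.clamp(min=1))[:, ids].reshape(feats.shape)

def target_supervision(scene, target_pose, target_mask, render_features, out_hw):
    # Splat the feature-augmented scene to the target viewpoint.
    feat_tgt = render_features(scene, target_pose)                 # (C, H, W)
    # Blend with the target view's semantic mask (per-segment pooling here
    # is an assumption standing in for the paper's blending step).
    feat_blend = segment_mean(feat_tgt, target_mask)
    # Downscale bilinearly to the student's feature resolution.
    return F.interpolate(feat_blend[None], size=out_hw, mode="bilinear",
                         align_corners=False)[0]

def distill_loss(student_feats, blended_feats):
    # Cosine distillation along the channel dimension; the paper's exact
    # objective (Sec. 3.4) may differ.
    return 1 - F.cosine_similarity(student_feats, blended_feats.detach(), dim=0).mean()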
Qualitative Results
We evaluate our method's ability to enhance the 3D awareness and semantic representation of DINOv2 features through downstream tasks including semantic segmentation, depth estimation, surface normal prediction, and multi-view correspondence.
While fine-tuned solely on indoor ScanNet++ scenes, our model generalizes effectively to out-of-domain datasets such as KITTI, ADE20K, and Pascal VOC. Our approach captures finer structural details and produces cleaner, less noisy results than the baseline, maintaining smoothness and consistency across both indoor and outdoor environments.
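For reference, feature visualizations like the PCA teaser above are typically produced by projecting per-patch features onto their top three principal components and mapping those to RGB. A generic recipe (not necessarily the paper's exact script):

import torch

def pca_rgb(feats):
    """feats: (C, H, W) patch features -> (H, W, 3) tensor in [0, 1] for display."""
    C, H, W = feats.shape
    x = feats.reshape(C, -1).T                      # (H*W, C), one row per patch
    x = x - x.mean(dim=0, keepdim=True)
    # Top-3 principal directions via low-rank PCA.
    _, _, v = torch.pca_lowrank(x, q=3)             # v: (C, 3)
    proj = x @ v                                    # (H*W, 3)
    # Normalize each component to [0, 1] so it can be shown as an RGB channel.
    lo = proj.amin(dim=0, keepdim=True)
    hi = proj.amax(dim=0, keepdim=True)
    proj = (proj - lo) / (hi - lo + 1e-8)
    return proj.reshape(H, W, 3)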
BibTeX
@inproceedings{shavin2026splat,
  title={Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation},
  author={Shavin, David and Benaim, Sagie},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2602.06032},
  eprint={2602.06032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}