1 Harbin Institute of Technology
2 State Key Laboratory of Robotics and System
3 University of Liverpool
† corresponding author
We propose ReMake, a Relative depth and Mask attention guided depth completion framework. It predicts a complete depth map from a single RGB-D image, enabling accurate extraction of target object points and 6-DoF grasp prediction.
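To make the "target object points extraction" step above concrete, the sketch below back-projects a completed depth map into a point cloud restricted to the instance mask, using a standard pinhole camera model. It is an illustrative example, not the released code; the intrinsics (fx, fy, cx, cy) and the mask format are assumptions.

```python
# Minimal sketch: back-project a completed depth map into a target-object
# point cloud with a pinhole camera model (illustrative, not the authors' code).
import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """depth: (H, W) completed depth in meters; mask: (H, W) bool mask of the target object."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = mask & (depth > 0)            # keep only masked pixels with valid depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx          # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)   # (N, 3) points in the camera frame
```

The resulting points can then be passed to a grasp predictor to estimate 6-DoF grasp poses for the transparent object.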
Transparent objects are common in daily life. Unlike with opaque objects, light emitted from depth cameras is often refracted or reflected at transparent surfaces, resulting in missing or erroneous depth values; only regions where the reflected light returns to the camera yield valid depth. The mixture of missing, incorrect, and valid depth introduces significant noise, hindering accurate perception and reliable grasping. However, most existing methods feed only the RGB-D image into the network, requiring the model to implicitly distinguish reliable from unreliable depth. While effective on training datasets, these methods struggle to generalize to real-world scenes because complex light interactions make the distribution of valid and invalid depth highly variable. This motivates us to rethink the effectiveness of current training strategies. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and relative depth. The mask explicitly distinguishes transparent regions from non-transparent ones, enabling the model to focus during training on learning to predict accurate depth in transparent areas from RGB-D input. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, the relative depth map encodes spatial relationships between the transparent object and its surroundings, further improving prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability.
The instance mask and relative depth map are generated via instance segmentation and monocular depth estimation. The RGB image, concatenated with the mask, is encoded with a Transformer. The relative depth and the original depth are encoded separately, and all features are fused and decoded to predict the complete depth map. The target object's point cloud is then extracted from the completed depth using the instance mask. A sketch of this data flow is given below.
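The following PyTorch sketch illustrates the data flow just described: an RGB+mask branch encoded by a Transformer, separate encoders for the relative and raw depth, and a fusion-plus-decoding head that outputs the completed depth. It is a hypothetical reconstruction under assumed layer choices (patch embedding size, channel widths, convolutional depth encoders), not the released ReMake architecture.

```python
# Illustrative sketch of the described pipeline (not the released implementation).
import torch
import torch.nn as nn

class ReMakeSketch(nn.Module):
    def __init__(self, dim=256, patch=16):
        super().__init__()
        # RGB (3 ch) + instance mask (1 ch) -> patch tokens -> Transformer encoder
        self.rgb_mask_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        self.rgb_mask_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        # Separate (assumed convolutional) encoders for relative depth and raw depth
        self.rel_depth_enc = nn.Sequential(nn.Conv2d(1, dim, patch, patch), nn.ReLU())
        self.raw_depth_enc = nn.Sequential(nn.Conv2d(1, dim, patch, patch), nn.ReLU())
        # Fuse all features and decode back to a dense depth map
        self.fuse = nn.Conv2d(3 * dim, dim, 1)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, 1, 3, padding=1))

    def forward(self, rgb, mask, rel_depth, raw_depth):
        x = self.rgb_mask_embed(torch.cat([rgb, mask], dim=1))    # (B, dim, H/p, W/p)
        b, c, h, w = x.shape
        tokens = self.rgb_mask_enc(x.flatten(2).transpose(1, 2))  # (B, HW/p^2, dim)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        feats = torch.cat([x,
                           self.rel_depth_enc(rel_depth),
                           self.raw_depth_enc(raw_depth)], dim=1)
        return self.decoder(self.fuse(feats))                     # (B, 1, H, W) completed depth
```

In this sketch the mask is injected at the input so the network can attend to transparent regions explicitly, while the relative depth branch supplies the scene-level spatial context described in the abstract.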
@article{cheng2025rethinkingtransparentobjectgrasping,
title={Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask},
author={Yaofeng Cheng and Xinkai Gao and Sen Zhang and Chao Zeng and Fusheng Zha and Lining Sun and Chenguang Yang},
year={2025},
eprint={2508.02507},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.02507},
}
If you have any questions, please feel free to contact us: