1 Harbin Institute of Technology
2 State Key Laboratory of Robotics and System
3 University of Liverpool
† corresponding author
We propose ReMake, a Relative depth and Mask attention guided depth completion framework. It predicts a complete depth map from a single RGB-D image, enabling accurate extraction of target object points and 6-DoF grasp prediction.
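To make the "target object points extraction" step above concrete, the sketch below back-projects a completed depth map into a point cloud restricted to the instance mask, using a standard pinhole camera model. It is an illustrative example, not the released code; the intrinsics (fx, fy, cx, cy) and the mask format are assumptions.

```python
# Minimal sketch: back-project a completed depth map into a target-object
# point cloud with a pinhole camera model (illustrative, not the authors' code).
import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """depth: (H, W) completed depth in meters; mask: (H, W) bool mask of the target object."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = mask & (depth > 0)            # keep only masked pixels with valid depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx          # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)   # (N, 3) points in the camera frame
```

The resulting points can then be passed to a grasp predictor to estimate 6-DoF grasp poses for the transparent object.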
Transparent objects are common in daily life. Unlike with opaque objects, light emitted from depth cameras is often refracted or reflected at transparent surfaces, resulting in missing or erroneous depth values; only regions where the reflected light returns to the camera yield valid depth. The mixture of missing, incorrect, and valid depth introduces significant noise, hindering accurate perception and reliable grasping. However, most existing methods feed only the RGB-D image into the network, requiring the model to implicitly distinguish reliable from unreliable depth. While effective on training datasets, these methods struggle to generalize to real-world scenes because complex light interactions make the distribution of valid and invalid depth highly variable. This motivates us to rethink the effectiveness of current training strategies. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and relative depth. The mask explicitly distinguishes transparent regions from non-transparent ones, enabling the model to focus during training on learning to predict accurate depth in transparent areas from RGB-D input. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, the relative depth map encodes spatial relationships between the transparent object and its surroundings, further improving prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability.
The instance mask and relative depth map are generated via instance segmentation and monocular depth estimation. The RGB image, concatenated with the mask, is encoded with a Transformer. The relative depth and the original depth are encoded separately, and all features are fused and decoded to predict the complete depth map. The target object's point cloud is then extracted from the completed depth using the instance mask. A sketch of this data flow is given below.
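The following PyTorch sketch illustrates the data flow just described: an RGB+mask branch encoded by a Transformer, separate encoders for the relative and raw depth, and a fusion-plus-decoding head that outputs the completed depth. It is a hypothetical reconstruction under assumed layer choices (patch embedding size, channel widths, convolutional depth encoders), not the released ReMake architecture.

```python
# Illustrative sketch of the described pipeline (not the released implementation).
import torch
import torch.nn as nn

class ReMakeSketch(nn.Module):
    def __init__(self, dim=256, patch=16):
        super().__init__()
        # RGB (3 ch) + instance mask (1 ch) -> patch tokens -> Transformer encoder
        self.rgb_mask_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        self.rgb_mask_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        # Separate (assumed convolutional) encoders for relative depth and raw depth
        self.rel_depth_enc = nn.Sequential(nn.Conv2d(1, dim, patch, patch), nn.ReLU())
        self.raw_depth_enc = nn.Sequential(nn.Conv2d(1, dim, patch, patch), nn.ReLU())
        # Fuse all features and decode back to a dense depth map
        self.fuse = nn.Conv2d(3 * dim, dim, 1)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, 1, 3, padding=1))

    def forward(self, rgb, mask, rel_depth, raw_depth):
        x = self.rgb_mask_embed(torch.cat([rgb, mask], dim=1))    # (B, dim, H/p, W/p)
        b, c, h, w = x.shape
        tokens = self.rgb_mask_enc(x.flatten(2).transpose(1, 2))  # (B, HW/p^2, dim)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        feats = torch.cat([x,
                           self.rel_depth_enc(rel_depth),
                           self.raw_depth_enc(raw_depth)], dim=1)
        return self.decoder(self.fuse(feats))                     # (B, 1, H, W) completed depth
```

In this sketch the mask is injected at the input so the network can attend to transparent regions explicitly, while the relative depth branch supplies the scene-level spatial context described in the abstract.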
@article{cheng2025rethinkingtransparentobjectgrasping,
title={Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask},
author={Yaofeng Cheng and Xinkai Gao and Sen Zhang and Chao Zeng and Fusheng Zha and Lining Sun and Chenguang Yang},
year={2025},
eprint={2508.02507},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.02507},
}
If you have any questions, please feel free to contact us: