The basic idea is to use SAM (the Segment Anything Model) to create a generic object mask so we can exclude the background.
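Once SAM has produced a binary object mask, excluding the background is just a masked selection over the image. A minimal numpy sketch (the `image` and `mask` arrays here are tiny stand-ins for SAM's actual output, not part of the original pipeline code):

```python
import numpy as np

# Stand-in for an RGB image (H x W x 3) and a SAM-style binary object mask.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
mask = np.array([[True, False, False],
                 [True, True, False]])  # True = object pixel

# Zero out the background, keeping only object pixels.
foreground = np.where(mask[..., None], image, 0)

# Or gather just the object pixels as an (N, 3) array of colors.
object_colors = image[mask]
print(object_colors.shape)  # (3, 3)
```

The same boolean-mask indexing carries through the rest of the pipeline: every downstream step only ever sees the pixels SAM marked as object.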
The next step is to generate a depth image. Here we use the awesome ZoeDepth to estimate realistic, metric depth from the color image.
With depth, color, and an object mask, we have everything needed to create a colored point cloud of the object from a single view.
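Combining the three is a standard pinhole back-projection: each masked pixel is lifted to 3D using its depth value and the camera intrinsics. A sketch of that step, assuming hypothetical intrinsics `fx, fy, cx, cy` (toy values below, not taken from the post):

```python
import numpy as np

def backproject(depth, image, mask, fx, fy, cx, cy):
    """Lift masked pixels to 3D with the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth."""
    v, u = np.nonzero(mask)              # pixel coordinates of object pixels
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1) # (N, 3) positions
    colors = image[v, u]                 # (N, 3) RGB values
    return points, colors

# Toy inputs: a 2x2 depth map at constant depth, every pixel masked in.
depth = np.full((2, 2), 2.0)
image = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.ones((2, 2), dtype=bool)
pts, cols = backproject(depth, image, mask, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
print(pts.shape, cols.shape)  # (4, 3) (4, 3)
```

The result is exactly the colored, background-free point cloud that gets handed to MCC.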
MCC encodes the colored points and then creates a reconstruction by sweeping through the volume, querying the network for occupancy and color at each point.
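The sweep can be pictured as evaluating a learned function over a dense 3D grid and keeping the points it marks occupied. A sketch with a dummy query function standing in for MCC's decoder (the sphere-shaped occupancy field is purely illustrative):

```python
import numpy as np

def dummy_query(points):
    """Stand-in for MCC's decoder: returns (occupancy, rgb) per query point.
    Here, 'occupied' means inside a sphere of radius 0.5 around the origin."""
    occ = np.linalg.norm(points, axis=1) < 0.5
    rgb = np.clip(points + 0.5, 0.0, 1.0)  # arbitrary color field
    return occ, rgb

# Sweep a dense grid over the volume [-1, 1]^3.
n = 16
axis = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"),
                axis=-1).reshape(-1, 3)

occ, rgb = dummy_query(grid)
reconstruction_pts = grid[occ]   # occupied positions
reconstruction_rgb = rgb[occ]    # their predicted colors
print(reconstruction_pts.shape[0], "occupied points")
```

In the real pipeline the query function is MCC's network conditioned on the encoded input points; the grid sweep and thresholding are the same idea.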
This is a great example of how a lot of cool solutions are built these days: by stringing together more targeted pre-trained models.

The details of the three building blocks can be found in the respective papers: