¶ Introduction to 3D Reconstruction
Welcome to Computer Vision Decoded, where we explain the ever-evolving and complex concepts in computer vision. In this episode, I'm doing something a little different. My co-host and I had a live meetup in Pittsburgh a few weeks ago, and we recorded our talk on the evolution of 3D reconstruction. Jared Heinle leads this talk, and he has a plethora of knowledge on this specific topic.
In his talk, Jared covers where 3D reconstruction came from, where we're at today, and where it's heading. I guarantee there's something in this episode that you'll learn even if you've been working in 3D reconstruction for your whole career. And since you're listening to the podcast version of Computer Vision Decoded, I do suggest tuning into the YouTube link that will be in the description of today's episode, because there's a great visual component to it. Well, without further ado, let's get to Jared's presentation on the evolution of 3D reconstruction.
To start, because I love 3D reconstruction, I thought I'd talk a little bit about the evolution of 3D reconstruction. My goal here is to be very high level. I'm not going to dive into depth about particular techniques or methods. I'm just trying to paint a picture of how 3D reconstruction has evolved over the decades or centuries. How have people thought about understanding the 3D world?
What is 3D reconstruction? To me, it's digitizing reality. It's taking this 3D space around us and converting it into some representation that, in this case, a computer can understand. That can be people, it can be underwater, it can be buildings. That representation could be point clouds, it could be meshes, it could be polygons. There are so many different ways to think about 3D data. It can be incredibly tiny, you know, from microscope images, or just commonplace everyday outdoor environments.
¶ Active and Passive Sensing Methods
When I think about 3D reconstruction, I also think about, well, how are we perceiving it? How are we sensing that world? In my mind, there are two main ways: active sensing and passive sensing. In active sensing, your device, your sensor, is actively sending out light or signals into the world and measuring what comes back. In this case, here's a screenshot from the original Xbox Kinect sensor, which you could use for playing games or tracking human body pose. It used a structured light sensor: it sent out infrared light in a dot pattern and then had an offset camera that would look at the displacement of those little dots of light to infer how much parallax had occurred in the pattern, and use that to infer depth.
So you can send out structured light patterns. You can use LiDAR, where you're measuring the time of flight of a laser pulse. How long does it take to come back? What's the phase of that pulse? You can infer a lot of information. Over here, I'm hinting at a laser line scanner, where you send out a laser beam, spread it into a line, have an offset camera, and see, well, how does that line project onto my surface? What's that silhouette? And you use that to infer something about the shape of your scene.
You also have passive techniques, which a lot of the time rely on images. Here you are not actively sending out any information, you're just receiving. The dominant way this is done is through light, visible or invisible, but it's those photons.
And so you can imagine taking a bunch of images of something from many different angles and using that to infer 3D information. Here I'm hinting at shape from shading. Even from a single photo, just look at the way the light falls off around the person's face: it's brighter at the front, and as it moves around, it falls off. That's due both to light attenuation and to the difference between the direction of the light and the angle of the surface. So you can use that to infer, well, what was the shape of that face, or the depth of things? There are ways, even from a single photo, to recover 3D information just based on the shading of what's there.
And these photos on the right: we as humans are really good at understanding depth or 3D data even from a single photo, and there are a few different cues that we use. For instance, here on the top left, haze or fog. We understand that if something looks really foggy, typically that means it's farther away than something that's crisp and clear. Or in the top right: we as humans like symmetry, we like lines and repetition, so you end up with a lot of vanishing lines or vanishing points in a scene. We understand that something closer to the vanishing point is typically farther away than something at the opposite end.
Our eyes, we do have a pair of eyes, and we use that to perceive depth. But even with a single eye, you can perceive depth based on defocus. When your eye focuses on something near versus something far, we use that defocus data to understand what's near and far. So these keys on this keyboard are nice and crisp, and the ones far away are blurry. We understand that means there's some depth ordering to that information.
So we had active sensors, and we've got passive sensors. From here I'm primarily going to talk about passive sensors, specifically images, because that's what's, in my mind, most prevalent today. We've all got a camera in our pocket. We pull it out. We take photos. That's a great way to passively receive information about the world, and we can use that for 3D reconstruction.
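The two active-sensing ideas from this section, time of flight and the parallax of a projected pattern, each reduce to a one-line formula. The sketch below is only an illustration under simplifying assumptions: an idealized pulse with no noise, and a rectified projector and camera pair. The function names and the numeric values are hypothetical and do not describe any particular sensor.

```python
# Textbook depth formulas for two kinds of active sensing (illustrative only).

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_depth(round_trip_seconds):
    """Distance to a surface from the round-trip time of a light pulse."""
    # The pulse travels out and back, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


def parallax_depth(focal_length_px, baseline_m, shift_px):
    """Depth from how far a projected dot shifts in an offset camera.

    Assumes an idealized, rectified setup where depth = f * B / shift.
    """
    if shift_px <= 0:
        raise ValueError("shift must be positive for a finite depth")
    return focal_length_px * baseline_m / shift_px


if __name__ == "__main__":
    print(time_of_flight_depth(20e-9))         # a 20 ns round trip
    print(parallax_depth(580.0, 0.075, 29.0))  # hypothetical sensor values
```

In both cases the sensor measures something simple (a time or a pixel shift) and known geometry turns it into a distance.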
¶ Historical Evolution of 3D Techniques
Going back now to the evolution of this. Computer vision, 3D reconstruction, understanding the 3D world is not something that's just happened in the past several decades. We as humans have been trying to understand 3D data for a long time. I didn't include it here, but you can go back to Picasso and da Vinci and early artists, as painting moved toward understanding perspective and how we can represent 3D information in a 2D image.
And so this is a case where, even in the 1830s, people took two photos or two line drawings, understanding that our two eyes see from different perspectives. If I take two photos from different perspectives, I can build a reproduction of an environment that someone can then look through and see in 3D. Now, it's not really 3D. I can't move around it, but you've got that stereo pair. You've got that stereoscope as early as the 1830s.
Mapping played a big role in humanity's drive to understand geometry in large spaces. In the 1860s, someone coined the term metrophotography. Here they were trying to build maps of large areas and said, well, instead of trying to do this by hand, what if I can get to a nice vantage point and see the same peak on a building from multiple different perspectives? Can I triangulate that point in space, understand where I took the photo, understand where that triangulated point was, and use that to help arrange the geography of the map?
Another gentleman was doing essentially the same thing with building architecture, and he called it photogrammetry. That term has survived to today, while metrophotography has fallen out of favor. But here it's really about understanding the geometry of a camera and the geometry of a lens. How does light refract through that lens? You end up with an image that's upside down and flipped. How can we understand that? And again, it's using multiple camera positions to triangulate points on the surface of a building to understand its 3D geometry.
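The triangulation idea described here, seeing the same point from two known vantage points and intersecting the sight lines, can be sketched on a flat map. This is a minimal 2D illustration, assuming the station positions and the bearings to the target are already known. The function name and the coordinate convention are assumptions made for this example, not anything from the talk.

```python
import math


def triangulate_2d(station_a, bearing_a_deg, station_b, bearing_b_deg):
    """Intersect two sight lines on a flat map.

    Each station is an (x, y) position. Each bearing is the direction to the
    target, in degrees counter-clockwise from the +x axis.
    """
    ax, ay = station_a
    bx, by = station_b
    dax = math.cos(math.radians(bearing_a_deg))
    day = math.sin(math.radians(bearing_a_deg))
    dbx = math.cos(math.radians(bearing_b_deg))
    dby = math.sin(math.radians(bearing_b_deg))

    # Solve station_a + t * dir_a == station_b + s * dir_b for t (Cramer's rule).
    denom = dax * dby - day * dbx
    if abs(denom) < 1e-12:
        raise ValueError("sight lines are parallel; no unique intersection")
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)


if __name__ == "__main__":
    # Two vantage points 10 units apart, both sighting the same peak.
    print(triangulate_2d((0.0, 0.0), 45.0, (10.0, 0.0), 135.0))
```

With real photographs the bearings come from where the point lands in each image, which is why the geometry of the camera and lens matters so much.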
Jumping ahead to the early 1900s, again with mapmaking: in World War I, maps were of big importance for understanding the geography of the battlefields. And so aerial photography started coming into play. With cameras mounted on aircraft, now you're taking photos from above.
And then the task became, well, how do you stitch these photos together? The earliest approach is that you just put a bunch of photos beside each other, line them all up, and say, there's my map. But there are challenges with that. As an aside, you don't necessarily always need aircraft. You can use pigeons. That was a nice covert way to discover and map out terrain when you didn't want someone to see what you were photographing. But back to actual geometry.
As I was saying, how do we understand these aerial photographs? People built mechanical devices to help them do that. There were stereo plotters, where you could load two different images at the top and shine light through them. Based on the geometry, you would encode in the mechanism the perspective from which those photos were taken. Then an operator would sit down here with a device where he could move a platter up and down. Both photos would be projected onto that platter, and the operator could see when those photos converge as he changes the depth. Whichever depth had the best convergence and alignment, that's where they would figure out, oh, this is the elevation of that point of the terrain. And so they would painstakingly move through the photograph, marking the depth of each location. Now you could build up that elevation map of the terrain that you had just surveyed. So it's a very, very manual operation.
¶ Computers and Early Stereo Vision
30 years later, we've got computers, and that helps automate the process. At this point, computers were not understanding the photos. It still required a human operator to look through an eyepiece to look for that alignment. But once the operator said, great, I've got alignment, and tapped a button, the computer could digitize and record that information and do the processing that would have been done mechanically before. Now the computer could build up the 3D map there in the computer itself. But this was still only operating on a single pair of photos, two photos of a similar area with high overlap.
Around the same time, another person was asking, well, I've got all of these photos. This ended up being an area of Vermont that was 40 by 80 miles, with tons of photos, all overlapping. We've got so much redundancy. If I try to line up all the photos, they align in some areas, but then it drifts elsewhere. How can I make all of these photographs line up?
And so what he did is he built a mathematical model called bundle adjustment. The idea of bundle adjustment is that it's an optimization problem, a least-squares problem. He had surveyed ground control, so he knew what the coordinates of certain places in Vermont should have been, and he found those points in a bunch of images. And between neighboring images, he knew that this pixel in one image should match that pixel in another. So you end up with a bunch of constraints, image to image or image to ground control point, and he was able to build a mathematical model that minimized the error across all of those constraints.
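To make the least-squares idea concrete, here is a deliberately tiny stand-in. It is not bundle adjustment itself: there is no camera model and no 3D points, only photo positions along a one-dimensional strip. The structure of the problem is the same, though. Neighboring photos supply relative constraints that drift when chained, a few ground control points supply absolute constraints, and the solver minimizes the sum of squared errors over both. The function name, the weights, and the use of plain gradient descent are all assumptions for illustration. Real solvers typically use Gauss-Newton or Levenberg-Marquardt style methods.

```python
def adjust_strip(relative_offsets, control_points, control_weight=10.0,
                 step=0.02, iterations=5000):
    """Toy 1D adjustment: find photo positions that best satisfy all constraints.

    relative_offsets[i] is the measured offset from photo i to photo i + 1.
    control_points maps a photo index to its surveyed position.
    """
    n = len(relative_offsets) + 1
    # Initial guess: chain the relative offsets (this is where drift builds up).
    x = [0.0]
    for d in relative_offsets:
        x.append(x[-1] + d)

    for _ in range(iterations):
        grad = [0.0] * n
        # Image-to-image constraints: squared error on each neighbor offset.
        for i, d in enumerate(relative_offsets):
            r = (x[i + 1] - x[i]) - d
            grad[i + 1] += 2.0 * r
            grad[i] -= 2.0 * r
        # Image-to-ground-control constraints: squared error on surveyed spots.
        for k, g in control_points.items():
            grad[k] += 2.0 * control_weight * (x[k] - g)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    # Chained offsets say the strip is 40.8 long; the survey says 40.0.
    print(adjust_strip([10.2, 10.1, 10.3, 10.2], {0: 0.0, 4: 40.0}))
```

The adjusted positions spread the disagreement evenly across the strip instead of letting it pile up at the far end, which is the behavior the talk describes as fixing drift.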
And with that, not only did you get great alignment between the images, but you got it because now you could model the uneven distortion in the camera lens. A camera lens is a piece of glass or other transparent material, and you've got a bunch of these lens elements. There is typically some amount of distortion, spherical distortion, barrel distortion, pincushion distortion, where you end up with aberrations in the projection of that image. Through bundle adjustment, he was able to compensate for that lens distortion and get really great alignment of the imagery. So here people were using computers to analyze maps, or computers to help solve these mathematical models. Other early researchers were looking at something entirely different: how can a computer understand a photo?
We as humans, we look at a photo, we understand what's in it, but how can we program a computer to do the same thing? In this case, they were looking at just simple geometric primitives, saying, let's constrain our scene to be a black background with simple white objects in it. And then, taking these objects, can we extract lines from them? Where are the edges in that image?
Can we infer what that object looks like? How might we render it from a different perspective? Even back in the 1960s, people were building software that detected these objects and then hypothesized what those objects might look like from different perspectives. This person was looking primarily at polygons, lines, simple geometric primitives. Someone else said, well, no, I want to look more at layers. They decomposed the image into basic patterns (points, dots, lines, text), trying to understand which parts of the image had similar texture, then split those into different layers to understand the relative depth of the objects, and based on that tried to infer something about the geometry. So, again, the question is how we can have a computer automatically understand the depth, the 3D information, in a photo.
I mentioned before that stereoplotter, the mechanical device where a human had to find the alignment between two photos. Well, here in the 70s with computers, people finally realized, no, we can have a computer do that alignment for us. This was the birth of stereo vision, just what our eyes do naturally when I look at something and both my eyes see the same thing. This was a computational, mathematical model to understand, well, if I see four points in one eye and I see four points in the other eye, what is the relative depth of those points? They formulated it sort of like a graph-solving problem, you know, a min-cut, max-flow style graph minimization problem, to say: if my two input images, in this case, were just two patches of noise, how can I iteratively refine that to figure out the displacement between those two images that gives me both relative depth and smoothness in that depth? So, programming these computers to understand how to map pixels in one image to pixels in the other, to understand that parallax in order to give me depth. This is back in the 1970s. Jumping forward a bit.
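The core of that pixel-mapping step can be sketched along a single image row. Note that this is a swapped-in technique: simple window matching by sum of absolute differences, with no smoothness term, and not the iterative graph-style formulation described in the talk. It assumes the two images are rectified so that a point at column x in the left image appears at column x - d in the right image. All names and values are hypothetical.

```python
def match_scanline(left, right, window=1, max_disparity=4):
    """Per-pixel disparity along one row using sum of absolute differences."""
    n = len(left)
    disparities = []
    for x in range(n):
        best_d, best_cost = 0, float("inf")
        for d in range(max_disparity + 1):
            cost = 0.0
            for k in range(-window, window + 1):
                xl = min(max(x + k, 0), n - 1)      # clamp at the borders
                xr = min(max(x + k - d, 0), n - 1)
                cost += abs(left[xl] - right[xr])
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a pixel shift into depth for an idealized stereo pair."""
    if disparity_px <= 0:
        return float("inf")  # no measurable parallax
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    right = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0, 7, 3]
    left = [3, 3] + right[:-2]  # the same texture shifted by two pixels
    print(match_scanline(left, right))
```

Away from the image borders, the recovered shift matches the two-pixel offset used to build the example, and a larger shift means a closer surface.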
¶ Modern Image-Based 3D Methods
Now, a bit more to modern days, stereo vision has evolved quite a bit. Instead of depth maps like we see over here, now we can get depth maps like this. If I have a pair of images looking at, in this case, a fountain, not only can we recover the depth of all those pixels, but we can recover the surface normal, you know, the orientation of that surface, and then render that in 3D. And it works quite well on a variety of scenes.
Think of those concepts of metrophotography and photogrammetry, where surveyors were manually marking points in images and trying to triangulate them. Now there's a whole field of feature extraction, feature detection, and feature description, where software has been written to automatically find salient points, interesting repeatable points in an image. Maybe it's a corner, or some bright object surrounded by dark, or the opposite of that. Or maybe you find a blob, a dark circle surrounded by white. But you're finding these really repeatable areas, detecting those 2D keypoints, describing what they look like, and then automatically discovering, between two images, which pairs of keypoints look the most similar. So you're able to match the corner of a structure in one image to the same corner in the other image based on the high amount of visual similarity.
And there are techniques to do that. We can automatically filter those matches based on valid camera transformations. I could say, I've got a bunch of matches between my images, but is there a valid camera transform that takes me from one image to the other?
If I can find that, I say, oh, yeah, these 90% of matches all conform to a valid camera motion, and these other 10% don't make sense. Well, we get rid of that 10% and keep the 90% that remains. Now I've got these automated matches between two images. If I take these automated matches plus the concept of bundle adjustment, now I can do something called structure from motion. In this case, this is incremental structure from motion, where I start out with a pair of photos and triangulate the points that I saw. So I had a pair of photos, matches between them, triangulated points. Now I come along and grab a third image. Well, how does it align? Oh, it aligns over here. Let me triangulate its points and add it in. So I can add image after image after image, incrementally building up a 3D model.
Over time, that might drift, and that's where I go and run bundle adjustment. Bundle adjustment will optimize the camera positions, optimize the 3D points, and correct for lens distortion, and you incrementally build up a 3D model of your entire scene. This is one very common technique that people use today to take images and build a 3D model from them, but it's incremental.
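The match-filtering step mentioned earlier, keeping the matches that agree with one valid camera motion and discarding the rest, can be illustrated with a heavily simplified consensus test. Two assumptions are made loudly here. First, the "camera motion" is reduced to a pure 2D image shift, whereas real pipelines test matches against epipolar geometry. Second, every match is tried as a hypothesis, whereas real pipelines usually sample randomly in a RANSAC loop. The function name and tolerance are hypothetical.

```python
def filter_matches_by_consensus(matches, tolerance=2.0):
    """Keep the matches that agree with the most widely supported 2D shift.

    matches: list of ((x1, y1), (x2, y2)) keypoint pairs between two images.
    """
    best_inliers = []
    for p, q in matches:
        # Hypothesis: the shift implied by this single match.
        dx, dy = q[0] - p[0], q[1] - p[1]
        inliers = [
            (a, b) for (a, b) in matches
            if abs((b[0] - a[0]) - dx) <= tolerance
            and abs((b[1] - a[1]) - dy) <= tolerance
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


if __name__ == "__main__":
    good = [((i * 5.0, i * 3.0), (i * 5.0 + 10.0, i * 3.0)) for i in range(9)]
    bad = [((1.0, 1.0), (50.0, -20.0))]  # a mismatch that fits no common motion
    kept = filter_matches_by_consensus(good + bad)
    print(len(kept), "of", len(good + bad), "matches kept")
```

The surviving matches are the ones that would then be triangulated and handed to bundle adjustment.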
So there are other approaches that say, well, I don't want to wait to add all these photos one by one. If I've got all of these constraints, if I know which images match each other, why can't I solve all of them at the same time? Turns out you can. There are techniques called global structure from motion, which take all of your images as input. Sometimes they start with just a random initialization and say, minimize all the constraints. So it minimizes all the constraints, basically these camera-to-point-to-image correspondences, and you can end up with a 3D model like that. There are pros and cons between incremental and global. Global is typically faster, but incremental is typically more accurate, because in incremental you have the opportunity to correct for errors at every single step. Sometimes there may be mismatches or images that don't really align well, and you can detect that more easily when you're doing this image-by-image process. Whereas if you throw them all in at once, sometimes it's hard in that minimization to figure out where those outliers are and effectively remove them. But these are very common techniques.
And more recently, these techniques can scale up to really large scenes. This is an example of the ruins in Rome, almost 75,000 images. It was built up using incremental structure from motion. Each of these little red dots is where a person was standing when they took a photo, and you end up with these sparse 3D points. Each of these points was triangulated from the matches between images, and you have this really cool geometry.
These are the kind of results I really like. I stare at sparse point clouds a lot. It may not look photorealistic, but it's got all of the geometry there. It knows where all these images were taken. I've got all the perspective. I've got all that geometry solved. What's missing, maybe, is just some of the details in the scene. But that raw skeleton is all there.
So that was structure from motion. I mentioned before the concept of stereo vision, taking two images and trying to estimate the depth of the pixels in those images. Well, nowadays we can do that on a much larger scale. This is an example with thousands of images where, once we've solved for all the camera poses, we can do this per-pixel depth estimation, saying, well, in one image, I see this color. Where did that pixel end up in another image? We search over it, find the depth, and that allows us to triangulate where it should be. Then we can fuse all those depth estimates together to generate some really nice-looking results. And this is just a comparison between a previous technique and a new one that someone was proposing.
¶ Crowdsourced Large-Scale Reconstruction
Sort of as an aside, where did all this data come from? You know, so those 75,000 images of downtown Rome came from crowdsourced photo collections. You know, and so this is the notion like, yeah, there are some data sets out there that are highly curated. You know, Google Street View, they're driving their cars up and down. They're capturing the same style of imagery for that entire street or that entire city.
This says, well, no, maybe it's very expensive to have a single capture device covering such a large scale. Why can't we leverage the phones and the cameras that people just have in their pockets? The projects I'm going to show here were built from people's vacation photos. Back at the time, there was a site, Flickr. It was owned by Yahoo. People could upload their vacation photos, and when they uploaded them, they could put a license on them. Some of those photos were marked, yep, free for commercial use or research use or whatever. And so researchers downloaded those photos and tried to build 3D models from them. There was a project back in 2009 where they said, oh, we want to build Rome in a day. They downloaded 100,000 images of Rome, threw them onto a supercomputer, and within 24 hours built 3D models of the ruins there in Rome.
Well, a year later, the research group that I was a part of, and this was right before I joined, said, we don't need a supercomputer. We can do it on one computer. And we can actually do 3 million images, not 100,000. So they said, we can build Rome on a cloudless day. You're not leveraging cloud computing. Basically, they were trying to streamline every single step of the reconstruction process to make it as efficient as possible at generating all these 3D models. That was in 2010, which is when I started my PhD in this group. When I was finishing up my PhD in 2015, I said, I think I can do one better. I'm not going to build Rome in a day. I'm going to build the world in six days, and then on the seventh day, have the computer rest. Here I downloaded 100 million images off of Flickr, covering places all over the world. I had one computer, and I said, go. Four days later, it had sorted through all the photos and figured out how they're related, and then it spent two days building the 3D models. I ended up with 12,000 3D reconstructions of various landmarks around the world. That was a ton of fun.
¶ Machine Learning in 3D Reconstruction
So I've talked a lot about algorithms and geometry. A lot of this was people writing algorithms, techniques, and software to understand images. Nowadays, what about machine learning? Machine learning, AI, it's all the rage. If you go to CVPR, it's really hard to find a paper that doesn't have machine learning in it somewhere. So what are people using machine learning for? Well, back to monocular depth estimation. From a single photo, there is no geometry. A lot of the time you need two photos to understand depth. I did say there are some depth cues, vanishing lines or other context clues that we might pick up on. But writing software to do that is sometimes finicky, whereas machine learning models are pretty good at it. From a single photo, machine learning models can understand the depth of a scene. You take that image as input, it runs through the neural net, and it's able to output depth.
This is just for illustration purposes. But machine learning is also great at stereo depth. So you can take a pair of images as input and do that same thing. Something that's really nice about machine learning is that it can leverage multiple sources of information to generate that final result. So not only do you have sort of the geometric cues, those perspective cues of, well, how did that patch of pixels change from image to image?
You know, what is that disparity? What's the parallax? I can triangulate depth. Other models are using context to understand, well, what am I looking at? Am I looking at something that looks like a nice, flat, smooth surface? If so, well, then maybe my depth should also be flat. You know, am I looking at something that looks like a basketball?
If so, well, then I should make sure it's a sphere. So these machine learning models are better able to understand context clues and the environment to generate those 3D maps. Previous techniques were relying on per-pixel constraints, saying, I found a pixel in one image, can I find that pixel in another image? Machine learning is able to bridge the gap between that perception and the understanding and the semantics.
So this, again, is only understanding the depth of a photo. There's a whole other class of techniques which say, no, we don't need geometry. We can just do this all ourselves. Machine learning can understand a scene all by itself, which I think is actually kind of impressive. There's a class of techniques, DUSt3R followed up by MASt3R, where you just give it a bunch of images and out pops a 3D model. It is regressing over those images, trying to figure out, well, what was the camera pose? Where were those images taken? And then what is the underlying 3D model that supports the imagery that was seen? For instance, here are 32 images orbiting a pyramid, and out pops, in this case, a point cloud, a very dense point cloud that shows the geometry of what the model thinks it saw in those images. And it runs pretty fast. Now, this requires a pretty beefy GPU, but it works. Here's another scene, walking along a building facade. Take 64 images as input, and you're able to recover the geometry. So it doesn't require any structure from motion or any explicit depth input. Just take those images in, and out pops a 3D model.
¶ Neural Radiance Fields and Splatting
Going a little bit different direction. So a lot of these results that I've shown here are actually point clouds. So when we estimate that depth, we got that pixel. We project it out in the 3D space. That pixel has color. So it's just XYZ RGB. You know, depth, 3D position, and color.
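Projecting a pixel with known depth out into 3D space is a short calculation under the standard pinhole camera model. The sketch below assumes an undistorted image and known intrinsics (focal lengths `fx`, `fy` and principal point `cx`, `cy`, all in pixels). The function name and example values are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy, rgb):
    """Turn one pixel with known depth into an XYZRGB point (camera frame)."""
    # Undo the pinhole projection: offset from the image center, scaled by depth.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth, *rgb)


if __name__ == "__main__":
    # A pixel 100 columns right of the image center, 2 meters away.
    print(backproject(420, 240, 2.0, 500.0, 500.0, 320.0, 240.0, (255, 128, 0)))
```

Doing this for every pixel of a depth map, and then moving the points from each camera's frame into a shared world frame using the camera poses, gives the kind of colored point cloud shown in these results.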
Well, there are a lot of other different representations for 3D data, some of which are better or easier for machine learning. So someone had the idea of, well, we're using machine learning to generate a point cloud. You know, we're using a neural net to generate some 3D data. What if the neural net was the 3D data?
And so this is neural radiance fields. A neural radiance field is the 3D reconstruction. What happens is you give it a bunch of images as input, along with the poses of those images. So it assumes you already have an understanding of how the images were captured. Then what the machine learning model is trying to do is understand, at each point in space, what is the color and transparency of that point? You formulate it as a set of queries. You say, for each pixel in an image, let me trace that ray into the scene and ask the machine learning model, well, what's the color here? What's the color here? What's the color here? You keep asking all along the ray, get all of those colors, sum them up, and that is the color of the pixel.
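The "sum them up" step is, in the usual volume-rendering formulation, a weighted sum: each sample along the ray contributes its color in proportion to how opaque it is and how much light has not already been blocked by samples in front of it. The sketch below shows only that compositing step. The samples are hard-coded stand-ins for what a trained network would return, and the function name and step size are assumptions for illustration.

```python
import math


def render_ray(samples, step):
    """Composite (density, (r, g, b)) samples ordered from near to far."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for density, rgb in samples:
        alpha = 1.0 - math.exp(-density * step)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return tuple(color)


if __name__ == "__main__":
    # Empty space, then a dense red surface that hides the green behind it.
    ray = [(0.0, (0.0, 0.0, 1.0)),
           (1e6, (1.0, 0.0, 0.0)),
           (1e6, (0.0, 1.0, 0.0))]
    print(render_ray(ray, step=0.1))
```

Because every operation here is differentiable, the difference between the rendered pixel and the real photo can be pushed back through the sum to update the model.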
Well, each of these images has a bunch of pixels, and so you can use all of these constraints to constrain that model so that it can learn an accurate representation of the entire scene. So there is no 3D data here. You say, well, where's the point cloud? Where's the mesh? There is none. It's a neural net that has sort of memorized what that scene looks like at every point in space. It's a volumetric representation compressed into a machine learning model.
Here's an example. This is from a blog by Jonathan, who posts videos online. Great tutorials. This is NVIDIA's Instant NeRF. What I'm trying to show here is that machine learning model running and learning in real time. Right now this bridge is kind of fuzzy, but every second that goes by, it's iteratively getting better and better as the model continues its iterations, doing gradient descent, trying to understand: I've got all these constraints, how can I keep minimizing the error in that network? So in this case, the machine learning model was the 3D data. Well, that's kind of cumbersome, because I would really like to have some sort of 3D representation that I can manipulate.
So different researchers said, well, instead of having the model memorize what the data is, what if I had the model manipulate the data in the scene and come up with a 3D representation for me? This is a technique called Gaussian splatting. It's kind of like a point cloud in that you have a bunch of points in space, but now, instead of just XYZ RGB, each point has a shape, a size, and a transparency, which is represented via a 3D Gaussian. So, like a blob.
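One way to picture the difference from a plain XYZ RGB point is as a small record with extra fields. The sketch below is a simplified assumption, not the actual data layout used by Gaussian splatting implementations: the blob is axis-aligned (a per-axis scale only, no rotation) and has a single fixed color. The class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Splat:
    center: tuple   # (x, y, z) position, like a point in a point cloud
    scale: tuple    # standard deviation along each axis (shape and size)
    color: tuple    # (r, g, b)
    opacity: float  # peak opacity at the center, in [0, 1]

    def opacity_at(self, point):
        """Opacity contributed at a 3D point: peak opacity times Gaussian falloff."""
        exponent = sum(
            ((p - c) / s) ** 2
            for p, c, s in zip(point, self.center, self.scale)
        )
        return self.opacity * math.exp(-0.5 * exponent)


if __name__ == "__main__":
    blob = Splat(center=(0.0, 0.0, 0.0), scale=(1.0, 2.0, 1.0),
                 color=(1.0, 0.0, 0.0), opacity=0.8)
    print(blob.opacity_at((0.0, 0.0, 0.0)), blob.opacity_at((0.0, 2.0, 0.0)))
```

The optimization then adjusts exactly these fields (position, shape, color, opacity) for a very large number of blobs at once.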
And here, well, I won't call it machine learning. It's basically a big optimization problem, saying, I've got a bunch of images from known perspectives. How can I optimize a set of Gaussians to summarize the appearance of the scene? So again, you formulate it as a big optimization problem, give it all of the images as input, and it will iteratively move these Gaussians around, stretch them, shrink them, split them, inject new ones, and remove bad ones, to converge to a scene that looks very realistic. Its goal is not really to generate accurate geometry. Its goal is to generate an accurate visualization, to reproduce those photos as well as possible. Here's another example that Jonathan's showing. It's an outdoor scene where we had a drone orbiting this piece of equipment, and the 3D data ends up looking pretty darn realistic. You're not constrained to where the photos were taken from. You can move around, zoom in, read all that text, see the dirt, see the rust, see all the patterns all over that equipment. And you can see, yeah, maybe in the background it's kind of fuzzy. All of that was blurry in the photos, and so it ends up blurry in your 3D reconstruction. But this is a great way to generate a visualization for inspection of a scene.
That was Gaussian splatting, with these Gaussians, these blobs. Different researchers said, well, why do I need Gaussians? Our video cards, our graphics cards, they love triangles. Why can't we just do this with triangles directly? Sure enough, you can. You can, again, formulate that optimization problem, and instead of optimizing over Gaussians, just optimize over triangles. You can get some really good results that way, too. They call that triangle splatting.
So, just to summarize some of these machine learning techniques, there are some subtle differences. One: with the end-to-end neural reconstruction, all that it needed as input was the images. That machine learning model was able to recover the pose of the images as well as the 3D representation. Whereas something like a NeRF, that neural radiance field, or a Gaussian splat needed the camera poses. It said, I need to know where these images were taken from, because that's not my job. And then it's optimizing the geometry inside of it.
Then for the output: with end-to-end neural reconstruction, you give it the images as input and out come depth maps, camera poses, 3D point clouds, the representation of the scene. Whereas with a NeRF, you don't get an explicit 3D representation. All you get is the machine learning model that you can query and say, hey, for this particular point, from this direction, what was my color? Out pops RGB. So for a NeRF, the output is that model, whereas with a Gaussian splat, the output is that point cloud, that set of Gaussians, or, if it was a triangle splat, the set of triangles. It's a little bit different there, but in both cases you're using optimization or insights from machine learning to come up with optimized 3D representations of the scene. And that's where I just kind of want to leave it.
¶ Diverse Applications and Future Outlook
I'll just say that there are many ways that people are working on understanding the 3D world. What is the type of sensor? I didn't even talk much about active sensing, like LiDAR or active depth sensors. Your iPhone Pro, iPhone 12 Pro and newer, has a LiDAR sensor built right into the thing. That is an active sensor that gives you depth for free. It's only good out to about five meters, but that's really powerful in the hands of a consumer in their home. There are a lot of great things you can do there. Autonomous driving: we're in Pittsburgh, you know, robotics, stereo cameras. There are a lot of ways to fuse 3D data with odometry data from your vehicle, your wheel encoders, whatever that may be. There are lots of ways to apply 3D reconstruction to many domains, both large and small. So for me, I love 3D computer vision. I really appreciate you all coming today. And yeah, again, thank you.
Well, there you have it. I hope you learned something in this episode. And if you liked it, please subscribe because that lets us know we're on the right track for what you like to listen to. With that, I'll see you in the next episode.
