MIT Research News' Journal
 

Wednesday, March 4th, 2020

    2:00p
    A new model of vision

    When we open our eyes, we immediately see our surroundings in great detail. How the brain is able to form these richly detailed representations of the world so quickly is one of the biggest unsolved puzzles in the study of vision.

    Scientists who study the brain have tried to replicate this phenomenon using computer models of vision, but so far, leading models only perform much simpler tasks such as picking out an object or a face against a cluttered background. Now, a team led by MIT cognitive scientists has produced a computer model that captures the human visual system’s ability to quickly generate a detailed scene description from an image, and offers some insight into how the brain achieves this.

    “What we were trying to do in this work is to explain how perception can be so much richer than just attaching semantic labels on parts of an image, and to explore the question of how do we see all of the physical world,” says Josh Tenenbaum, a professor of computational cognitive science and a member of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds, and Machines (CBMM).

    The new model posits that when the brain receives visual input, it quickly performs a series of computations that reverse the steps that a computer graphics program would use to generate a 2D representation of a face or other object. This type of model, known as efficient inverse graphics (EIG), also correlates well with electrical recordings from face-selective regions in the brains of nonhuman primates, suggesting that the primate visual system may be organized in much the same way as the computer model, the researchers say.

    Ilker Yildirim, a former MIT postdoc who is now an assistant professor of psychology at Yale University, is the lead author of the paper, which appears today in Science Advances. Tenenbaum and Winrich Freiwald, a professor of neurosciences and behavior at Rockefeller University, are the senior authors of the study. Mario Belledonne, a graduate student at Yale, is also an author.

    Inverse graphics

    Decades of research on the brain’s visual system have examined, in great detail, how light falling on the retina is transformed into cohesive scenes. This understanding has helped artificial intelligence researchers develop computer models that can replicate aspects of this system, such as recognizing faces or other objects.

    “Vision is the functional aspect of the brain that we understand the best, in humans and other animals,” Tenenbaum says. “And computer vision is one of the most successful areas of AI at this point. We take for granted that machines can now look at pictures and recognize faces very well, and detect other kinds of objects.”

    However, even these sophisticated artificial intelligence systems don’t come close to what the human visual system can do, Yildirim says.

    “Our brains don’t just detect that there’s an object over there, or recognize and put a label on something,” he says. “We see all of the shapes, the geometry, the surfaces, the textures. We see a very rich world.”

    More than a century ago, the physician, physicist, and philosopher Hermann von Helmholtz theorized that the brain creates these rich representations by reversing the process of image formation. He hypothesized that the visual system includes an image generator that would be used, for example, to produce the faces that we see during dreams. Running this generator in reverse would allow the brain to work backward from the image and infer what kind of face or other object would produce that image, the researchers say.

    However, the question remained: How could the brain perform this process, known as inverse graphics, so quickly? Computer scientists have tried to create algorithms that could perform this feat, but the best previous systems require many cycles of iterative processing, taking much longer than the 100 to 200 milliseconds the brain requires to create a detailed visual representation of what you’re seeing. Neuroscientists believe perception in the brain can proceed so quickly because it is implemented in a mostly feedforward pass through several hierarchically organized layers of neural processing.

    The MIT-led team set out to build a special kind of deep neural network model to show how a neural hierarchy can quickly infer the underlying features of a scene — in this case, a specific face. In contrast to the standard deep neural networks used in computer vision, which are trained from labeled data indicating the class of an object in the image, the researchers’ network is trained from a model that reflects the brain’s internal representations of what scenes with faces can look like.

    Their model thus learns to reverse the steps performed by a computer graphics program for generating faces. These graphics programs begin with a three-dimensional representation of an individual face and then convert it into a two-dimensional image, as seen from a particular viewpoint. These images can be placed on an arbitrary background image. The researchers theorize that the brain’s visual system may do something similar when you dream or conjure a mental image of someone’s face.
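
    To make that training scheme concrete, here is a minimal sketch of how such image-latent training pairs could be generated. A fixed random linear map stands in for a real graphics engine, and all names, dimensions, and distributions are illustrative assumptions rather than the paper’s actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "renderer": a fixed random linear map from scene latents to
# pixels. A real graphics engine would project a 3D face model to a 2D
# image from the sampled viewpoint and composite it onto a background.
LATENT_DIM, IMAGE_DIM = 412, 64 * 64
fake_renderer = rng.standard_normal((IMAGE_DIM, LATENT_DIM))

def sample_face_scene():
    """Sample hypothetical latent scene variables (names illustrative):
    3D shape and texture coefficients, head pose, and lighting."""
    return np.concatenate([
        rng.standard_normal(200),    # 3D shape coefficients
        rng.standard_normal(200),    # texture/albedo coefficients
        rng.uniform(-1, 1, size=3),  # head rotation
        rng.standard_normal(9),      # lighting terms
    ])

# Training pairs: the rendered image is the network input, and the
# latents that generated it are the targets the network must recover.
dataset = []
for _ in range(1000):
    latents = sample_face_scene()
    image = fake_renderer @ latents
    dataset.append((image, latents))
```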

    The researchers trained their deep neural network to perform these steps in reverse — that is, it begins with the 2D image and then adds features such as texture, curvature, and lighting, to create what the researchers call a “2.5D” representation. These 2.5D images specify the shape and color of the face from a particular viewpoint. Those are then converted into 3D representations, which don’t depend on the viewpoint.
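
    The sketch below, in PyTorch, shows one way a purely feedforward network with an explicit 2.5D intermediate stage could be wired up. The layer sizes, channel meanings, and input resolution are invented for illustration and are not the architecture reported in the paper:

```python
import torch
import torch.nn as nn

class EIGSketch(nn.Module):
    """Illustrative sketch: one feedforward pass maps a 2D image to a
    viewpoint-dependent 2.5D stage, then to viewpoint-invariant 3D
    latents. All layer sizes are invented."""
    def __init__(self, latent_dim=200):
        super().__init__()
        # 2D image -> feature maps (early visual processing)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        # features -> "2.5D" maps: per-pixel shape and color from the
        # current viewpoint (here, 4 channels: depth + RGB albedo)
        self.to_2p5d = nn.Conv2d(64, 4, 3, padding=1)
        # 2.5D -> viewpoint-invariant 3D shape/texture coefficients
        self.to_3d = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, image):
        feats = self.encoder(image)
        stage_2p5d = self.to_2p5d(feats)   # viewpoint-dependent
        stage_3d = self.to_3d(stage_2p5d)  # viewpoint-invariant
        return stage_2p5d, stage_3d

model = EIGSketch()
img = torch.randn(1, 3, 256, 256)  # dummy 256x256 RGB input
surfaces, latents = model(img)     # (1, 4, 64, 64) and (1, 200)
```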

    “The model gives a systems-level account of the processing of faces in the brain, allowing it to see an image and ultimately arrive at a 3D object, which includes representations of shape and texture, through this important intermediate stage of a 2.5D image,” Yildirim says.

    Model performance

    The researchers found that their model is consistent with data obtained by studying certain regions in the brains of macaque monkeys. In a study published in 2010, Freiwald and Doris Tsao of Caltech recorded the activity of neurons in those regions and analyzed how they responded to 25 different faces, seen from seven different viewpoints. That study revealed three stages of higher-level face processing, which the MIT team now hypothesizes correspond to three stages of their inverse graphics model: roughly, a 2.5D viewpoint-dependent stage; a stage that bridges from 2.5 to 3D; and a 3D, viewpoint-invariant stage of face representation.
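
    One standard way to test such a stage-to-stage correspondence is representational similarity analysis: correlate the dissimilarity structure of each brain region’s responses with that of each model stage. The sketch below illustrates the idea on random data, using region labels from the macaque face-patch literature; the paper’s actual analysis may differ:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def rdm(responses):
    """Condensed representational dissimilarity matrix over stimuli;
    responses has shape (n_stimuli, n_units)."""
    return pdist(responses, metric="correlation")

# 25 identities x 7 viewpoints = 175 stimulus conditions, mirroring the
# 2010 recordings. Random data here, so the numbers are meaningless;
# the real analysis would use recorded and model responses.
n_stim = 25 * 7
regions = {r: rng.standard_normal((n_stim, 100)) for r in ("ML/MF", "AL", "AM")}
stages = {s: rng.standard_normal((n_stim, 100)) for s in ("2.5D", "2.5D-to-3D", "3D")}

# Correlate dissimilarity structure for each hypothesized region/stage pair.
for (region, r_resp), (stage, s_resp) in zip(regions.items(), stages.items()):
    rho, _ = spearmanr(rdm(r_resp), rdm(s_resp))
    print(f"{region} vs {stage}: Spearman rho = {rho:.2f}")
```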

    “What we show is that both the quantitative and qualitative response properties of those three levels of the brain seem to fit remarkably well with the top three levels of the network that we’ve built,” Tenenbaum says.

    The researchers also compared the model’s performance to that of humans on a task that involves recognizing faces from different viewpoints. The task becomes harder when researchers alter the faces, either by removing a face’s texture while preserving its shape or by distorting the shape while preserving the texture. The new model’s performance was much more similar to that of humans than that of the computer models used in state-of-the-art face-recognition software, offering additional evidence that this model may be closer to mimicking what happens in the human visual system.
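
    A simple way to quantify “more similar to humans” is to compare each model’s accuracy pattern across stimulus conditions against the human pattern. The sketch below shows the idea; all accuracy numbers are invented for illustration, not the study’s results:

```python
import numpy as np

# Hypothetical fractions correct on the same recognition trials, in the
# order: normal, texture removed, shape distorted (values invented).
human    = np.array([0.95, 0.80, 0.72])
eig      = np.array([0.93, 0.78, 0.70])  # an inverse-graphics-style model
baseline = np.array([0.99, 0.55, 0.40])  # a standard recognition net

def gap_to_humans(model_acc):
    """Mean absolute accuracy gap across conditions; a smaller gap
    means a more human-like pattern of successes and failures."""
    return float(np.mean(np.abs(model_acc - human)))

for name, acc in (("EIG-style model", eig), ("baseline net", baseline)):
    print(f"{name}: mean gap to humans = {gap_to_humans(acc):.3f}")
```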

    “This work is exciting because it introduces interpretable stages of intermediate representation into a feedforward neural network model of face recognition,” says Nikolaus Kriegeskorte, a professor of psychology and neuroscience at Columbia University, who was not involved in the research. “Their approach merges the classical idea that vision inverts a model of how the image was generated, with modern deep feedforward networks. It’s very interesting that this model better explains neural representations and behavioral responses.”

    The researchers now plan to continue testing the modeling approach on additional images, including objects that aren’t faces, to investigate whether inverse graphics might also explain how the brain perceives other kinds of scenes. In addition, they believe that adapting this approach to computer vision could lead to better-performing AI systems.

    “If we can show evidence that these models might correspond to how the brain works, this work could lead computer vision researchers to take more seriously and invest more engineering resources in this inverse graphics approach to perception,” Tenenbaum says. “The brain is still the gold standard for any kind of machine that sees the world richly and quickly.”

    The research was funded by the Center for Brains, Minds, and Machines at MIT, the National Science Foundation, the National Eye Institute, the Office of Naval Research, the New York Stem Cell Foundation, the Toyota Research Institute, and Mitsubishi Electric.

    11:59p
    New approach to sustainable building takes shape in Boston

    A new building about to take shape in Boston’s Roxbury area could, its designers hope, herald a new way of building residential structures in cities.

    Designed by architects from MIT and the design and construction firm Placetailor, the five-story building’s structure will be made from cross-laminated timber (CLT), which eliminates most of the greenhouse-gas emissions associated with standard building materials. It will be assembled on site mostly from factory-built subunits, and it will be so energy-efficient that its net carbon emissions will be essentially zero.

    Most attempts to quantify a building’s greenhouse gas contributions focus on the building’s operations, especially its heating and cooling systems. But the materials used in a building’s construction, especially steel and concrete, are also major sources of carbon emissions and need to be included in any realistic comparison of different types of construction.

    Wood construction has tended to be limited to single-family houses or smaller apartment buildings with just a few units, narrowing the impact that it can have in urban areas. But recent developments — involving the production of large-scale wood components, known as mass timber; the use of techniques such as cross-laminated timber; and changes in U.S. building codes — now make it possible to extend wood’s reach into much larger buildings, potentially up to 18 stories high.

    Several recent buildings in Europe have been pushing these limits, and now a few larger wooden buildings are beginning to take shape in the U.S. as well. The new project in Boston will be one of the largest such residential buildings in the U.S. to date, as well as one of the most innovative, thanks to its construction methods.

    Described as a Passive House Demonstration Project, the Boston building will consist of 14 residential units of various sizes, along with a ground-floor co-working space for the community. The building was designed by Generate Architecture and Technologies, a startup company out of MIT and Harvard University, headed by John Klein, in partnership with Placetailor, a design, development, and construction company that has specialized in building net-zero-energy and carbon-neutral buildings for more than a decade in the Boston area.

    Klein, who has been a principal investigator in MIT’s Department of Architecture and now serves as CEO of Generate, says that large buildings made from mass timber and assembled using the kit-of-parts approach he and his colleagues have been developing have a number of potential advantages over conventionally built structures of similar dimensions. For starters, even when factoring in the energy used in felling, transporting, assembling, and finishing the structural lumber pieces, the total carbon emissions produced would be less than half that of a comparable building made with conventional steel or concrete. Klein, along with collaborators from engineering firm BuroHappold Engineering and ecological market development firm Olifant, will be presenting a detailed analysis of these lifecycle emissions comparisons later this year at the annual Passive and Low Energy Architecture (PLEA) conference in A Coruña, Spain, whose theme this year is “planning post-carbon cities.”

    For that study, Klein and his co-authors modeled nine different versions of an eight-story mass-timber building, along with one steel and one concrete version of the building, all with the same overall scale and specifications. Their analysis showed that materials for the steel-based building produced the most greenhouse emissions; the concrete version produced 8 percent less than that; and one version of the mass-timber building produced 53 percent less.
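
    A quick back-of-the-envelope reading of those figures, normalized to the steel version of the building:

```python
# Relative embodied (materials) emissions, normalized to the steel
# version of the same eight-story building, using the reported percentages.
steel = 1.00
concrete = steel * (1 - 0.08)     # 8 percent less than steel -> 0.92
mass_timber = steel * (1 - 0.53)  # best timber variant -> 0.47

print(f"concrete: {concrete:.2f}x steel")
print(f"mass timber: {mass_timber:.2f}x steel, "
      f"{mass_timber / concrete:.2f}x concrete")
# ~0.47x steel and ~0.51x concrete: roughly half the materials
# emissions of either conventional structure.
```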

    The first question people tend to ask about the idea of building tall structures out of wood is: What about fire? But Klein says this question has been thoroughly studied, and tests have shown that, in fact, a mass-timber building retains its structural strength longer than a comparable steel-framed building. That’s because the large timber elements, typically a foot thick or more, are made by gluing together several layers of conventional dimensioned lumber. These will char on the outside when exposed to fire, but the charred layer provides good insulation and protects the interior wood for an extended period. Steel buildings, by contrast, can collapse suddenly once a fire heats the frame enough to soften the steel, which loses most of its structural strength well below its melting point.

    The kit-based approach that Generate and Placetailor have developed, which the team calls Model-C, means that in designing a new building, it’s possible to use a series of preconfigured modules, assembled in different ways, to create a wide variety of structures of different sizes and for different uses, much like assembling a toy structure out of LEGO blocks. These subunits can be built in factories in a standardized process and then trucked to the site and bolted together. This process can reduce the impact of weather by keeping much of the fabrication process indoors in a controlled environment, while minimizing the construction time on site and thus reducing the construction’s impact on the neighborhood.
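
    Conceptually, a kit of parts is a small catalog of module types that recombine into many buildings. The sketch below is a toy illustration of that idea; the module names and dimensions are invented and do not reflect Generate’s actual Model-C system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    """One factory-built subunit in a hypothetical kit of parts.
    Fields and values are illustrative only."""
    kind: str
    width_m: float
    depth_m: float

# A small catalog of preconfigured module types...
catalog = {
    "stair": Module("stair core", 2.8, 5.6),
    "bath": Module("bathroom pod, plumbing preinstalled", 2.4, 3.0),
    "panel": Module("CLT floor panel", 3.0, 9.0),
}

# ...recombined into floor plans, with floor plans stacked into a
# building, much like snapping together standardized blocks.
floor_plan = ["stair", "panel", "bath", "panel", "bath", "panel"]
building = [floor_plan for _ in range(5)]  # a five-story configuration

footprint = sum(catalog[m].width_m * catalog[m].depth_m for m in floor_plan)
print(f"modules per floor: {len(floor_plan)}, floor area ~{footprint:.0f} m^2")
```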

    Animation depicts the process of assembling the mass-timber building from a set of factory-built components. Courtesy of Generate Architecture and Technologies

    “It’s a way to rapidly deploy these kinds of projects through a standardized system,” Klein says. “It’s a way to build rapidly in cities, using an aesthetic that embraces offsite industrial construction.”

    Because the thick wood structural elements are naturally very good insulators, the Roxbury building’s energy needs for heating and cooling are reduced compared to conventional construction, Klein says. The thick timber also provides very good acoustic insulation for the building’s occupants. In addition, the building is designed to have solar panels on its roof, which will help to offset the building’s energy use.

    The team won a wood innovation grant in 2018 from the U.S. Forest Service to develop a mass-timber-based system for midscale housing developments. The new Boston building will be the first demonstration project for the system they developed.

    “It’s really a system, not a one-off prototype,” Klein says. With the on-site assembly of factory-built modules, which include fully assembled bathrooms with the plumbing in place, he says the basic structure of the building can be completed in only about one week per floor.

    “We’re all aware of the need for an immediate transition to a zero-carbon economy, and the building sector is a prime target,” says Andres Bernal SM ’13, Placetailor’s director of architecture. “As a company that has delivered only zero-carbon buildings for over a decade, we’re very excited to be working with CLT/mass timber as an option for scaling up our approach and sharing the kit-of-parts and lessons learned with the rest of the Boston community.”

    With U.S. building codes now allowing for mass timber buildings of up to 18 stories, Klein hopes that this building will mark the beginning of a new boom in wood-based or hybrid construction, which he says could help to provide a market for large-scale sustainable forestry, as well as for sustainable, net-zero energy housing.

    “We see it as very competitive with concrete and steel for buildings of between eight and 12 stories,” he says. Such buildings, he adds, are likely to have great appeal, especially to younger generations, because “sustainability is very important to them. This provides solutions for developers that have a real market differentiation.”

    He adds that Boston has set a goal of building thousands of new units of housing, and also a goal of making the city carbon-neutral. “Here’s a solution that does both,” he says.

    The project team included Evan Smith and Colin Booth at Placetailor Development; in addition to Klein, Zlatan Sehovic, Chris Weaver, John Fechtel, Jaehun Woo, and Clarence Yi-Hsien Lee at Generate Design; Andres Bernal, Michelangelo LaTona, Travis Anderson, and Elizabeth Hauver at Placetailor Design; Laura Jolly and Evan Smith at Placetailor Construction; Paul Richardson and Wolf Mangelsdorf at Burohappold; Sonia Barrantes and Jacob Staub at Ripcord Engineering; and Brian Kuhn and Caitlin Gamache at Code Red.

