MIT Research News' Journal
 

Friday, December 2nd, 2016

    12:00a
    Computer learns to recognize sounds by watching video

    In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.

    But recognition of natural sounds — such as crowds cheering or waves crashing — has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data first has to be annotated by hand, which is prohibitively expensive for all but the highest-demand applications.

    Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but doesn’t require hand-annotated data during training.

    Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.

    “Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound.”

    The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.

    “Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author and a postdoc in the lab of MIT professor of electrical engineering and computer science Antonio Torralba. Torralba is the final co-author on the paper.

    “We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — ‘Is it a restaurant?’ — those details are missing. Even for annotation purposes, the task is really hard.”

    Complementary modalities

    Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.

    When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.

    “For instance, think of a self-driving car,” Aytar says. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance — which path it’s going to take — just purely based on sound.”

    Visual language

    The researchers’ machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected. Information — say, the pixel values of a digital image — is fed to the bottom layer of nodes, which processes it and feeds it to the next layer, which processes it and feeds it to the next layer, and so on. The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data — say, identifying the objects in the image.
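
    To make that layered picture concrete, here is a minimal sketch of such a network in Python using PyTorch. The layer sizes, class count, and random training batch are placeholders for illustration only, not the architecture used in this work.

        import torch
        import torch.nn as nn

        # A toy stack of densely connected layers: input features enter the
        # bottom layer, each layer processes its input and passes the result
        # upward, and the final layer emits one score per class.
        net = nn.Sequential(
            nn.Linear(4096, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

        optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        # One training step: nudge the settings (weights) of the nodes so the
        # final layer's output better matches the correct labels.
        x = torch.randn(32, 4096)          # a batch of input feature vectors
        y = torch.randint(0, 10, (32,))    # their class labels
        loss = loss_fn(net(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()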

    Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba’s group and that of his MIT colleague Aude Oliva, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.

    Once the network was trained, the researchers fed it the visual track of 26 terabytes of video data downloaded from the photo-sharing site Flickr. “It’s about 2 million unique videos,” Vondrick says. “If you were to watch all of them back to back, it would take you about two years.” Then they trained a second neural network on the audio from the same videos. The second network’s goal was to correctly predict the object and scene tags produced by the first network.
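
    One way to picture that second training stage is as a student-teacher setup: the frozen vision network supplies soft object and scene labels for each video, and the audio network is trained to reproduce them from the soundtrack alone. The sketch below assumes a KL-divergence objective and placeholder module names; it illustrates the idea rather than reproducing the authors' code.

        import torch
        import torch.nn.functional as F

        def cross_modal_step(audio_net, vision_net, waveform, frames, optimizer):
            """One training step: match the audio network's predictions to the
            frozen vision network's object/scene distribution for the same video.
            `audio_net` and `vision_net` are assumed torch.nn.Module models."""
            with torch.no_grad():                              # teacher is frozen
                target = F.softmax(vision_net(frames), dim=1)  # soft visual labels
            log_pred = F.log_softmax(audio_net(waveform), dim=1)
            loss = F.kl_div(log_pred, target, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()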

    The result was a network that could interpret natural sounds in terms of image categories. For instance, it might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.
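
    Once trained this way, the audio network can be queried directly: feed it a clip and read off the image categories it considers most likely. The helper below is a hypothetical sketch, and the category names are invented examples; it assumes the network outputs one score per category in the list.

        import torch
        import torch.nn.functional as F

        # Placeholder names standing in for the vision network's object and
        # scene vocabulary (one entry per output unit of the audio network).
        CATEGORIES = ["forest", "bird", "birdhouse", "bird feeder", "beach", "restaurant"]

        def top_image_categories(audio_net, waveform, k=3):
            """Return the k image categories the audio network associates with a clip."""
            with torch.no_grad():
                probs = F.softmax(audio_net(waveform), dim=1)[0]
            values, indices = torch.topk(probs, k)
            return [(CATEGORIES[int(i)], float(p)) for p, i in zip(values, indices)]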

    Benchmarking

    To compare the sound-recognition network’s performance to that of its predecessors, however, the researchers needed a way to translate its language of images into the familiar language of sound names. So they trained a simple machine-learning system to associate the outputs of the sound-recognition network with a set of standard sound labels.

    For that, the researchers did use a database of annotated audio — one with 50 categories of sound and about 2,000 examples. Those annotations had been supplied by humans. But it’s much easier to label 2,000 examples than to label 2 million. And the MIT researchers’ network, trained first on unlabeled video, significantly outperformed all previous networks trained solely on the 2,000 labeled examples.
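
    For that last step, a simple supervised model sitting on top of the unsupervised network's outputs is enough. Below is a hedged sketch using a linear SVM from scikit-learn; the random arrays are stand-ins for feature vectors extracted by the trained audio network and for the 50 human-supplied labels of the roughly 2,000 clips.

        import numpy as np
        from sklearn.svm import LinearSVC
        from sklearn.model_selection import cross_val_score

        # Stand-ins for the real data: one feature vector per labeled clip
        # (as produced by the trained audio network) and its human label.
        rng = np.random.default_rng(0)
        features = rng.standard_normal((2000, 1024))
        labels = rng.integers(0, 50, size=2000)

        # A simple linear classifier maps the network's feature space onto the
        # 50 standard sound categories; cross-validation estimates accuracy.
        clf = LinearSVC(C=0.01, max_iter=5000)
        scores = cross_val_score(clf, features, labels, cv=5)
        print(f"mean accuracy: {scores.mean():.2%}")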

    “With the modern machine-learning approaches, like deep learning, you have many, many trainable parameters in many layers in your neural-network system,” says Mark Plumbley, a professor of signal processing at the University of Surrey. “That normally means that you have to have many, many examples to train that on. And we have seen that sometimes there’s not enough data to be able to use a deep-learning system without some other help. Here the advantage is that they are using large amounts of other video information to train the network and then doing an additional step where they specialize the network for this particular task. That approach is very promising because it leverages this existing information from another field.”

    Plumbley says that both he and colleagues at other institutions have been involved in efforts to commercialize sound recognition software for applications such as home security, where it might, for instance, respond to the sound of breaking glass. Other uses might include eldercare, to identify potentially alarming deviations from ordinary sound patterns, or to control sound pollution in urban areas. “I really think that there’s a lot of potential in the sound-recognition area,” he says.

    5:00a
    A radiation-free approach to imaging molecules in the brain

    Scientists hoping to get a glimpse of molecules that control brain activity have devised a new probe that allows them to image these molecules without using any chemical or radioactive labels.

    Currently the gold standard approach to imaging molecules in the brain is to tag them with radioactive probes. However, these probes offer low resolution and they can’t easily be used to watch dynamic events, says Alan Jasanoff, an MIT professor of biological engineering and brain and cognitive sciences.

    Jasanoff and his colleagues have developed new sensors consisting of proteins designed to detect a particular target; when they do, they dilate blood vessels in the immediate area. This produces a change in blood flow that can be imaged with magnetic resonance imaging (MRI) or other imaging techniques.

    “This is an idea that enables us to detect molecules that are in the brain at biologically low levels, and to do that with these imaging agents or contrast agents that can ultimately be used in humans,” Jasanoff says. “We can also turn them on and off, and that’s really key to trying to detect dynamic processes in the brain.”

    In a paper appearing in the Dec. 2 issue of Nature Communications, Jasanoff and his colleagues used these probes to detect enzymes called proteases, but their ultimate goal is to use them to monitor the activity of neurotransmitters, which act as chemical messengers between brain cells.

    The paper’s lead authors are postdoc Mitul Desai and former MIT graduate student Adrian Slusarczyk. Recent MIT graduate Ashley Chapin and postdoc Mariya Barch are also authors of the paper.

    Indirect imaging

    To make their probes, the researchers modified a naturally occurring peptide called calcitonin gene-related peptide (CGRP), which is active primarily during migraines or inflammation. The researchers engineered the peptides so that they are trapped within a protein cage that keeps them from interacting with blood vessels. When the peptides encounter proteases in the brain, the proteases cut the cages open and the CGRP causes nearby blood vessels to dilate. Imaging this dilation with MRI allows the researchers to determine where the proteases were detected.

    “These are molecules that aren’t visualized directly, but instead produce changes in the body that can then be visualized very effectively by imaging,” Jasanoff says.

    Proteases are sometimes used as biomarkers to diagnose diseases such as cancer and Alzheimer’s disease. However, Jasanoff’s lab used them in this study mainly to demonstrate the validity of their approach. Now, they are working on adapting these imaging agents to monitor neurotransmitters, such as dopamine and serotonin, that are critical to cognition and processing emotions.

    To do that, the researchers plan to modify the cages surrounding the CGRP so that they can be removed by interaction with a particular neurotransmitter.

    “What we want to be able to do is detect levels of neurotransmitter that are 100-fold lower than what we’ve seen so far. We also want to be able to use far less of these molecular imaging agents in organisms. That’s one of the key hurdles to trying to bring this approach into people,” Jasanoff says.

    Jeff Bulte, a professor of radiology and radiological science at the Johns Hopkins School of Medicine, described the technique as “original and innovative,” while adding that its safety and long-term physiological effects will require more study.

    “It’s interesting that they have designed a reporter without using any kind of metal probe or contrast agent,” says Bulte, who was not involved in the research. “An MRI reporter that works really well is the holy grail in the field of molecular and cellular imaging.”

    Tracking genes

    Another possible application for this type of imaging is to engineer cells so that the gene for CGRP is turned on at the same time that a gene of interest is turned on. That way, scientists could use the CGRP-induced changes in blood flow to track which cells are expressing the target gene, which could help them determine the roles of those cells and genes in different behaviors. Jasanoff’s team demonstrated the feasibility of this approach by showing that implanted cells expressing CGRP could be recognized by imaging.

    “Many behaviors involve turning on genes, and you could use this kind of approach to measure where and when the genes are turned on in different parts of the brain,” Jasanoff says.

    His lab is also working on ways to deliver the peptides without injecting them, which would require finding a way to get them to pass through the blood-brain barrier. This barrier separates the brain from circulating blood and prevents large molecules from entering the brain.

    The research was funded by the National Institutes of Health BRAIN Initiative, the MIT Simons Center for the Social Brain, and fellowships from the Boehringer Ingelheim Fonds and the Friends of the McGovern Institute.

    12:55p
    Lincoln Laboratory is honored with the Herschel Award

    At the recent 2016 Military Sensing Symposia (MSS) Detectors and Materials Conference, MIT Lincoln Laboratory was presented with the Herschel Award for its development of digital-pixel readout integrated circuits. The Herschel Award, given by the MSS Specialty Group on Detectors and Materials, recognizes a major breakthrough in infrared device science or technology.

    Digital-pixel readout integrated circuits can achieve very high dynamic range, high sensitivity, and fast data rates compared with conventional technology. The MIT Lincoln Laboratory team received the award for its work on the Vital Infrared Sensor Technology Acceleration (VISTA) program. The VISTA program was a tri-service program managed by the U.S. Army's Night Vision and Electronic Sensors Directorate; it was established to improve the nation's capability in advanced infrared focal plane array technology for military sensors.

    The Herschel Award, named for Sir William Herschel, a British astronomer who is credited with discovering infrared radiation, is one of the most prestigious awards offered by the MSS. It is not bestowed annually but is given when the selection committee, composed of leading members of the infrared scientific community, determines an individual or organization has contributed a significant advancement to the infrared science and industry community. Until this year, the Herschel Award had not been presented since 2011.

    "This award is the result of the hard work of many engineers over nearly 15 years. It is because of the dedication and hard work of some of the most exceptionally talented engineers that we have been able to bring the digital-pixel dream to a reality," said Michael Kelly, associate leader of Lincoln Laboratory's Advanced Imager Technology Group and a principal investigator on the development team. "We have been very fortunate to have received strong support from a number of U.S. government sponsors. In particular, the VISTA program ‎provided our team with the platform and funding needed to rapidly mature the technology and make the advances for which we have been recognized with this award," he added. 

    The MSS is a set of conferences on military sensing technologies. These conferences, managed by specialty groups, address a broad range of sensing domains, including detectors and materials, radar, infrared countermeasures, passive and active electro-optical sensors, missile defense sensors, and more. The MSS runs a fellowship program to recognize individuals whose body of technical work contributed significantly to military sensing and an awards program to recognize individuals and organizations who have made important breakthroughs in sensing technologies.

