With Google Research Award, Amin Karbasi Sorts Out The Internet

03/04/2016

Let's say you go to France and take a lot of photos, a few of which you want to send to your friend. You don't send a slew of photos of the Eiffel Tower. To give the full story of your trip, you send one from the Louvre, one from the Notre Dame Cathedral, and of course, one of the Eiffel Tower.

That's easy enough with a personal camera. But try doing that with the Internet. On YouTube, 300 hours of footage are uploaded every minute. Instagram users post 220,000 photos per minute, and Facebook generates 2.5 million pieces of content per minute. Sorting all of this out is a big task.

Google, which certainly has an interest in making sense of this data, recently awarded Amin Karbasi a Google Research Award. Karbasi, who has been with Yale since 2014, frequently meets with Google officials to talk about his research.

"Even the simplest machine learning methods that people use, when you gather tens of millions of data points, aren't going to work efficiently," said Karbasi, assistant professor of electrical engineering & computer science. "My research is on how you can turn this bigger data into smaller, but representative, data."

One method Karbasi is looking at involves choosing elements from a particular dataset that fall into a category, but aren't overly similar.

"We are trying to come up with algorithms that can do this kind of thing fast," he said. "What we do is we represent every image by a data point, or a vector, and then we can define distances between the vectors." He compares the data points to molecules of a gas – they're far from each other, but fill the entire space.

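As a rough illustration of the idea Karbasi describes, the sketch below treats each image as a short vector and greedily picks points that are far from everything chosen so far. This is a standard farthest-point heuristic used here as a stand-in, not the algorithm from Karbasi's lab; the function name `farthest_point_selection` and the toy vectors are assumptions made for the example.

```python
import math

def farthest_point_selection(points, k):
    """Greedily pick up to k indices so the chosen points are spread out.

    Illustrative stand-in only: each new pick is the point whose distance
    to its nearest already-chosen point is largest.
    """
    if k <= 0 or not points:
        return []
    chosen = [0]  # arbitrary starting point (assumption of this sketch)
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Toy "image vectors": three tight clusters, like three landmarks
# photographed several times each.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0),
           (10.0, 0.0), (10.0, 0.1)]
picked = farthest_point_selection(vectors, 3)
print([vectors[i] for i in picked])
```

On this toy data the three picks land in three different clusters, which mirrors the vacation example: one photo per landmark rather than three of the Eiffel Tower.
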
To test this, Karbasi's research team applied their method to a publicly available dataset called "tiny images," which contains 80 million images crawled from the web. "What we wanted to do was summarize this data: if you want to pick 100 images, which ones? We came up with algorithms that can do this very fast."

They developed a distributed algorithm that chops the data into small pieces so that each piece can be processed on a single computer. "And then we merge the results, and do something intelligent with them," he said.
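
The split-then-merge pattern can be sketched in a few lines. The code below is a hypothetical illustration under simplifying assumptions: the pieces are handled one after another in a loop rather than on separate machines, and the per-piece summarizer is a simple farthest-point heuristic rather than the team's actual method. The names `spread_out_subset` and `summarize_in_pieces` are invented for the example.

```python
import math

def spread_out_subset(points, k):
    """Stand-in summarizer: greedily return up to k well-separated points."""
    if k <= 0 or not points:
        return []
    chosen = [points[0]]  # arbitrary starting point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(points[i])
        min_dist = [min(d, math.dist(p, points[i]))
                    for d, p in zip(min_dist, points)]
    return chosen

def summarize_in_pieces(points, k, num_pieces):
    """Two-round sketch: summarize each piece, then summarize the merged picks."""
    # Round 1: chop the data into pieces; each piece could go to its own
    # computer (here they are simply processed in a loop).
    pieces = [points[i::num_pieces] for i in range(num_pieces)]
    candidates = []
    for piece in pieces:
        candidates.extend(spread_out_subset(piece, k))
    # Round 2: merge the per-piece results and summarize once more.
    return spread_out_subset(candidates, k)

vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0),
           (10.0, 0.0), (10.0, 0.1)]
summary = summarize_in_pieces(vectors, k=3, num_pieces=2)
print(summary)
```

The second round only ever sees `k` candidates per piece, so the merge step stays small even when the full dataset does not fit on one computer.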

Performing the same task with classical algorithms, he said, would take a very long time. With his lab's computer resources, it took only a few hours. With Google's resources, he said it could take only a few seconds.