





The Feature: Hyperlinking the World



While most of us snap silly candids with our cameraphones, computer vision researcher Hartmut Neven is leveraging the ubiquity of digital cameras to google the world.



For computer vision researcher Hartmut Neven, the proliferation of cameraphones is an opportunity to put his life's work into every consumer's pocket. Neven, the head of the Laboratory for Human-Machine Interfaces at the University of Southern California's Information Sciences Institute, has developed image-recognition software optimized for mobile phone microprocessors. His technology, sold through start-up Neven Vision, already powers gimmicky MMS services from Vodafone Japan and NTT DoCoMo that automatically overlay special effects like tears or a halo on cameraphone video images. The Los Angeles Police Department is testing the same underlying facial analysis technology in the form of a digital "mugshot book." Officers on the streets point a camera at a suspect to see if his or her face matches anyone in their rogues' gallery.



Neven's eyes are on the future, though. His long-term goal is to bring biometrics to the mobile masses and hyperlink the world through a system best described as "a visual Google."



TheFeature: What do you mean by "visual Google"?



Neven: You take a picture of something, send it to our servers, and we either provide you with more information or link you to the place that will. Let's say you're standing in front of the Mona Lisa in the Louvre. You take a snapshot with your cameraphone and instantly receive an audio-visual narrative about the painting. Then you step out of the Louvre and see a cafe. Should you go in? Take a shot from the other side of the street and a restaurant guide will appear on your phone. You sit down inside, but perhaps your French is a little rusty. You take a picture of the menu and a dictionary comes up to translate. There's a huge variety of potential users in these kinds of situations, from stamp collectors to people who want to check their skin melanoma to police officers who need to identify the person in front of them.
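The query loop Neven describes (snap a picture, send it off, match it against a database, return an info link) can be illustrated with a toy end-to-end sketch. Everything here is hypothetical: the coarse intensity histogram stands in for whatever proprietary features Neven Vision actually extracts, and the database entries and URLs are invented.

```python
# Toy "visual Google" query: extract a compact descriptor from a snapshot
# and return the closest-matching database entry's info link.
import math

def descriptor(pixels, bins=8):
    """Reduce a grayscale image (flat list of 0-255 ints) to a normalized histogram."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    return [h / len(pixels) for h in hist]

def match(query_desc, database):
    """Return the entry whose descriptor is nearest (Euclidean) to the query's."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(database, key=lambda entry: dist(entry["desc"], query_desc))

# Hypothetical server-side database with precomputed descriptors.
db = [
    {"name": "Mona Lisa", "url": "louvre.example/mona-lisa",
     "desc": descriptor([40] * 90 + [200] * 10)},
    {"name": "Cafe menu", "url": "guide.example/cafe",
     "desc": descriptor([220] * 80 + [30] * 20)},
]

snapshot = [45] * 85 + [210] * 15  # mostly dark pixels, like the painting
best = match(descriptor(snapshot), db)
print(best["name"], best["url"])  # prints "Mona Lisa louvre.example/mona-lisa"
```

A real system would of course use far richer features than a brightness histogram, but the shape of the pipeline (client descriptor, server-side nearest-neighbor lookup, info link back) is the same.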



TheFeature: But how do you seed such a massive database of objects?



Neven: The key is to start with well-defined segments where the cost and effort of building the database is not that large. A nice rollout example would be a movie guide. If you see a billboard of a movie on a bus, you take a shot of it and then are routed to a relevant site where you can download a trailer or get show times. All we would need are images of a couple hundred billboards. The same is true with the Louvre example, where a collection of images already exists. With our technology, it doesn't take an expert to train the system to recognize an object.



TheFeature: You're planning to roll out the first version of this system in a year or so. What will it look like at the beginning?



Neven: Mobile advertising is a natural place to start this and get the kinks out. You could take a picture of the new BMW and automatically be entered into a sweepstakes. An advertising campaign would create awareness about this technology so people would learn how to use it.



There's a big leap between an automobile ad campaign and a visual database of the world though. Pulling off a visual Google is certainly a huge endeavor. We haven't fully explored scalability. We can comfortably do 100,000 objects, but can we do a million? For example, comparison shopping is an attractive application. A lady sees a handbag she likes but she wants to know the price of that handbag at other stores or see similar handbags that may be available. If you have hundreds of handbags that are only differentiated by small features, the system isn't good enough yet to discriminate.



TheFeature: What are the other technical challenges?



Neven: If you have millions of objects in the database, you can't search through every one of them looking for a match. The search strategies have to become smarter. You have to find effective ways to prune down your search early on so you're only comparing the photo you took to the most relevant sets of objects. Also, you eventually need to think about smart balancing between processing on the phone and the server. As the discrimination needs increase, you must account for finer image details. That would require sending a higher resolution image to the server, which could be expensive and clog the system. I think it would make sense to put more of the intelligence on the handset to do some pre-processing. That way, the handset would only send the necessary image features to the server where the recognition process would be completed.
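Neven's pruning idea can be sketched as a two-stage search: a coarse pass against a handful of bucket centroids picks the most promising subset, and only that subset gets the fine per-object comparison. The bucketing scheme below is an illustrative stand-in, not Neven Vision's actual index, and the handset/server split he proposes is noted in the comments as an assumption.

```python
# Sketch of coarse-to-fine search with early pruning (illustrative only).
import math
import random

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(objects, k=4, seed=0):
    """Assign each object to the nearest of k centroids sampled from the data."""
    rng = random.Random(seed)
    centroids = [o["feat"] for o in rng.sample(objects, k)]
    buckets = [[] for _ in range(k)]
    for o in objects:
        nearest = min(range(k), key=lambda j: dist(o["feat"], centroids[j]))
        buckets[nearest].append(o)
    return centroids, buckets

def search(query_feat, centroids, buckets):
    """Coarse pass: pick one centroid. Fine pass: scan only that bucket.
    In Neven's proposal, query_feat would be computed on the handset and
    sent to the server instead of a full-resolution image."""
    i = min(range(len(centroids)), key=lambda j: dist(query_feat, centroids[j]))
    return min(buckets[i], key=lambda o: dist(query_feat, o["feat"]))

# Toy database of 25 objects laid out on a grid of 2-D features.
objs = [{"name": f"obj{i}", "feat": [float(i % 5), float(i // 5)]}
        for i in range(25)]
cents, bks = build_index(objs)
hit = search([2.0, 3.0], cents, bks)  # exactly obj17's feature vector
print(hit["name"])  # prints "obj17"
```

With k buckets, the fine pass touches roughly 1/k of the database instead of all of it; real approximate indexes trade a small miss rate for exactly that kind of pruning speedup.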



TheFeature: You're also looking at ways to use your technology for secure mobile commerce.



Neven: Yes, the goal is to use cameraphones to accurately identify humans as well as objects. Right now in my laboratory, we have a working version of a single image multibiometric system. It fuses classic facial feature comparison with iris scanning and skin texture analysis. Your skin texture is like a fingerprint for your face. Our system can create a very high quality biometric signature without expensive sensors or cameras.
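Score-level fusion, one common way to combine multiple biometric matchers, gives a feel for how the facial-feature, iris, and skin-texture signals Neven mentions could be merged into a single accept/reject decision. The weights and threshold below are invented placeholders; the interview doesn't describe the actual system's fusion method.

```python
# Illustrative score-level fusion for a multibiometric check. Each matcher
# returns a similarity score in [0, 1]; a weighted sum is compared to a
# decision threshold. Weights and threshold are made-up placeholders.
WEIGHTS = {"face": 0.4, "iris": 0.35, "skin_texture": 0.25}
THRESHOLD = 0.7

def fuse(scores):
    """Combine per-modality similarity scores into one fused score."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

def authenticate(scores):
    """Accept only if the fused score clears the decision threshold."""
    return fuse(scores) >= THRESHOLD

print(authenticate({"face": 0.9, "iris": 0.8, "skin_texture": 0.75}))  # True
print(authenticate({"face": 0.9, "iris": 0.2, "skin_texture": 0.3}))   # False
```

The appeal of fusion is visible even in this toy: a strong face match alone (second call) can't push a forged attempt over the threshold when the other modalities disagree.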



TheFeature: Is there really enough demand to justify a biometric system in every phone?



Neven: I think one huge application is enough to justify it. We're working with smart card vendors and credit card companies on a solution that allows for more reliable authentication of users. Take mobile banking, for instance. The system would take your picture and provide access to the banking site or not. It adds a second layer of security (above passwords). Access control is another application. At a locked door, you'd be prompted to authenticate yourself with your Bluetooth-enabled phone.



There's a formidable infrastructure out there of modern multimedia cameraphones. It's perhaps the most popular consumer device ever. Now we can inject machine vision into that infrastructure to enable new services.

It's easy to see the precursors of a persistent augmented reality environment being established with some of these ideas.

It also seems to me that some of them could be carried out more efficiently with RFID tags embedded in objects like the movie billboard in his example above. Instead of having your camera send out for information based on image recognition, you would simply access the relevant content associated with that billboard's unique RFID number.

Eventually there will be enough computing power, and enough speed and capacity in database software, for each of us to have always-on cameras gathering information about our surroundings and feeding us a stream of relevant data about where we are and what we're looking at, everywhere we go. A combination of ubiquitous RFID tags and hyper-efficient image recognition technology could be the basis for AR as we will know it.



Thanks to Michael.


