Extractable Mobile Photo Tags

Ramesh Jain, Department of Computer Science, University of California, Irvine, Irvine, CA 92697, jain@ics.uci.edu
Mingyan Gao, Department of Computer Science, University of California, Irvine, Irvine, CA 92697, gaom@ics.uci.edu
Setareh Rafatirad, Department of Computer Science, University of California, Irvine, Irvine, CA 92697, srafatir@ics.uci.edu
Pinaki Sinha, Department of Computer Science, University of California, Irvine, Irvine, CA 92697, psinha@ics.uci.edu

ABSTRACT
Mobile phones are driving a major shift in how people shoot photos. Just over a decade ago, consumer behavior was plan-shoot-process-share-organize-reflect; the rapid proliferation of mobile phone cameras has replaced it with shoot-share-forget behavior. This trend will not last, because photos are more important than that: people treasure their memories in visual form. Fortunately, a plethora of sensors combined with access to the powerful Web may enable an effortless organize-and-reflect environment that places little, if any, cognitive load on the consumer. We propose new approaches for determining attributes, which we call Extractable Mobile Photo Tags (EMPT), for processing and organizing photos and videos on mobile phones. We present approaches to populate EMPT and to use it in applications.

Author Keywords
Personal Photo Organization / Management, Contextual Information, Extractable Mobile Photo Tags.

ACM Classification Keywords
I.2.4 [ARTIFICIAL INTELLIGENCE]: Knowledge Representation Formalisms and Methods - Semantic Networks; H.5.1 [INFORMATION INTERFACES AND PRESENTATION]: Multimedia Information Systems - Evaluation/methodology.

General Terms
Algorithms, Management, Experimentation, Human Factors.

INTRODUCTION
Photo management as recently as ten years ago was a very different problem than it is today; our techniques, however, are still from that old world. At one time, cameras were used to take photographs that represented intensity values at a point on film or at a pixel on a CCD array. The emergence of digital cameras, particularly those in smartphones, has radically changed the nature of photography and people's photo-taking habits. The changes have come along two dimensions. First, because capturing photos is easy and carries essentially no cost, people take many photos and store them. Second, these devices capture a host of contextual information, commonly called metadata, such as time, location, camera parameters, and voice tags, along with the media itself. Several sources feed different kinds of contextual information to a phone, ranging from location information to the calendar, contacts, and information in the cloud, as shown in Figure 1. When a photo is taken with such a camera, the system can effortlessly add advanced meta-information, which we call Extractable Mobile Photo Tags (EMPT), to the photo header.

Figure 1: Data sources from phone and Web are used to populate EMPT fields.

In fact, a photo is no longer just an array of intensity values; it is experiential data associated with the event during which the photo was captured. Cameras record significant metadata with each photo, and much more can be inferred from other sources. This metadata is far less noisy than the human-supplied tags found on online image-sharing platforms. The challenge, however, lies in correctly interpreting this multi-modal data to determine useful attributes for organizing and annotating photos for easy retrieval, reliving, and reflection.
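The time, location, and camera-parameter metadata referred to above already travels in the photo's EXIF header. As a rough, non-authoritative sketch of how such contextual data can be pulled out (using the Pillow library; "photo.jpg" is a placeholder path, and the helper is our own illustration rather than anything described in the paper):

# Sketch: read basic EMPT ingredients (capture time, GPS position) from a JPEG EXIF header.
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def read_capture_context(path):
    exif = Image.open(path)._getexif() or {}
    named = {TAGS.get(tag, tag): value for tag, value in exif.items()}

    # Capture time as written by the camera, e.g. "2011:09:18 14:03:22".
    taken_at = named.get("DateTimeOriginal")

    # GPS block: degrees/minutes/seconds plus hemisphere references.
    gps = {GPSTAGS.get(tag, tag): value
           for tag, value in named.get("GPSInfo", {}).items()}

    def to_degrees(dms, ref):
        if not dms:
            return None
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg

    lat = to_degrees(gps.get("GPSLatitude"), gps.get("GPSLatitudeRef"))
    lon = to_degrees(gps.get("GPSLongitude"), gps.get("GPSLongitudeRef"))
    return taken_at, (lat, lon)

print(read_capture_context("photo.jpg"))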
In this paper, we discuss multiple approaches being developed in our lab to populate EMPT for a photo. We also present our efforts to develop a summarization approach that shows a summary of photos on the phone, so that visualizing and managing a large number of photos can become an enjoyable task.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MLBS'11, September 18, 2011, Beijing, China. Copyright 2011 ACM 978-1-4503-0928-8/11/09...$10.00.

EXTRACTABLE MOBILE PHOTO TAGS
For photos captured by smartphones such as Android devices, several semantic fields can be added to each photo. These fields can be populated by background processing at capture time. We have started identifying such fields, listed below, based on their availability as well as their efficacy in photo management and search.

Photographer Name: Obtained from the ownership of the camera.
People in Photos: Several fields will be added here: the number and names of people, and whether the photo is a portrait or a crowd shot.
Place: In addition to latitude-longitude-altitude, the type of place and the name of the place.
Event: Using the calendar and event detection techniques, the system will detect and store the type, name, and other details of the event.
Environment: Based on location and time, the system will determine and store environmental conditions such as the weather class (cloudy, rain, ...) and ambient noise (loud, ...).
Objects/Things: Using computer vision techniques combined with strong context, determine objects such as pets and other favorite things in photos.
Scene Concepts: Adapt current computer vision scene concepts [5] (beach, outdoor, city, mountains, ...), develop context-guided techniques for learning these concepts in photos, and use them as enumerated concepts.
Time: In addition to clock and calendar time, one should also consider storing personal dates (birthdays, anniversaries, ...), local festivals (Chinese New Year, Deepavali, Easter, ...), and other significant symbolic time indicators.

For fields whose value cannot be inferred reliably, 'UNKNOWN' is stored so that the user may supply the value if desired. In the following, we briefly discuss research projects that populate EMPT and use it for various applications.

RECOGNITION OF EVENTS AND SUBEVENT STRUCTURE
Given photos with EXIF metadata for an event, we partition them into its subevents. We use a domain event ontology corresponding to the type of the event, instantiate the domain ontology using the information available for the event (i.e., time, location, participating people), and augment the ontology instance with all available information related to the context of the subevents. Domain ontologies are formal conceptual models at the "semantic" level that are independent of lower-level data models [6]. High-level semantics (e.g., the event that a photo covers) are linguistic descriptions, and a linguistic description is almost always contextual [7]. By augmentation, we mean associating values with an individual-event context. This instantiated contextual model, called the R-Ontology, is shown in Figure 2. The R-Ontology is then used to partition the given photos.
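Reading the EMPT field list above literally, each photo would carry a record of roughly this shape, with the 'UNKNOWN' convention as the default for anything that cannot be inferred reliably. The class below is only an illustrative sketch with our own field names, not an interface defined in the paper:

# Illustrative EMPT record; field names follow the list above,
# defaults use the 'UNKNOWN' convention for values that cannot be inferred.
from dataclasses import dataclass, field
from typing import List

UNKNOWN = "UNKNOWN"

@dataclass
class EMPT:
    photographer_name: str = UNKNOWN                         # from camera ownership
    people_count: str = UNKNOWN                              # number of people, portrait or crowd
    people_names: List[str] = field(default_factory=list)
    place_latlongalt: str = UNKNOWN
    place_type: str = UNKNOWN
    place_name: str = UNKNOWN
    event_type: str = UNKNOWN                                # from calendar + event detection
    event_name: str = UNKNOWN
    weather_class: str = UNKNOWN                             # e.g. Cloudy, Rain
    ambient_noise: str = UNKNOWN                             # e.g. Loud
    objects: List[str] = field(default_factory=list)         # pets, favorite things
    scene_concepts: List[str] = field(default_factory=list)  # beach, city, mountains, ...
    symbolic_time: str = UNKNOWN                             # birthdays, festivals, ...
    clock_time: str = UNKNOWN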
The corresponding domain event ontology describes the underlying event and its subevents in terms of their parthood relationship (i.e., subevent-of), relative temporal relationships (i.e., previous-event, next-event, started-by-event, finished-by-event), environment, scene concepts, and objects/things. The R-Ontology is derived from such an event model. The relationships in the R-Ontology are described with properties such as negation, transitivity, etc.; subevent-of, next-event, and previous-event are examples of relationships with the transitive property. In addition, we apply inference rules; for instance, consider the following case described in the domain ontology:

previous-event(e1, e3), next-event(e1, e2)
→ started-by(e1, e3.end), finished-by(e1, e2.start)

Figure 2: R-Ontology

In the first line, e1 is an event that is temporally bounded by the time intervals of events e2 and e3; next-event and previous-event indicate the event immediately after and immediately before e1, respectively. The predicates in the second line are inferred from those in the first line, which appear in the corresponding domain ontology. This shows that our framework supports temporal proximity by translating relative temporal relations into absolute relations in the R-Ontology.

Partitioning is conducted by a function f as follows:

f : (P, e, O) → H    (1)
s.t. H = ∪i (Pi, {sej}), where Pi ⊆ P, sej ∈ R-Ontology, and ∀ Pi: associate(Pi, {sej})    (1.1)
f = f_extract * f_cluster * f_match    (2)

Figure 3: General Architecture

In (1), P is the set of photos given for event e, and O is a lookup table with two fields, name and path, for the event domain ontologies available in our framework. The lookup table is searched by the name of the ontology (matching the type of e) to obtain its URI. Function f, which is a composite function (2), associates each Pi (a subset of P) with a set of subevents {sej}; if this set is empty at the end of f, the corresponding field in the image is set to UNKNOWN, meaning that no subevent in the underlying R-Ontology covers subset Pi. The final result is the set of pairs H indicated in (1.1).

Each Pi is initially represented as a tuple of fields indicating that it belongs to a hierarchical structure. The hierarchical structure is generated by running agglomerative spatiotemporal clustering (f_cluster) on P, using the time and location from the EXIF metadata, such that no direct children of a Pi overlap one another. Each cluster is then matched to one or more subevents according to their absolute time and location, when available. Finally, f_match selects, by applying a set of constraints, a subset of the images in the clusters matched to a subevent. These constraints are based on the environment, scene-concept, and object/thing properties of the subevent; the corresponding photo properties used in constraint matching are obtained by f_extract before f_match begins. Figure 3 shows the general architecture of this framework, and Figure 4 shows some results for a vacation trip.

Given all the data that is available on mobile phones, we believe this approach can be very effective. For mobile smartphones, we have begun extending the R-Ontology idea to real-time scenarios. The information detected via the R-Ontology includes the actual event that covers a particular photo; this event may belong to an arbitrary level of the subevent structure associated with the high-level event. All other context information comes with the actual event.
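Read procedurally, the composite function in (2) clusters the photos spatiotemporally, extracts per-photo properties, and matches clusters against the subevents and constraints in the R-Ontology. The sketch below follows that reading only; the bodies of the three stages (hour-based bucketing, precomputed properties, equality constraints) are stand-in assumptions, not the algorithms used in the paper:

# Sketch of the partitioning pipeline f = f_extract * f_cluster * f_match from (1)-(2).
# Photos are dicts with "time" (a datetime) and "properties"; subevents carry
# "name", "start", "end", and "constraints". All stage bodies are placeholders.
UNKNOWN = "UNKNOWN"

def f_cluster(photos):
    """Stand-in for agglomerative spatiotemporal clustering: bucket photos by hour."""
    clusters = {}
    for p in photos:
        clusters.setdefault(p["time"].hour, []).append(p)
    return list(clusters.values())

def f_extract(photo):
    """Stand-in for property extraction: assume properties were precomputed."""
    return photo.get("properties", {})

def satisfies(props, subevent):
    """Check the subevent's environment/scene/object constraints (equality only)."""
    return all(props.get(k) == v for k, v in subevent["constraints"].items())

def f_match(cluster, r_ontology):
    """Match a cluster to subevents by absolute time, then filter photos by constraints."""
    matched = {}
    for sub in r_ontology["subevents"]:
        if any(sub["start"] <= p["time"] <= sub["end"] for p in cluster):
            kept = [p for p in cluster if satisfies(f_extract(p), sub)]
            if kept:
                matched[sub["name"]] = kept
    return matched

def partition(photos, r_ontology):
    """f : (P, e, O) -> H, pairing each photo cluster with its set of subevent names."""
    H = []
    for cluster in f_cluster(photos):
        assoc = f_match(cluster, r_ontology)
        H.append((cluster, set(assoc) or {UNKNOWN}))
    return H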
Figure 4: (a) Top three rows: Shopping; (b) bottom two rows: Talk.

WONDER WHAT
Motivation
Life events such as exhibitions, live music, festivals, tours, nightlife, sports, and various community activities constitute important parts of our everyday lives. Correctly recognizing such events allows us to draw proper inferences about the people and objects related to them. Event recognition by mobile visual search has a great number of applications. For example, travelers can get immediate information about local events: with a mobile visual search tool, they need only point their cameras at an event to query for and receive relevant event information. Events detected from primary spatio-temporal context and content in turn provide a secondary context for recognizing the objects and people involved in the events. For instance, at a conference, to get information about the current speaker, you can simply take a picture of the presentation. Given the photo, visual search is applied to determine the event, based on which the names and faces of relevant people can be detected, compared, ranked, and returned to you. Moreover, with the increasing amount of online social information, such as friend relationships and social event announcements on Facebook, this technique can be employed to recognize social events and friends as well.

We discuss a system that provides users with information about the public events they are attending by analyzing, in real time, photos taken at the event. Whenever a user wants to know about an event she is currently at, she only needs to take a picture of it. By examining the photo content together with the spatial and temporal data carried with it, our system automatically returns a ranked list of events with which the photo may be associated. Our approach has the following advantages: 1) use of the system is very intuitive and requires no special effort; 2) the system keeps a dedicated event database and index and automatically constructs queries for users, which enables the delivery of exact event information; 3) the system not only detects planned events but also tries to discover concurrent events by analyzing real-time micro-blogs; 4) different types of events reveal distinct visual characteristics, so visual content is also taken into account to improve search results. To the best of our knowledge, no previous work has addressed a similar problem.

Problem
We now formulate the problem that serves as the basis for the following discussion.

Contextual Photo
A contextual photo is represented as a triple p = (img, time, location), where img is the image content of the photo p, time is the timestamp at which the photo was taken, and location = (latitude, longitude) is the geo-coordinate of the shooting location. In our problem, time and location jointly identify the unique spatio-temporal context in which the photo was created. A contextual photo p is an input to our system.

Event
We follow the proposal in [1] and denote an event as a tuple e = (time, location, title, description, type, media). Here time = (start, end) is the time interval during which e occurs, and location = (lat1, long1, lat2, long2) gives the geo-coordinates of the southwest and northeast corners of the place where e takes place. The name of event e is stored as a string in title, and a textual explanation of e is saved in description. The event type, indicating the class and genre of the event (such as performance, exhibition, sports, or politics), is stored in type, and media data associated with some events, such as posters, photos, and videos, is kept in media.
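These two definitions map directly onto record types. A minimal sketch follows, with field names taken from the definitions above; the concrete Python types (datetime, floats, byte strings) are our assumptions:

# Sketch of the contextual-photo triple and the event tuple defined above.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

@dataclass
class ContextualPhoto:
    img: bytes                                   # image content
    time: datetime                               # timestamp when the photo was taken
    location: Tuple[float, float]                # (latitude, longitude) of the shooting location

@dataclass
class Event:
    time: Tuple[datetime, datetime]              # (start, end) interval of the event
    location: Tuple[float, float, float, float]  # (lat1, long1, lat2, long2): SW and NE corners
    title: str                                   # event name
    description: str                             # textual explanation
    type: str                                    # class/genre: performance, exhibition, sports, ...
    media: List[str]                             # posters, photos, videos tied to the event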
Problem Formulation
Given a contextual photo p, an event ranking function H is represented as

H : p × E → R,

where E = {e1, ..., en} is the event space, each ei ∈ E is an event as defined above, and R is the space of ranking values. The value of H(p, ei) represents the likelihood that ei is the event at which photo p was taken. Given an input contextual photo p and an event ranking function H, the event recognition problem is to return an ordered list of events (ei1, ..., ein) in which H(p, ei1) ≥ ... ≥ H(p, ein). In this work, we consider spatial, temporal, and visual features in the event ranking function H. Each feature is itself a ranking function h : p × E → R, and the final ranking value is computed as H(p, ei) = Σ_{t=1}^{3} w_t h_t(p, ei), where w_t is the weight associated with feature h_t. We explain the details of these features and their combination in the following discussion.

Implementation
Figure 5: Architecture of the WonderWhat system.

We present the system architecture in Figure 5. Our system consists of the following major steps.
1. We create an event database and ingest both planned and emergent events into it. Planned events, which are usually pre-declared online, are extracted from web pages or downloaded via web services that perform event integration. Emergent events, whose occurrence is impromptu, are detected from Twitter.
2. After a user takes a photo of an event, her device creates a contextual photo containing the image content, time, and location information. The device then sends this contextual photo to our system as a query.
3. Given the time and location information in the contextual photo, a query is issued to the event database, which returns a list of related events.
4. The content analysis component then analyzes the image content and returns the event type of the event captured in the photo. In this work, we model the relationship between event types and raw visual features through a middle layer of visual concepts. We employ a learning-based approach consisting of four steps: 1) train concept detectors; 2) detect concepts in photos associated with different event types; 3) train an event type detector; 4) for each incoming photo, decide from these models which type of event the photo most likely shows.
5. Both the event list from the event database and the detected event type are given to the ranking component, which considers spatial, temporal, and visual distances in the final ranking.
6. Finally, a ranked list of events and their associated information is returned to the user and presented on her device.

Experiments
We conducted experiments on both a Flickr dataset and a real event photo set shot in New York City.

Flickr Dataset
In this experiment, we verify the hypothesis that people do take photos at events and that, by using the shooting time and location of the photos, we can match them to the corresponding events. We built an event database for events in NYC from 2008 to March 2011. We also called the Flickr API, downloaded all the photos shot from 2008 until March 2011, and matched these photos to the events in the event database. Figure 6 shows examples of matched events and photos; the left column details the events and the URLs from which they were extracted, and the right column lists the photos taken at each event.

Figure 6: Examples of matched events and photos.
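Matching photos to events by shooting time and location, as in the experiment above, is essentially the spatial and temporal part of the weighted combination H(p, ei) = Σ_{t=1}^{3} w_t h_t(p, ei) from the Problem Formulation. The sketch below shows that combination using the ContextualPhoto and Event records sketched earlier; the individual feature functions and the weights are stand-in assumptions, since the actual h_t and w_t are not specified here:

# Sketch of H(p, e_i) = sum_{t=1..3} w_t * h_t(p, e_i) with stand-in feature functions.
import math

def h_spatial(photo, event):
    # Stand-in: score decays with distance from the center of the event's bounding box.
    lat1, lon1, lat2, lon2 = event.location
    center = ((lat1 + lat2) / 2, (lon1 + lon2) / 2)
    d = math.hypot(photo.location[0] - center[0], photo.location[1] - center[1])
    return 1.0 / (1.0 + d)

def h_temporal(photo, event):
    # Stand-in: 1 inside the event interval, decaying with the gap (in hours) outside it.
    start, end = event.time
    if start <= photo.time <= end:
        return 1.0
    gap = min(abs((photo.time - start).total_seconds()),
              abs((photo.time - end).total_seconds())) / 3600.0
    return 1.0 / (1.0 + gap)

def h_visual(photo, event, detected_type):
    # Stand-in: reward agreement between the detected event type and the candidate's type.
    return 1.0 if detected_type == event.type else 0.0

def rank_events(photo, events, detected_type, weights=(0.4, 0.4, 0.2)):
    """Return candidate events ordered by decreasing H(p, e_i)."""
    def H(event):
        return (weights[0] * h_spatial(photo, event)
                + weights[1] * h_temporal(photo, event)
                + weights[2] * h_visual(photo, event, detected_type))
    return sorted(events, key=H, reverse=True)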
Real Photo Set
In this section, we test on real photo sets collected from four volunteers living in NYC. We asked each volunteer to walk around the streets of NYC during their spare time in August and September 2010 and to attend some of the events they discovered. They were advised to take as many pictures as possible at each event, and there were no requirements on the subjects of these photos. The ranking results are shown in Figure 7: the photo column shows a sample of pictures from each photo set, the result column lists the top-5 ranking result for most pictures in the set, and the last column provides the ground truth for these events. For events 1, 3, and 4, we correctly returned the information for the corresponding event in first place in the ranked list. For event 2, since the exact event was not stored in our database, our system returned a musical event at the Mitzi Newhouse Theater of Lincoln Center, which was a very close match.

Figure 7: Results on the real photo set.

PHOTO COLLECTION SUMMARIZATION
Manually sifting through a large collection of personal photos to create summaries is not only tedious and inefficient but also tests human patience. On mobile phones, additional constraints make managing a large number of photos a real challenge. First, screen real estate is limited. Second, photos are mostly shot to share with others, and the network bandwidth needed to share rich media from a mobile phone is also a serious constraint. Hence the need for a system to