Extractable Mobile Photo Tags
Ramesh Jain, Mingyan Gao, Setareh Rafatirad, Pinaki Sinha
Department of Computer Science, University of California, Irvine
Irvine, CA 92697
{jain, gaom, srafatir, psinha}@ics.uci.edu
ABSTRACT
Mobile phones are causing a major shift in how people shoot photos. Just a little more than a decade ago, consumer behavior was plan-shoot-process-share-organize-reflect; the rapid proliferation of mobile phone cameras has resulted in shoot-share-forget behavior. This trend will be replaced soon because photos are more important than that: people treasure their memories in visual form. Fortunately, a plethora of sensors, combined with access to the powerful Web, may enable an effortless organize-and-reflect environment with little, if any, cognitive load on the consumer. We propose new approaches for determining attributes that we call Extractable Mobile Photo Tags (EMPT) for processing and organizing photos and videos on mobile phones. We present approaches to populate EMPT and to use it in applications.
Author Keywords
Personal Photo Organization / Management, Contextual
Information, Extractable Mobile Photo Tags.
ACM Classification Keywords
I.2.4 [ARTIFICIAL INTELLIGENCE]: Knowledge
Representation Formalisms and Methods – Semantic
Networks;
H.5.1 [INFORMATION INTERFACES AND
PRESENTATION]: Multimedia Information Systems -
Evaluation/methodology.
General Terms
Algorithms, Management, Experimentation, Human Factors.
INTRODUCTION
Photo management as recently as 10 years ago was a very
different problem than it is today. Our techniques are still
from the old world, however. At one time cameras were
used to take photographs that represented intensity values at
a point on the film, or at a pixel on a CCD array.
Emergence of digital cameras, particularly those in smart
phones, has radically changed the nature of photography
and the photo-taking habits of people. These changes have come along two dimensions. First, given the ease of capturing photos and the negligible cost associated with them, people take and store many photos. Second,
these devices capture a host of contextual information,
commonly called metadata, like time, location, camera
parameters, and voice tags along with the media itself.
There are several sources that feed different kinds of contextual information to a phone, ranging from location to calendar, contacts, and information in the cloud, as shown in Figure 1. When a photo is taken with such a camera, the system can effortlessly add advanced meta-information, which we call Extractable Mobile Photo Tags (EMPT), to the photo header.
Figure 1: Data sources from phone and Web are used to
populate EMPT fields.
In fact, a photo is no longer just an array of intensity values;
it is experiential data associated with an event during which
the photo is captured. Cameras capture significant
metadata associated with photos and much more can be
inferred from other sources. This metadata is much less noisy than the human-induced tags on online image-sharing platforms. However, the challenge lies in correctly interpreting this multi-modal data to determine useful attributes for organizing and annotating photos for easy retrieval, reliving, and reflection. In this paper we discuss multiple approaches being developed in our lab to populate EMPT for a photo. We also present our efforts to develop a summarization approach for showing summaries of photos
on the phones so that the visualization and management of a
large number of photos could become an enjoyable task.
EXTRACTABLE MOBILE PHOTO TAGS
For photos captured by smartphones such as Android phones, it is possible to add several semantic fields to each photo. These semantic fields could be added when the photo is captured, using background processing for each photo. We started identifying such fields, listed below, based on their availability as well as their efficacy in photo management and search.
Photographer Name: This information will be obtained
from the ownership of the camera.
People in Photos: Several fields will be added here: number and names of people; portrait photo or crowd.
Place: In addition to latitude, longitude, and altitude: type of place; name of place.
Event: Using calendar and event detection techniques, the system will detect and store the type, name, and other details of the event.
Environment: Based on location and time, the system will determine and store environmental conditions such as weather class (cloudy, rainy, ...), ambient noise (loud, ...), etc.
Objects/Things: Using computer vision techniques combined with strong context, the system will determine objects such as pets and other favorite things in photos.
Scene Concepts: Adapt current computer vision scene concepts [5] (beach, outdoor, city, mountains, ...), develop context-guided techniques for learning these concepts in photos, and use them as enumerated concepts.
Time: In addition to the clock and date time, one should also consider storing personal dates (birthdays, anniversaries, ...), local festivals (Chinese New Year, Deepavali, Easter, ...), and other significant symbolic time indicators.
For fields whose values cannot be inferred reliably, 'UNKNOWN' will be stored so that the user may supply the value if desired.
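To make the field list concrete, the sketch below shows one way an EMPT record might be represented in code. The field names follow the list above, but the exact schema, types, and the Python representation are our own illustrative assumptions, not a prescribed format.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

UNKNOWN = "UNKNOWN"  # sentinel stored when a field cannot be inferred reliably

@dataclass
class EMPTRecord:
    # Hypothetical EMPT record attached to a photo header; field names follow the list above.
    photographer: str = UNKNOWN                 # from ownership of the camera
    people_count: Optional[int] = None          # number of people in the photo
    people_names: List[str] = field(default_factory=list)
    portrait_or_crowd: str = UNKNOWN
    place_lat_long_alt: Optional[Tuple[float, float, float]] = None
    place_type: str = UNKNOWN
    place_name: str = UNKNOWN
    event_type: str = UNKNOWN                   # from calendar and event detection
    event_name: str = UNKNOWN
    weather_class: str = UNKNOWN                # e.g. cloudy, rain
    ambient_noise: str = UNKNOWN                # e.g. loud
    objects: List[str] = field(default_factory=list)        # pets and other favorite things
    scene_concepts: List[str] = field(default_factory=list) # beach, outdoor, city, mountains, ...
    symbolic_times: List[str] = field(default_factory=list) # birthdays, local festivals, ...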
In the following, we will briefly discuss research projects to
populate EMPT and use it for various applications.
RECOGNITION OF EVENTS AND SUBEVENT STRUCTURE
Given photos with EXIF metadata for an event, we partition them into its subevents. We use a domain event ontology corresponding to the type of the event, instantiate the domain ontology using the information available for the event (i.e., time, location, participating people), and augment the ontology instance with all available information related to the context of the subevents. Domain ontologies are formal conceptual models at the "semantic" level that are independent of lower-level data models [6]. High-level semantics (e.g., the event that a photo covers) are linguistic descriptions, and a linguistic description is almost always contextual [7]. By augmentation, we mean associating values with an individual event's context. This
instantiated contextual model, called R-Ontology, is shown
in Figure 2. R-Ontology is then used to partition the given
photos.
The corresponding domain event ontology describes the
underlying event and its subevents in terms of their
parthood relationship (i.e. subevent-of), relative temporal
relationships (i.e. previous-event, next-event, started-by-
event, finished-by-event), environment, scene concepts, and
object/things.
R-Ontology is derived from such an event model. The relationships in R-Ontology are described with properties such as negation, transitivity, etc.; subevent-of, next-event, and previous-event are examples of relationships with the transitive property. In addition, we consider some inference rules; for instance, consider the following case described in the domain ontology:
previous-event(e1, e3), next-event(e1, e2) →
started-by(e1, e3.end), finished-by(e1, e2.start).
Figure 2: R-Ontology
In the first line, e1 is an event that is temporally bounded by the time intervals of events e3 and e2. The relationships previous-event and next-event indicate the event occurring immediately before and immediately after e1, respectively. In the second line, predicates are inferred from the first line, which contains the predicates stated in the corresponding domain ontology. This shows that our framework supports temporal proximity by translating relative temporal relations into absolute relations in R-Ontology.
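The rule can be read operationally as filling in absolute interval bounds from an event's relative neighbors. The following sketch, with assumed class and attribute names (EventInstance, previous_event, next_event), illustrates one possible way to apply it; it is not the actual implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EventInstance:
    # Hypothetical event node in an instantiated R-Ontology.
    name: str
    start: Optional[float] = None                      # absolute start time, when known
    end: Optional[float] = None                        # absolute end time, when known
    previous_event: Optional["EventInstance"] = None   # event occurring just before this one
    next_event: Optional["EventInstance"] = None       # event occurring just after this one

def apply_temporal_rule(e1: EventInstance) -> None:
    # previous-event(e1, e3), next-event(e1, e2) ->
    #     started-by(e1, e3.end), finished-by(e1, e2.start)
    e3, e2 = e1.previous_event, e1.next_event
    if e3 is not None and e3.end is not None and e1.start is None:
        e1.start = e3.end    # e1 is started by the end of the preceding event
    if e2 is not None and e2.start is not None and e1.end is None:
        e1.end = e2.start    # e1 is finished by the start of the following event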
Partitioning is conducted by function f as follows:
f : (P, e, O) → H,    (1)
s.t. H = ∪ (Pi, {sej}),    (1.1)
Pi ⊆ P, sej ∈ R-Ontology,
∀ Pi : associate(Pi, {sej});
f = fextract ∘ fcluster ∘ fmatch    (2)
Figure 3: General Architecture
In (1), P is the set of photos given for event e, and O is a lookup table with two fields, name and path, for the event domain ontologies available in our framework. The lookup table is searched by the name of the ontology (matching the type of e) to get its URI. Function f, which is a composite function (2), associates each Pi (a subset of P) with a set of subevents ({sej}); if this set is empty at the end of f, the corresponding field in the image is set to UNKNOWN, meaning that no subevent in the underlying R-Ontology covers subset Pi. The final result is the set of pairs H indicated in (1.1).
Each Pi is initially represented as a tuple whose fields indicate that it belongs to a hierarchical structure. The hierarchical structure is generated by running an agglomerative spatiotemporal clustering (fcluster) on P, using time and location from the EXIF metadata, such that none of the direct children of Pi overlap one another; each cluster is then matched to one or more subevents according to their absolute time and location, when available.
Finally, fmatch selects, by applying a set of constraints, a subset of images from the clusters matched to a subevent. These constraints are based on the environment, scene concept, and object/things properties corresponding to the subevent. The corresponding properties of the pictures used in constraint matching are obtained by fextract before fmatch begins.
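A rough sketch of the composite function in (2) is given below. The helper names and the simple time-gap clustering stand in for the real agglomerative spatiotemporal clustering and constraint matching; they are illustrative assumptions only.

from typing import Dict, List, Tuple

Photo = Dict      # a photo record with EXIF metadata, e.g. {"time": ..., "lat": ..., "lon": ...}
Subevent = str    # a subevent identifier taken from the R-Ontology

def f_extract(photos: List[Photo]) -> List[Photo]:
    # Placeholder: attach environment, scene-concept, and object properties used later
    # for constraint matching; in practice these come from detectors and context sources.
    return photos

def f_cluster(photos: List[Photo], gap_seconds: float = 3600.0) -> List[List[Photo]]:
    # Very simplified stand-in for agglomerative spatiotemporal clustering: split the
    # time-ordered stream wherever the gap between consecutive photos exceeds a threshold.
    ordered = sorted(photos, key=lambda p: p["time"])
    clusters, current = [], []
    for p in ordered:
        if current and p["time"] - current[-1]["time"] > gap_seconds:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters

def f_match(cluster: List[Photo], r_ontology: Dict[Subevent, Dict]) -> List[Subevent]:
    # Return the subevents whose absolute time interval covers the cluster, when known.
    start, end = cluster[0]["time"], cluster[-1]["time"]
    return [se for se, props in r_ontology.items()
            if props.get("start") is not None and props.get("end") is not None
            and props["start"] <= start and end <= props["end"]]

def f(photos: List[Photo], r_ontology: Dict[Subevent, Dict]) -> List[Tuple[List[Photo], List[Subevent]]]:
    # H is the union of (Pi, {sej}) pairs; an empty subevent set corresponds to UNKNOWN.
    enriched = f_extract(photos)
    return [(c, f_match(c, r_ontology)) for c in f_cluster(enriched)]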
Figure 3 shows the general architecture of this framework. Figure 4 shows some results for a vacation trip. Given all the data that is available on mobile phones, we believe that this approach can be very effective.
For mobile smartphones, we have begun extending the idea of using R-Ontology to a real-time scenario. The information detected via R-Ontology includes the actual event that covers a particular photo; this event may belong to an arbitrary level of the subevent structure associated with the high-level event. All other context information comes with the actual event.
Figure 4. (a) Top three rows: Shopping; (b) Bottom two rows: Talk.
WONDER WHAT
Motivation
Life events, such as exhibitions, live music, festivals, tours, nightlife, sports, and various community activities, constitute important parts of our everyday life. Correctly recognizing such events allows us to draw proper inferences about the people and objects related to the events. Event recognition by mobile visual search has a great number of applications. For example, it is useful for travelers to get immediate information about local events: using a mobile visual search tool, travelers need only point their cameras at an event to query and receive relevant event information.
Events detected based on primary spatio-temporal context and content in turn provide a secondary context for recognizing the objects and people involved in the events. For instance, at a conference, to get information about the current speaker, you can simply take a picture of the presentation. Given the photo, visual search is applied to determine the event, based on which the names and faces of relevant people can be detected, compared, ranked, and returned to you. Moreover, with the increasing amount of online social information, such as friend relationships and
social event announcements on Facebook, this technique
can be employed to recognize social events and friends as
well.
We discuss a system that provides users with information about the public events they are attending by analyzing, in real time, the photos they take at the event. Whenever a user wants to know about an event she is currently at, she only needs to take a picture of it. By examining the photo content together with the spatial and temporal data carried with it, our system automatically returns a ranked list of events with which the photo may be associated. Our approach has the following advantages. 1) Use of the system is very intuitive and requires no special effort; 2) The system
keeps a dedicated event database and index, and
automatically constructs queries for users, which enables
the delivery of exact event information; 3) Our system not
only detects planned events, but also tries to discover
concurrent events by analyzing real-time micro-blogs; 4)
Different types of events do reveal distinct visual
characteristics, so visual content is also taken into account
to improve search results. As far as we know, there is no
previous work that has addressed a similar problem.
Problem
We formulate the problem that serves as the basis for the following discussion.
Contextual Photo
A contextual photo is represented as a triple p = (img, time, location), where img is the image content of the photo p, time is the timestamp when the photo was taken, and location = (latitude, longitude) corresponds to the geo-coordinates of the photo shooting location. In our problem, time and location jointly identify a unique spatio-temporal context under which the photo was created. A contextual photo p is an input to our system.
Event
We follow the proposal in [1], and denote an event as a
tuple e = (time, location, title, description, type, media).
time = (start, end) is the time interval during which e
occurs. location = (lat1, long1, lat2, long2) represents the
geo-coordinates of the southwest and the northeast corner
of the place where e takes place. The name of event e is stored as a string in title, and the textual explanation of e is saved in description. The event type, indicating the class and genre of the event, such as performance, exhibition, sports, or politics, is stored in type. Media data associated with some events, such as posters, photos, and videos, is kept in media.
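For concreteness, the two definitions above can be written as simple data structures. The Python classes below (ContextualPhoto, Event) are an illustrative sketch of these tuples, not the system's actual data model.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

@dataclass
class ContextualPhoto:
    # p = (img, time, location) as defined above.
    img: bytes                        # raw image content
    time: datetime                    # capture timestamp
    location: Tuple[float, float]     # (latitude, longitude)

@dataclass
class Event:
    # e = (time, location, title, description, type, media), following [1].
    time: Tuple[datetime, datetime]                  # (start, end) of the event
    location: Tuple[float, float, float, float]      # (lat1, long1, lat2, long2): SW and NE corners
    title: str
    description: str
    type: str                                        # e.g. performance, exhibition, sports, politics
    media: List[str] = field(default_factory=list)   # posters, photos, videos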
Problem Formulation
Given a contextual photo p, an event ranking function H is represented as

H : p × E → R,

where the set E = {e1, ..., en} is the event space, each ei ∈ E is an event as defined above, and R is the event ranking value space. The value of H(p, ei) represents the likelihood that ei is the event at which the photo p was taken. Given an input contextual photo p and an event ranking function H, the event recognition problem is to return an ordered list of events (ei1, ..., ein) in which H(p, ei1) ≥ ... ≥ H(p, ein). In this work, we consider spatial, temporal, and visual features in the event ranking function H. Each feature is itself a ranking function h : p × E → R. The final ranking value is computed as

H(p, ei) = Σ_{t=1}^{3} wt ht(p, ei),

where wt is the weight associated with feature ht. We will explain the details of these features and their combination in the following discussions.
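A minimal sketch of this weighted combination is shown below, assuming the ContextualPhoto and Event structures sketched earlier; the feature functions (e.g. h_spatial, h_temporal, h_visual) are placeholders to be supplied by the caller.

from typing import Callable, List, Tuple

RankFeature = Callable[["ContextualPhoto", "Event"], float]  # one feature function h_t(p, e)

def rank_events(p: "ContextualPhoto",
                events: List["Event"],
                features: List[RankFeature],   # e.g. [h_spatial, h_temporal, h_visual]
                weights: List[float]) -> List[Tuple["Event", float]]:
    # Compute H(p, e_i) = sum_t w_t * h_t(p, e_i) and sort events by descending score.
    scored = [(e, sum(w * h(p, e) for w, h in zip(weights, features))) for e in events]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)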
Implementation
Figure 5. Architecture of WonderWhat system.
We present the system architecture in Figure 5. Our system
consists of the following major steps.
1. We create an event database, and ingest both planned and
emergent events into it. Planned events, which are usually
pre-declared online, are extracted from web pages or
downloaded via web services that perform event
integration. Emergent events, the occurrence of which is
impromptu, are detected from Twitter.
2. After a user takes a photo for an event, her device creates
a contextual photo containing the image content, time and
location information. The device then sends this contextual
photo to our system as a query.
3. First, given the time and location information in the
contextual photo, a query is issued to the event database,
which returns a list of related events.
4. Then the content analysis component analyzes the image
content and returns the event type of the event captured in
the photo. In this work, we model the relationship between
event types and the raw visual features through a middle
layer of visual concepts. We employed a learning-based approach to perform the analysis (sketched in code after this list), which consists of four major steps:
1) Train concept detector;
2) Detect concepts from photos associated with different
event types;
3) Train event type detector;
4) For each incoming photo, based on the models, decide which type of event the photo most likely depicts.
5. Both the event list from event database and the detected
event type are given to the ranking component. The
component considers spatial, temporal, and visual distances
in the final ranking process.
6. Finally, a ranked list of events and their associated
information are returned to the user and presented on her
device.
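The query-time flow of steps 3 to 6 can be sketched as follows, reusing the rank_events helper from the earlier sketch. The interfaces event_db.lookup, concept_detector.detect, and event_type_model.predict are assumed placeholders for the corresponding components, not actual APIs of our system.

from typing import List, Tuple

def _geo_gap(loc: Tuple[float, float], box: Tuple[float, float, float, float]) -> float:
    # Rough distance, in degrees, from a (lat, lon) point to the center of an event's bounding box.
    center_lat, center_lon = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    return ((loc[0] - center_lat) ** 2 + (loc[1] - center_lon) ** 2) ** 0.5

def answer_query(p, event_db, concept_detector, event_type_model, weights: List[float]):
    # Step 3: spatio-temporal query against the event database (assumed interface).
    candidates = event_db.lookup(p.time, p.location)

    # Step 4: content analysis through the visual-concept middle layer (assumed interfaces).
    concepts = concept_detector.detect(p.img)
    predicted_type = event_type_model.predict(concepts)

    # Step 5: weighted ranking over spatial, temporal, and visual features.
    h_spatial = lambda p, e: -_geo_gap(p.location, e.location)
    h_temporal = lambda p, e: -abs((p.time - e.time[0]).total_seconds())
    h_visual = lambda p, e: 1.0 if e.type == predicted_type else 0.0
    ranked = rank_events(p, candidates, [h_spatial, h_temporal, h_visual], weights)

    # Step 6: the ranked list of events is returned to the user's device.
    return ranked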
Experiments
We conducted experiments on both a Flickr dataset and a real event photo set shot in New York City.
Flickr Dataset
In this experiment, we verify the hypothesis that people do take photos at events, and that by making use of the capture time and location of the photos we are able to match them to the corresponding events. We built the event database for events in NYC from 2008 to March 2011. We also called the Flickr API and downloaded all the photos shot from 2008 to March 2011. We matched the photos to
the events in the event database. Figure 6 shows examples
of matched events and photos. The left column details the
events and the URLs where these events were extracted,
and the right column lists the photos taken at the events.
Figure 6. Examples of matched events and photos.
Real Photo Set
In this section, we test on real photo sets collected from four volunteers living in NYC. We asked each volunteer to walk around the streets of NYC during their spare time in August and September 2010 and to try attending events that they discovered. They were advised to take as many
pictures as possible at the event, and there were no
requirements on the subjects of these photos. The ranking
result is depicted in Figure 7. The photo column shows a
sample of pictures from each photo set, and the result
column lists the top 5 ranking results for most pictures in the photo set. The last column provides the ground truth for these events. For events 1, 3, and 4, we correctly returned the information of the corresponding events in the first position of the ranked lists. For event 2, since the exact event was not stored in our database, our system returned a musical event at the Mitzi Newhouse Theater of Lincoln Center, which was a very close match.
Figure 7. Results on real photo set.
PHOTO COLLECTION SUMMARIZATION
Manually sifting through a large collection of personal photos to create summaries is not only tedious and inefficient but also tests human patience. On mobile phones, additional constraints make managing a large number of photos a real challenge. First, the screen real estate is limited. Second, photos are mostly shot to be shared with others, and the availability of the network bandwidth needed to share rich media data from mobile phones is also a big challenge. Hence the need for a system to