Amazon Web Services
Cloud Architectures
Jinesh Varia
Technology Evangelist
Amazon Web Services
(jvaria@amazon.com)
Introduction
This paper illustrates the style of building applications using services available in the Internet cloud.
Cloud Architectures are designs of software applications that use Internet-accessible, on-demand services. Applications built
on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to
process a user request), draw the necessary resources on-demand (such as compute servers or storage), perform a specific job,
then relinquish the unneeded resources and often dispose of themselves after the job is done. While in operation, the
application scales up or down elastically based on resource needs.
This paper is divided into two sections. In the first section, we describe an example of an application that is currently in
production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do
pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on-demand, runs
a parallel computation on them using an open source distributed processing framework called Hadoop, then shuts down all
the virtual servers releasing all its resources back to the cloud—all with low programming effort and at a very reasonable
cost for the caller.
In the second section, we discuss some best practices for using each Amazon Web Service - Amazon S3, Amazon SQS,
Amazon SimpleDB and Amazon EC2 - to build an industrial-strength scalable application.
Keywords
Amazon Web Services, Amazon S3, Amazon EC2, Amazon SimpleDB, Amazon SQS, Hadoop, MapReduce, Cloud Computing
Why Cloud Architectures?
Cloud Architectures address key difficulties surrounding
large-scale data processing. First, in traditional data
processing it is difficult to get as many machines as an
application needs. Second, it is difficult to get the
machines when one needs them. Third, it is difficult to
distribute and coordinate a large-scale job across different
machines, run processes on them, and provision another
machine to recover if one machine fails. Fourth, it is
difficult to auto-scale up and down based on dynamic
workloads. Fifth, it is difficult to get rid of all those
machines when the job is done. Cloud Architectures solve
such difficulties.
Applications built on Cloud Architectures run in-the-cloud
where the physical location of the infrastructure is
determined by the provider. They take advantage of
simple APIs of Internet-accessible services that scale on-
demand and are industrial-strength, where the complex
reliability and scalability logic of the underlying services
remains implemented and hidden inside-the-cloud.
Resources in Cloud Architectures are used as needed,
sometimes ephemerally or seasonally, thereby providing
the highest utilization and optimum bang for the buck.
Business Benefits of Cloud Architectures
There are some clear business benefits to building
applications using Cloud Architectures. A few of these are
listed here:
1. Almost zero upfront infrastructure investment: If you
have to build a large-scale system it may cost a
fortune to invest in real estate, hardware (racks,
machines, routers, backup power supplies),
hardware management (power management,
cooling), and operations personnel. Because of the
upfront costs, it would typically need several rounds
of management approvals before the project could
even get started. Now, with utility-style computing,
there is no fixed cost or startup cost.
2. Just-in-time Infrastructure: In the past, if you became
famous and your systems or your infrastructure did
not scale, you became a victim of your own success.
Conversely, if you invested heavily and did not become
famous, you became a victim of your failure. By
deploying applications in-the-cloud with dynamic
capacity management, software architects do not
have to worry about pre-procuring capacity for large-
scale systems. The solutions are low risk because
you scale only as you grow. Cloud Architectures can
relinquish infrastructure as quickly as it was acquired
in the first place (in minutes).
3. More efficient resource utilization: System
administrators usually worry about procuring
hardware (when they run out of capacity) and about
improving infrastructure utilization (when they have
excess and idle capacity). With Cloud Architectures,
they can manage resources more effectively and
efficiently by having the applications request and
relinquish only the resources they need (on-demand).
4. Usage-based costing: Utility-style pricing allows
billing the customer only for the infrastructure that
has been used. The customer is not liable for the
entire infrastructure that may be in place. This is a
subtle difference between desktop applications and
web applications. A desktop application or a
traditional client-server application runs on the
customer's own infrastructure (PC or server),
whereas in an application built on Cloud Architectures,
the customer uses third-party infrastructure and gets
billed only for the fraction of it that was used.
5. Potential for shrinking the processing time:
Parallelization is one of the great ways to speed
up processing. If one compute-intensive or data-
intensive job that can be run in parallel takes 500
hours to process on one machine, then with Cloud
Architectures it is possible to spawn and launch 500
instances and process the same job in 1 hour. Having
an elastic infrastructure available provides the
application with the ability to exploit parallelization
in a cost-effective manner, reducing the total
processing time.
Examples of Cloud Architectures
There are plenty of examples of applications that could
utilize the power of Cloud Architectures. These range
from back-office bulk processing systems to web
applications. Some are listed below:
• Processing Pipelines
• Document processing pipelines – convert
hundreds of thousands of documents from
Microsoft Word to PDF, OCR millions of
pages/images into raw searchable text
• Image processing pipelines – create thumbnails
or low resolution variants of an image, resize
millions of images
• Video transcoding pipelines – transcode AVI to
MPEG movies
• Indexing – create an index of web crawl data
• Data mining – perform search over millions of
records
• Batch Processing Systems
• Back-office applications (in financial, insurance
or retail sectors)
• Log analysis – analyze and generate
daily/weekly reports
• Nightly builds – perform automated builds of
the source code repository every night in
parallel
• Automated Unit Testing and Deployment Testing
– perform automated unit testing (functional,
load, quality) against different deployment
configurations every night
• Websites
• Websites that “sleep” at night and auto-scale
during the day
• Instant Websites – websites for conferences or
events (Super Bowl, sports tournaments)
• Promotion websites
• “Seasonal Websites” - websites that only run
during the tax season or the holiday season
(“Black Friday” or Christmas)
In this paper, we will discuss one application example in
detail - code-named “GrepTheWeb”.
Cloud Architecture Example: GrepTheWeb
The Alexa Web Search web service allows developers to
build customized search engines against the massive
data that Alexa crawls every night. One of the features of
their web service allows users to query the Alexa search
index and get Million Search Results (MSR) back as
output. Developers can run queries that return up to 10
million results.
The resulting set, which represents a small subset of all
the documents on the web, can then be processed
further using a regular expression language. This allows
developers to filter their search results using criteria that
are not indexed by Alexa (Alexa indexes documents
based on fifty different document attributes) thereby
giving the developer power to do more sophisticated
searches. Developers can run regular expressions against
the actual documents, even when there are millions of
them, to search for patterns and retrieve the subset of
documents that matched that regular expression.
This application is currently in production at Amazon.com
and is code-named GrepTheWeb because it can “grep” (a
popular Unix command-line utility to search patterns) the
actual web documents. GrepTheWeb allows developers to
do some pretty specialized searches like selecting
documents that have a particular HTML tag or META tag
or finding documents with particular punctuations
(“Hey!”, he said. “Why Wait?”), or searching for
mathematical equations (“f(x) = ∑x + W”), source code,
e-mail addresses or other patterns such as
“(dis)integration of life”.
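To make this pattern-matching concrete, here is a minimal, self-contained sketch in Java of the kind of per-document grep such a service performs with java.util.regex. The e-mail pattern and the in-memory document map are illustrative only and are not taken from the GrepTheWeb implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class GrepExample {

    // Illustrative pattern only: select documents that contain an e-mail address.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the URLs of the documents whose content matches the given pattern. */
    static List<String> grep(Pattern pattern, Map<String, String> urlToContent) {
        List<String> matchedUrls = new ArrayList<>();
        for (Map.Entry<String, String> doc : urlToContent.entrySet()) {
            if (pattern.matcher(doc.getValue()).find()) {
                matchedUrls.add(doc.getKey());
            }
        }
        return matchedUrls;
    }
}
```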
While the functionality is impressive, for us the way it
was built is even more so. In the next section, we will
zoom in to see different levels of the architecture of
GrepTheWeb.
Figure 1 shows a high-level depiction of the architecture.
The output of the Million Search Results Service, a sorted
list of links gzipped (compressed using the Unix gzip
utility) into a single file, is given to GrepTheWeb as input.
GrepTheWeb takes a regular expression as a second input.
It then returns a filtered subset of document links, sorted
and gzipped into a single file. Since the overall process is
asynchronous, developers can get the status of their jobs
by calling GetStatus() to see whether the execution is
completed.
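The paper does not publish the actual service contract, so the following is only a hypothetical sketch, in Java, of the asynchronous interface implied above (StartGrep plus a polling GetStatus); the method signatures and return types are assumptions made for illustration.

```java
/**
 * Hypothetical sketch of the asynchronous GrepTheWeb interface described above.
 * The operation names StartGrep and GetStatus come from the paper; the Java
 * signatures and return types below are assumptions, not the real contract.
 */
public interface GrepTheWebService {

    /** Submits a job: the location of the gzipped list of document links plus a
     *  regular expression. Returns a job id that can be used to poll for status. */
    String startGrep(String inputDatasetLocation, String regEx);

    /** Polls whether the asynchronous execution of the given job is completed. */
    String getStatus(String jobId);

    /** Location of the filtered, sorted, gzipped output file once the job completes. */
    String getOutputLocation(String jobId);
}
```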
Running a regular expression against millions of
documents is not trivial. Different factors could combine
to cause the processing to take a lot of time:
• Regular expressions could be complex
• Dataset could be large, even hundreds of
terabytes
• Unknown request patterns, e.g., any number of
people can access the application at any given
point in time
Hence, the design goals of GrepTheWeb included scaling
in all dimensions (more powerful pattern-matching
languages, more concurrent users of common datasets,
larger datasets, better result qualities) while keeping the
costs of processing down.
The approach was to build an application that not only
scales with demand, but also does so without a heavy
upfront investment and without the cost of maintaining
idle machines (“downbottom”). To get a response in a
reasonable amount of time, it was important to distribute
the job into multiple tasks and to perform a Distributed
Grep operation that runs those tasks on multiple nodes in
parallel.
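As an illustration of what such a Distributed Grep task could look like, here is a minimal map-task sketch using the standard Hadoop MapReduce API. The record layout (URL, tab, document text) and the "grep.pattern" configuration key are assumptions made for this example, not details taken from the GrepTheWeb implementation.

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of a "distributed grep" map task in the spirit of GrepTheWeb.
 * Hadoop runs many of these mappers in parallel on the slave nodes.
 */
public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Pattern pattern;

    @Override
    protected void setup(Context context) {
        // The regular expression is passed in through the job configuration
        // ("grep.pattern" is an assumed property name, chosen for this example).
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is assumed to be "documentUrl <TAB> documentText".
        String[] fields = value.toString().split("\t", 2);
        if (fields.length == 2 && pattern.matcher(fields[1]).find()) {
            // Emit the URL of every document that matches the expression;
            // the reduce step can then combine and sort the matching URLs.
            context.write(new Text(fields[0]), new Text("matched"));
        }
    }
}
```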
Figure 1: GrepTheWeb Architecture - Zoom Level 1 (inputs: a list of document URLs and a RegEx; output: the subset of document URLs that matched the RegEx; status available via GetStatus)
Figure 2: GrepTheWeb Architecture - Zoom Level 2
Zooming in further, the GrepTheWeb architecture looks as
shown in Figure 2. It uses the following AWS components:
• Amazon S3 for retrieving input datasets and for
storing the output dataset
• Amazon SQS for durably buffering requests,
acting as a “glue” between controllers
• Amazon SimpleDB for storing intermediate
status, logs, and user data about tasks
• Amazon EC2 for running a large distributed
processing Hadoop cluster on-demand
• Hadoop for distributed processing, automatic
parallelization, and job scheduling
Workflow
GrepTheWeb is modular. It does its processing in four
phases, as shown in Figure 3. The launch phase is
responsible for validating and initiating the processing of
a GrepTheWeb request, instantiating Amazon EC2
instances, launching the Hadoop cluster on them and
starting all the job processes. The monitor phase is
responsible for monitoring the Amazon EC2 cluster and
the map and reduce tasks, and for checking for success
and failure. The shutdown phase is responsible for billing
and for shutting down all Hadoop processes and Amazon
EC2 instances, while the cleanup phase deletes the
transient Amazon SimpleDB data.
Figure 3: Phases of GrepTheWeb Architecture
Detailed Workflow for Figure 4:
1. On application start, queues are created if they do not already exist and all the controller threads are started. Each
controller thread starts polling its respective queue for messages.
2. When a StartGrep user request is received, a launch message is enqueued in the launch queue.
3. Launch phase: The launch controller thread picks up the launch message, executes the launch task, updates the
status and timestamps in the Amazon SimpleDB domain, enqueues a new message in the monitor queue and deletes
the message from the launch queue after processing (a minimal sketch of such a controller loop follows this list).
a. The launch task starts Amazon EC2 instances using a JRE pre-installed AMI, deploys the required Hadoop libraries
and starts a Hadoop job (runs the Map/Reduce tasks).
b. Hadoop runs map tasks on the Amazon EC2 slave nodes in parallel. Each map task takes files (multithreaded in
the background) from Amazon S3, runs the regular expression (passed as a queue message attribute) against
each file and writes the match results, along with a description of up to 5 matches, locally. The combine/reduce
task then combines and sorts the results and consolidates the output.
c. The final results are stored on Amazon S3 in the output bucket.
4. Monitor phase: The monitor controller thread picks up this message, validates the status/error in Amazon SimpleDB,
executes the monitor task, updates the status in the Amazon SimpleDB domain, enqueues new messages in the
shutdown queue and the billing queue, and deletes the message from the monitor queue after processing.
a. The monitor task checks the Hadoop status (JobTracker success/failure) at regular intervals and updates the
SimpleDB items with the status/error and the Amazon S3 output file.
5. Shutdown phase: The shutdown controller thread picks up this message from the shutdown queue, executes the
shutdown task, updates the status and timestamps in the Amazon SimpleDB domain, and deletes the message from the
shutdown queue after processing.
a. The shutdown task kills the Hadoop processes, terminates the Amazon EC2 instances after getting the EC2
topology information from Amazon SimpleDB, and disposes of the infrastructure.
b. The billing task gets the EC2 topology information, SimpleDB box usage, and Amazon S3 file and query input,
calculates the billing, and passes it to the billing service.
6. Cleanup phase: Archives the SimpleDB data with user info.
7. Users can execute GetStatus on the service endpoint to get the status of the overall system (all controllers and
Hadoop) and download the filtered results from Amazon S3 after completion.
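As referenced in step 3 above, the following is a minimal sketch of one launch-controller polling cycle using the AWS SDK for Java. The queue URLs, SimpleDB domain name, AMI id and instance count are placeholders; the real GrepTheWeb controller code is not published in this paper.

```java
import java.util.Arrays;
import java.util.List;

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

/** Sketch of a launch-controller polling cycle (steps 1-3 above). */
public class LaunchController {

    private static final String LAUNCH_QUEUE_URL  = "https://sqs.us-east-1.amazonaws.com/123456789012/launch";   // placeholder
    private static final String MONITOR_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/monitor";  // placeholder
    private static final String STATUS_DOMAIN     = "GrepTheWebJobs";   // placeholder SimpleDB domain
    private static final String HADOOP_AMI_ID     = "ami-00000000";     // placeholder JRE/Hadoop AMI

    private final AmazonSQS sqs      = AmazonSQSClientBuilder.defaultClient();
    private final AmazonEC2 ec2      = AmazonEC2ClientBuilder.defaultClient();
    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

    public void pollOnce() {
        // Step 1: poll the launch queue (the message body is assumed to be the job id).
        List<Message> messages = sqs.receiveMessage(LAUNCH_QUEUE_URL).getMessages();
        for (Message message : messages) {
            String jobId = message.getBody();

            // Step 3a: start the Hadoop cluster on Amazon EC2 (20 nodes is an arbitrary choice here).
            ec2.runInstances(new RunInstancesRequest()
                    .withImageId(HADOOP_AMI_ID)
                    .withMinCount(20)
                    .withMaxCount(20));

            // Step 3: record the launch status and timestamp in the Amazon SimpleDB job item.
            sdb.putAttributes(new PutAttributesRequest(STATUS_DOMAIN, jobId, Arrays.asList(
                    new ReplaceableAttribute("launch_status", "completed", true),
                    new ReplaceableAttribute("launch_starttime",
                            Long.toString(System.currentTimeMillis()), true))));

            // Step 3: hand the job over to the monitor controller and remove the processed message.
            sqs.sendMessage(MONITOR_QUEUE_URL, jobId);
            sqs.deleteMessage(LAUNCH_QUEUE_URL, message.getReceiptHandle());
        }
    }
}
```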
Figure 4: GrepTheWeb Architecture - Zoom Level 3 (launch, monitor, shutdown and billing controllers wired together through Amazon SQS launch, monitor, shutdown and billing queues; job status and Amazon EC2 cluster info stored in Amazon SimpleDB; input and output files on Amazon S3; a Hadoop cluster of one master and N slave nodes with HDFS on Amazon EC2)
The Use of Amazon Web Services
In the next four subsections, we present the rationale for
using each service and describe how GrepTheWeb uses it.
How Was Amazon S3 Used
In GrepTheWeb, Amazon S3 acts as both an input and an
output data store. The input to GrepTheWeb is the
web itself (a compressed form of Alexa's Web Crawl),
stored on Amazon S3 as objects and updated frequently.
Because the web crawl dataset can be huge (usually in
terabytes) and is always growing, there was a need for
distributed, bottomless, persistent storage. Amazon S3
proved to be a perfect fit.
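The snippet below is a small sketch of this input/output usage with the AWS SDK for Java; the bucket names and key layout are assumptions, not the actual GrepTheWeb configuration.

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

/** Sketch of a GrepTheWeb-style job reading its input from and writing its output to Amazon S3. */
public class S3DataStore {

    private static final String INPUT_BUCKET  = "alexa-web-crawl";    // placeholder bucket name
    private static final String OUTPUT_BUCKET = "greptheweb-output";  // placeholder bucket name

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /** Streams one gzipped input file (a list of document URLs) from Amazon S3. */
    public S3Object fetchInput(String key) {
        return s3.getObject(INPUT_BUCKET, key);
    }

    /** Uploads the gzipped, filtered result file for a job to the output bucket. */
    public void storeOutput(String jobId, File gzippedResults) {
        s3.putObject(OUTPUT_BUCKET, jobId + "/results.gz", gzippedResults);
    }
}
```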
How Was Amazon SQS Used
Amazon SQS was used as the message-passing mechanism
between components. It acts as the “glue” that wires the
different functional components together. This not only
helped in making the different components loosely
coupled, but also helped in building an overall more
failure-resilient system.
Buffer
If one component is receiving and processing requests
faster than other components (an unbalanced producer-
consumer situation), buffering will help make the overall
system more resilient to bursts of traffic (or load).
Amazon SQS acts as a transient buffer between two
components (controllers) of the GrepTheWeb system. If a
message were sent directly to a component, the receiver
would need to consume it at a rate dictated by the sender.
For example, if the billing system were slow or if the
launch of the Hadoop cluster took longer than expected,
the overall system would slow down, as it would just have
to wait. With message queues, sender and receiver are
decoupled and the queue service smooths out any
“spiky” message traffic.
Isolation
Interaction between any two controllers in GrepTheWeb
is through messages in the queue and no controller
directly calls any other controller. All communication and
interaction happens by storing messages in the queue
(en-queue) and retrieving messages from the queue (de-
queue). This makes the entire system loosely coupled
and the interfaces simple and clean. Amazon SQS
provided a uniform way of transferring information
between the different application components. Each
controller's function is to retrieve a message, process it
(execute the function) and store a new message in
another queue, while remaining completely isolated from
the other controllers.
Asynchrony
As it was difficult to know how much time each phase
would take to execute (e.g., the launch phase decides
dynamically how many instances to start based on the
request, so the execution time is unknown), Amazon SQS
helped in building an asynchronous system.
Now, if the launch phase takes more time to process or
the monitor phase fails, the other components of the
system are not affected and the overall system is more
stable and highly available.
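The buffering, isolation and asynchrony described above boil down to a simple enqueue/dequeue pattern. Below is a minimal sketch of that pattern with the AWS SDK for Java; the queue URL and message contents are placeholders.

```java
import java.util.List;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

/**
 * Decoupled producer/consumer pattern: the producing controller enqueues work
 * at its own pace; the consuming controller polls at its own pace and deletes
 * a message only after processing succeeded, so unprocessed work reappears.
 */
public class QueueGlue {

    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/monitor";  // placeholder

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    /** Producer side: hand a job over to the next phase and return immediately. */
    public void enqueue(String jobId) {
        sqs.sendMessage(QUEUE_URL, jobId);
    }

    /** Consumer side: process whatever is currently buffered in the queue. */
    public void dequeueAndProcess() {
        List<Message> messages = sqs.receiveMessage(QUEUE_URL).getMessages();
        for (Message message : messages) {
            if (process(message.getBody())) {
                // Delete only after successful processing; otherwise the message
                // becomes visible again after its visibility timeout and is retried.
                sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
            }
        }
    }

    private boolean process(String jobId) {
        // Phase-specific work would go here (launch, monitor, shutdown, ...).
        return true;
    }
}
```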
How Was Amazon SimpleDB Used
One use for a database in Cloud Architectures is to track
statuses. Since the components of the system are
asynchronous, there is a need to obtain the status of the
system at any given point in time. Moreover, since all
components are autonomous and discrete there is a need
for a query-able datastore that captures the state of the
system.
Because Amazon SimpleDB is schema-less, there is no
need to define the structure of a record beforehand.
Every controller can define its own structure and append
data to a “job” item. For example, for a given job, “run
email address regex over 10 million documents”, the
launch controller will add/update the “launch_status”
attribute along with “launch_starttime”, while the
monitor controller will add/update the “monitor_status”
and “hadoop_status” attributes with enumeration values
(running, completed, error, none). A GetStatus() call will
query Amazon SimpleDB and return the state of each
controller and also the overall status of the system.
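A minimal sketch of this schema-less status tracking with the AWS SDK for Java is shown below. The domain name is a placeholder and the helper methods are hypothetical, but the attribute names follow the ones used in the text.

```java
import java.util.Arrays;
import java.util.List;

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.GetAttributesRequest;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

/** Each controller writes only its own attributes to the job item; GetStatus() reads them all back. */
public class JobStatusStore {

    private static final String DOMAIN = "GrepTheWebJobs";  // placeholder SimpleDB domain

    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

    /** Called by the launch controller once the Hadoop cluster has been started. */
    public void recordLaunch(String jobId) {
        sdb.putAttributes(new PutAttributesRequest(DOMAIN, jobId, Arrays.asList(
                new ReplaceableAttribute("launch_status", "completed", true),
                new ReplaceableAttribute("launch_starttime",
                        Long.toString(System.currentTimeMillis()), true))));
    }

    /** Called by the monitor controller as the Hadoop job progresses. */
    public void recordMonitor(String jobId, String monitorStatus, String hadoopStatus) {
        sdb.putAttributes(new PutAttributesRequest(DOMAIN, jobId, Arrays.asList(
                new ReplaceableAttribute("monitor_status", monitorStatus, true),
                new ReplaceableAttribute("hadoop_status", hadoopStatus, true))));
    }

    /** Backs a GetStatus() call: returns every attribute any controller has written for the job. */
    public List<Attribute> getStatus(String jobId) {
        return sdb.getAttributes(new GetAttributesRequest(DOMAIN, jobId)).getAttributes();
    }
}
```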
Component services can query Amazon SimpleDB
anytime because controllers