Amazon Web Services
Cloud Architectures
Jinesh Varia, Technology Evangelist, Amazon Web Services (jvaria@amazon.com)

Introduction

This paper illustrates the style of building applications using services available in the Internet cloud. Cloud Architectures are designs of software applications that use Internet-accessible, on-demand services. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request), draw the necessary resources on demand (such as compute servers or storage), perform a specific job, then relinquish the unneeded resources and often dispose of themselves after the job is done. While in operation, the application scales up or down elastically based on resource needs.

This paper is divided into two sections. In the first section, we describe an example of an application that is currently in production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on demand, runs a parallel computation on them using an open-source distributed processing framework called Hadoop, then shuts down all the virtual servers, releasing all its resources back to the cloud, all with low programming effort and at a very reasonable cost for the caller. In the second section, we discuss some best practices for using each Amazon Web Service - Amazon S3, Amazon SQS, Amazon SimpleDB and Amazon EC2 - to build an industrial-strength, scalable application.

Keywords

Amazon Web Services, Amazon S3, Amazon EC2, Amazon SimpleDB, Amazon SQS, Hadoop, MapReduce, Cloud Computing

Why Cloud Architectures?

Cloud Architectures address key difficulties surrounding large-scale data processing. In traditional data processing it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them. Third, it is difficult to distribute and coordinate a large-scale job on different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to auto-scale up and down based on dynamic workloads. Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.

Applications built on Cloud Architectures run in-the-cloud, where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on demand, that are industrial-strength, and where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. The usage of resources in Cloud Architectures is as needed, sometimes ephemeral or seasonal, thereby providing the highest utilization and optimum bang for the buck.

Business Benefits of Cloud Architectures

There are some clear business benefits to building applications using Cloud Architectures. A few of these are listed here:

1. Almost zero upfront infrastructure investment: If you have to build a large-scale system, it may cost a fortune to invest in real estate, hardware (racks, machines, routers, backup power supplies), hardware management (power management, cooling), and operations personnel. Because of the upfront costs, it would typically need several rounds of management approvals before the project could even get started.
Now, with utility-style computing, there is no fixed cost or startup cost.

2. Just-in-time infrastructure: In the past, if you became famous and your systems or your infrastructure did not scale, you became a victim of your own success. Conversely, if you invested heavily and did not get famous, you became a victim of your failure. By deploying applications in-the-cloud with dynamic capacity management, software architects do not have to worry about pre-procuring capacity for large-scale systems. The solutions are low risk because you scale only as you grow. Cloud Architectures can relinquish infrastructure as quickly as you acquired it in the first place (in minutes).

3. More efficient resource utilization: System administrators usually worry about hardware procurement (when they run out of capacity) and better infrastructure utilization (when they have excess and idle capacity). With Cloud Architectures they can manage resources more effectively and efficiently by having the applications request and relinquish only the resources they need, on demand.

4. Usage-based costing: Utility-style pricing allows billing the customer only for the infrastructure that has been used. The customer is not liable for the entire infrastructure that may be in place. This is a subtle difference between desktop applications and web applications. A desktop application or a traditional client-server application runs on the customer's own infrastructure (PC or server), whereas in a Cloud Architectures application, the customer uses a third-party infrastructure and gets billed only for the fraction of it that was used.

5. Potential for shrinking the processing time: Parallelization is one of the great ways to speed up processing. If one compute-intensive or data-intensive job that can be run in parallel takes 500 hours to process on one machine, with Cloud Architectures it would be possible to spawn and launch 500 instances and process the same job in 1 hour. Having an elastic infrastructure available gives the application the ability to exploit parallelization in a cost-effective manner, reducing the total processing time.

Examples of Cloud Architectures

There are plenty of examples of applications that could utilize the power of Cloud Architectures. These range from back-office bulk processing systems to web applications.
Some are listed below:

• Processing pipelines
  • Document processing pipelines – convert hundreds of thousands of documents from Microsoft Word to PDF, OCR millions of pages/images into raw searchable text
  • Image processing pipelines – create thumbnails or low-resolution variants of an image, resize millions of images
  • Video transcoding pipelines – transcode AVI to MPEG movies
  • Indexing – create an index of web crawl data
  • Data mining – perform search over millions of records
• Batch processing systems
  • Back-office applications (in the financial, insurance or retail sectors)
  • Log analysis – analyze and generate daily/weekly reports
  • Nightly builds – perform automated builds of the source code repository every night, in parallel
  • Automated unit testing and deployment testing – deploy and run automated unit tests (functional, load, quality) on different deployment configurations every night
• Websites
  • Websites that "sleep" at night and auto-scale during the day
  • Instant websites – websites for conferences or events (Super Bowl, sports tournaments)
  • Promotion websites
  • "Seasonal websites" – websites that only run during the tax season or the holiday season ("Black Friday" or Christmas)

In this paper, we will discuss one application example in detail, code-named "GrepTheWeb".

Cloud Architecture Example: GrepTheWeb

The Alexa Web Search web service allows developers to build customized search engines against the massive data that Alexa crawls every night. One of the features of their web service allows users to query the Alexa search index and get Million Search Results (MSR) back as output. Developers can run queries that return up to 10 million results.

The resulting set, which represents a small subset of all the documents on the web, can then be processed further using a regular expression language. This allows developers to filter their search results using criteria that are not indexed by Alexa (Alexa indexes documents based on fifty different document attributes), thereby giving the developer the power to do more sophisticated searches. Developers can run regular expressions against the actual documents, even when there are millions of them, to search for patterns and retrieve the subset of documents that matched that regular expression.

This application is currently in production at Amazon.com and is code-named GrepTheWeb because it can "grep" (a popular Unix command-line utility for searching patterns) the actual web documents. GrepTheWeb allows developers to do some pretty specialized searches, like selecting documents that have a particular HTML tag or META tag, finding documents with particular punctuation ("Hey!", he said. "Why Wait?"), or searching for mathematical equations ("f(x) = ∑x + W"), source code, e-mail addresses, or other patterns such as "(dis)integration of life".

While the functionality is impressive, for us the way it was built is even more so. In the next section, we will zoom in to see different levels of the architecture of GrepTheWeb. Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, which is a sorted list of links gzipped (compressed using the Unix gzip utility) into a single file, is given to GrepTheWeb as input. It takes a regular expression as a second input. It then returns a filtered subset of document links, sorted and gzipped into a single file.
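To make that input/output contract concrete, here is a minimal, single-machine Python sketch of the core grep step. This is not GrepTheWeb's actual code: the file names and the fetch helper are hypothetical stand-ins, and the production system streams documents from the Alexa crawl on Amazon S3 and distributes the matching across a Hadoop cluster, as described in the following sections.

```python
import gzip
import re

def load_links(path):
    """Read the gzipped, newline-separated list of document links (MSR output)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def grep_documents(links, pattern, fetch):
    """Return the links whose document text matches the regular expression.

    `fetch` is a caller-supplied function mapping a link to its document body;
    in GrepTheWeb this role is played by reads from the crawl data on Amazon S3.
    """
    regex = re.compile(pattern)
    return sorted(link for link in links if regex.search(fetch(link)))

def write_links(path, links):
    """Write the filtered, sorted subset of links back out as a gzipped file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(links) + "\n")

if __name__ == "__main__":
    # Toy stand-in for the crawl: a small in-memory corpus keyed by link.
    corpus = {
        "http://example.com/a": "Contact us at team@example.com",
        "http://example.com/b": "No addresses here",
    }
    write_links("input.gz", sorted(corpus))          # fake MSR input file
    links = load_links("input.gz")
    matched = grep_documents(links, r"[\w.]+@[\w.]+", corpus.__getitem__)
    write_links("output.gz", matched)
    print(matched)  # ['http://example.com/a']
```

The interesting part of GrepTheWeb is not this inner loop but how the same operation is parallelized, scheduled, and billed across hundreds of machines on demand.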
Since the overall process is asynchronous, developers can get the status of their jobs by calling GetStatus() to see whether the execution is completed.

Performing a regular expression match against millions of documents is not trivial. Different factors could combine to cause the processing to take a lot of time:

• Regular expressions could be complex
• The dataset could be large, even hundreds of terabytes
• Request patterns are unknown, e.g., any number of people can access the application at any given point in time

Hence, the design goals of GrepTheWeb included the ability to scale in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result quality) while keeping the cost of processing down. The approach was to build an application that not only scales with demand, but also does so without a heavy upfront investment and without the cost of maintaining idle machines. To get a response in a reasonable amount of time, it was important to distribute the job into multiple tasks and to perform a distributed grep operation that runs those tasks on multiple nodes in parallel.

Figure 1: GrepTheWeb Architecture – Zoom Level 1 (inputs: a dataset of document URLs and a RegEx; output: the subset of document URLs that matched the RegEx; job status via GetStatus)

Zooming in further, the GrepTheWeb architecture looks as shown in Figure 2. It uses the following AWS components:

• Amazon S3 for retrieving input datasets and for storing the output dataset
• Amazon SQS for durably buffering requests, acting as a "glue" between controllers
• Amazon SimpleDB for storing intermediate status, logs, and user data about tasks
• Amazon EC2 for running a large distributed processing Hadoop cluster on demand
• Hadoop for distributed processing, automatic parallelization, and job scheduling

Workflow

GrepTheWeb is modular. It does its processing in four phases, as shown in Figure 3. The launch phase is responsible for validating and initiating the processing of a GrepTheWeb request, instantiating Amazon EC2 instances, launching the Hadoop cluster on them, and starting all the job processes. The monitor phase is responsible for monitoring the EC2 cluster and the map and reduce tasks, and for checking for success and failure. The shutdown phase is responsible for billing and for shutting down all Hadoop processes and Amazon EC2 instances, while the cleanup phase deletes Amazon SimpleDB transient data.

Detailed workflow for Figure 4:

1. On application start, queues are created if they do not already exist, and all the controller threads are started. Each controller thread starts polling its respective queue for messages.

2. When a StartGrep user request is received, a launch message is enqueued in the launch queue.

3. Launch phase: The launch controller thread picks up the launch message, executes the launch task, updates the status and timestamps in the Amazon SimpleDB domain, enqueues a new message in the monitor queue, and deletes the message from the launch queue after processing.
a. The launch task starts Amazon EC2 instances using an AMI with the JRE pre-installed, deploys the required Hadoop libraries, and starts a Hadoop job (running the map/reduce tasks).

b. Hadoop runs the map tasks on Amazon EC2 slave nodes in parallel. Each map task takes files (multithreaded, in the background) from Amazon S3, runs the regular expression (passed as a queue message attribute) against the file from Amazon S3, and writes the match results along with a description of up to 5 matches locally; the combine/reduce task then combines and sorts the results and consolidates the output.

c. The final results are stored on Amazon S3 in the output bucket.

4. Monitor phase: The monitor controller thread picks up this message, validates the status/error in Amazon SimpleDB, executes the monitor task, updates the status in the Amazon SimpleDB domain, enqueues a new message in the shutdown queue and the billing queue, and deletes the message from the monitor queue after processing.

a. The monitor task checks the Hadoop status (JobTracker success/failure) at regular intervals and updates the SimpleDB items with the status/error and the Amazon S3 output file.

5. Shutdown phase: The shutdown controller thread picks up this message from the shutdown queue, executes the shutdown task, updates the status and timestamps in the Amazon SimpleDB domain, and deletes the message from the shutdown queue after processing.

a. The shutdown task kills the Hadoop processes, terminates the Amazon EC2 instances after getting the EC2 topology information from Amazon SimpleDB, and disposes of the infrastructure.

b. The billing task gets the EC2 topology information, SimpleDB box usage, and Amazon S3 file and query input, calculates the bill, and passes it to the billing service.

6. Cleanup phase: Archives the SimpleDB data with the user info.

7. Users can execute GetStatus on the service endpoint to get the status of the overall system (all controllers and Hadoop) and download the filtered results from Amazon S3 after completion.

Figure 2: GrepTheWeb Architecture – Zoom Level 2
Figure 3: Phases of GrepTheWeb Architecture (launch, monitor, shutdown and cleanup phases coordinated through Amazon SQS, with user info and job status in Amazon SimpleDB and input/output files on Amazon S3)
Figure 4: GrepTheWeb Architecture – Zoom Level 3 (launch, monitor, shutdown and billing queues and controllers; a Hadoop cluster of one master and N slaves with HDFS on Amazon EC2; status in Amazon SimpleDB; input and output files on Amazon S3)

The Use of Amazon Web Services

In the next four subsections we present the rationale for using each AWS service and describe how GrepTheWeb uses it.

How Was Amazon S3 Used

In GrepTheWeb, Amazon S3 acts as an input as well as an output data store. The input to GrepTheWeb is the web itself (a compressed form of Alexa's web crawl), stored on Amazon S3 as objects and updated frequently. Because the web crawl dataset can be huge (usually in terabytes) and is always growing, there was a need for distributed, bottomless persistent storage. Amazon S3 proved to be a perfect fit.

How Was Amazon SQS Used

Amazon SQS was used as the message-passing mechanism between components. It acts as the "glue" that wired the different functional components together; a minimal sketch of a controller built around this pattern follows.
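The sketch below shows the shape of the launch-controller loop under this pattern, written against the modern boto3 SDK rather than the SDK GrepTheWeb actually used. The queue names, the SimpleDB domain, the message format, and the run_launch_task helper are hypothetical, and the queues are assumed to exist already (step 1 of the workflow creates them). Each controller does only this: receive a message from its queue, execute its task, record its status in Amazon SimpleDB, enqueue a message for the next phase, and delete the processed message.

```python
import json
import time

import boto3

sqs = boto3.client("sqs")
sdb = boto3.client("sdb")

# Hypothetical queue names; GrepTheWeb's launch queue feeds its monitor queue.
launch_q = sqs.get_queue_url(QueueName="launch-queue")["QueueUrl"]
monitor_q = sqs.get_queue_url(QueueName="monitor-queue")["QueueUrl"]

def run_launch_task(job):
    """Placeholder for starting EC2 instances and the Hadoop job for this request."""
    ...

while True:
    # Long-poll this controller's queue; the other controllers poll their own.
    resp = sqs.receive_message(QueueUrl=launch_q, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])          # e.g. {"job_id": ..., "regex": ...}
        run_launch_task(job)

        # Record this controller's status in the shared SimpleDB domain.
        sdb.put_attributes(
            DomainName="jobs",
            ItemName=job["job_id"],
            Attributes=[
                {"Name": "launch_status", "Value": "completed", "Replace": True},
                {"Name": "launch_starttime", "Value": str(time.time()), "Replace": True},
            ],
        )

        # Hand the job to the next phase, then remove the processed message.
        sqs.send_message(QueueUrl=monitor_q, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=launch_q, ReceiptHandle=msg["ReceiptHandle"])
```

Because the only coupling between phases is the queue message, a slow or failed controller never blocks the sender; messages simply accumulate in its queue until it catches up.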
Using queues as the glue not only made the different components loosely coupled, but also helped in building an overall more failure-resilient system.

Buffer

If one component is receiving and processing requests faster than other components (an unbalanced producer-consumer situation), buffering helps make the overall system more resilient to bursts of traffic (or load). Amazon SQS acts as a transient buffer between two components (controllers) of the GrepTheWeb system. If a message were sent directly to a component, the receiver would need to consume it at a rate dictated by the sender. For example, if the billing system was slow or if the launch time of the Hadoop cluster was longer than expected, the overall system would slow down, as it would just have to wait. With message queues, sender and receiver are decoupled and the queue service smooths out any "spiky" message traffic.

Isolation

Interaction between any two controllers in GrepTheWeb is through messages in a queue; no controller directly calls any other controller. All communication and interaction happens by storing messages in a queue (enqueue) and retrieving messages from a queue (dequeue). This makes the entire system loosely coupled and the interfaces simple and clean. Amazon SQS provided a uniform way of transferring information between the different application components. Each controller's function is to retrieve a message, process it (execute its function), and store a new message in another queue, while remaining completely isolated from the other controllers.

Asynchrony

As it was difficult to know how much time each phase would take to execute (e.g., the launch phase decides dynamically how many instances need to start based on the request, and hence its execution time is unknown), Amazon SQS helped in building asynchronous systems. Now, if the launch phase takes more time to process or the monitor phase fails, the other components of the system are not affected and the overall system is more stable and highly available.

How Was Amazon SimpleDB Used

One use for a database in Cloud Architectures is to track statuses. Since the components of the system are asynchronous, there is a need to obtain the status of the system at any given point in time. Moreover, since all components are autonomous and discrete, there is a need for a queryable datastore that captures the state of the system.

Because Amazon SimpleDB is schema-less, there is no need to define the structure of a record beforehand. Every controller can define its own structure and append data to a "job" item. For example: for a given job, "run email address regex over 10 million documents", the launch controller will add/update the "launch_status" attribute along with the "launch_starttime", while the monitor controller will add/update the "monitor_status" and "hadoop_status" attributes with enumeration values (running, completed, error, none). A GetStatus() call will query Amazon SimpleDB and return the state of each controller and also the overall status of the system. Component services can query Amazon SimpleDB at any time, because each controller records its state there independently of the others. A sketch of this status-tracking pattern follows.
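The sketch below shows how such schema-less status tracking can look, again using boto3 and in the same spirit as the controller loop above. The "jobs" domain name, the job id, and the rule for deriving the overall status are illustrative assumptions rather than GrepTheWeb's actual code.

```python
import boto3

sdb = boto3.client("sdb")
DOMAIN = "jobs"  # hypothetical SimpleDB domain shared by all controllers

def update_status(job_id, **attrs):
    """Each controller adds or overwrites only its own attributes on the job item."""
    sdb.put_attributes(
        DomainName=DOMAIN,
        ItemName=job_id,
        Attributes=[{"Name": k, "Value": v, "Replace": True} for k, v in attrs.items()],
    )

def get_status(job_id):
    """GetStatus(): read every controller's attributes and derive an overall status."""
    resp = sdb.get_attributes(DomainName=DOMAIN, ItemName=job_id, ConsistentRead=True)
    attrs = {a["Name"]: a["Value"] for a in resp.get("Attributes", [])}
    phases = ("launch_status", "monitor_status", "shutdown_status")
    if any(attrs.get(p) == "error" for p in phases):
        overall = "error"
    elif all(attrs.get(p) == "completed" for p in phases):
        overall = "completed"
    else:
        overall = "running"
    return overall, attrs

# Example: the launch and monitor controllers report in, then a client polls.
update_status("job-42", launch_status="completed", launch_starttime="2008-01-01T00:00:00Z")
update_status("job-42", monitor_status="running", hadoop_status="running")
print(get_status("job-42"))   # ('running', {...})
```

Because no schema is declared up front, a new controller (for example, the billing controller) can start writing its own attributes to the same job item without any coordination with the existing controllers.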