
Presentation on the topic "MapReduce: Simplified Data Processing on Large Clusters". Author: Fatih. File: "MapReduce: Simplified Data Processing on Large Clusters.pptx". Zip archive size: 187 KB.

MapReduce: Simplified Data Processing on Large Clusters

Contents of the presentation "MapReduce: Simplified Data Processing on Large Clusters.pptx"

1 MapReduce: Simplified Data Processing on Large Clusters

Süleyman Fatih G?R?? 500512009

2 CONTENT

1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 Execution Overview 3.2 Master Data Structures 3.3 Fault Tolerance 3.4 Backup Tasks

3 CONTENT

4. Refinements 4.1 Partitioning Function 4.2 Combiner Function 4.3 Input and Output Types 4.4 Side-effects 4.5 Skipping Bad Records 5. Experience 6. Conclusions

4 1. Introduction

MapReduce is a programming model and an associated implementation for processing and generating large data sets. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

5 1. Introduction

Inspired by the map and reduce primitives present in Lisp and many other functional languages. Map() applies a given function to each element of a list; Reduce() recombines the results through a given combining operation. This enables automatic parallelization and distribution of large-scale computations, with high performance on large clusters.
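As a minimal illustration of the two primitives (the numbers here are our own, not from the paper):

    from functools import reduce

    # map(): apply a given function to each element of a list.
    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

    # reduce(): recombine the elements with a given combining operation.
    total = reduce(lambda acc, x: acc + x, squares, 0)   # 1 + 4 + 9 + 16 = 30

    print(squares, total)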

6 2. Programming Model

Takes a set of input key/value pairs and produces a set of output key/value pairs. Expresses the computation with two functions: Map and Reduce. Map takes an input pair and produces a set of intermediate key/value pairs. Reduce merges together the values sent from Map to form a possibly smaller set of values.
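In the paper's notation the types are map: (k1, v1) -> list(k2, v2) and reduce: (k2, list(v2)) -> list(v2). A skeletal Python rendering of the two user-supplied functions (names and type hints are ours):

    from typing import Any, Iterable, Tuple

    Pair = Tuple[Any, Any]

    def map_fn(key: Any, value: Any) -> Iterable[Pair]:
        # User-supplied: takes one input pair and produces a set of
        # intermediate key/value pairs.
        ...

    def reduce_fn(key: Any, values: Iterable[Any]) -> Iterable[Any]:
        # User-supplied: merges together all values emitted for one
        # intermediate key, forming a possibly smaller set of values.
        ...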

7 2. Programming Model

8 2.1 Example

Fruits and the number of units sold are given: Apple 9, Banana 10, Apple 2, Strawberry 5, Strawberry 9, Apple 4, Strawberry 3, Banana 8

9 2.1 Example

We split them into 2 pieces: [Apple 9, Banana 10, Apple 2, Strawberry 5] and [Strawberry 9, Apple 4, Strawberry 3, Banana 8]

10 2.1 Example

Within each piece, counts for the same fruit are merged: [Apple 11, Banana 10, Strawberry 5] and [Strawberry 12, Apple 4, Banana 8]

11 2.1 Example

The same data, viewed as key/value pairs: the fruit name is the key and the count is the value: [Apple 11, Banana 10, Strawberry 5] and [Strawberry 12, Apple 4, Banana 8]

12 2.2 More Examples

Word Count, Distributed Grep, Count of URL Access Frequency, Reverse Web-Link Graph, Term-Vector per Host, Inverted Index, Distributed Sort

13 2.1 Example

As a result we have: Apple 15, Banana 18, Strawberry 17.
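The whole example fits in a few lines of Python; this sequential sketch (ours, for illustration only) replays the split, the per-piece merge, and the final reduce that the slides just walked through:

    from collections import defaultdict

    sales = [("Apple", 9), ("Banana", 10), ("Apple", 2), ("Strawberry", 5),
             ("Strawberry", 9), ("Apple", 4), ("Strawberry", 3), ("Banana", 8)]

    splits = [sales[:4], sales[4:]]          # split the input into 2 pieces

    # Map phase: each piece merges its own counts locally
    # (Apple 9 + Apple 2 -> Apple 11), then results are grouped by key.
    intermediate = defaultdict(list)
    for piece in splits:
        local = defaultdict(int)
        for fruit, n in piece:
            local[fruit] += n
        for fruit, n in local.items():
            intermediate[fruit].append(n)

    # Reduce phase: merge all values for each fruit.
    result = {fruit: sum(ns) for fruit, ns in intermediate.items()}
    print(result)   # {'Apple': 15, 'Banana': 18, 'Strawberry': 17}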

14 3. Implementation

Many different implementations of the MapReduce interface are possible; the right choice depends on the environment. We will mostly examine the implementation targeted at the computing environment in wide use at Google.

15 3. Implementation

(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine. (2) Commodity networking hardware is used: typically either 100 megabits/second or 1 gigabit/second at the machine level.

16 3. Implementation

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common. (4) Storage is provided by inexpensive IDE disks attached directly to individual machines. (5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

17 3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).

18 3.1 Execution Overview

1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on the cluster.

19 3.1 Execution Overview

2. One of the copies of the program is special – the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

20 3.1 Execution Overview

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

21 3.1 Execution Overview

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master.

22 3.1 Execution Overview

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.

23 3.1 Execution Overview

6. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

24 3.1 Execution Overview

7. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.

25 3.1 Execution Overview

8. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
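The eight steps compress into a short sequential sketch (ours; there are no real workers or RPCs here, and the R = 2 sizing is arbitrary) that preserves the same data flow: M input splits, map output partitioned into R regions by hash(key) mod R, and a sorted reduce pass per region:

    from collections import defaultdict

    R = 2   # number of reduce partitions (arbitrary for this sketch)

    def run_mapreduce(splits, map_fn, reduce_fn):
        # Steps 3-4: each "map worker" handles one split and partitions
        # its intermediate pairs into R regions using hash(key) mod R.
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for key, value in split:
                for k2, v2 in map_fn(key, value):
                    regions[hash(k2) % R][k2].append(v2)

        # Steps 5-7: each "reduce worker" takes one region, sorts its
        # keys, and calls reduce_fn once per unique intermediate key.
        output = []
        for region in regions:
            for k2 in sorted(region):
                output.extend(reduce_fn(k2, region[k2]))
        return output   # step 8: control returns to the user code

    # Word count over M = 2 splits:
    splits = [[("doc1", "apple banana apple")], [("doc2", "banana apple")]]
    print(run_mapreduce(splits,
                        lambda k, v: [(w, 1) for w in v.split()],
                        lambda k, vs: [(k, sum(vs))]))
    # [('apple', 3), ('banana', 2)] (order depends on the hash partitioning)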

26 3.2 Master Data Structures

For each map task and reduce task, the master stores: the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks). For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed.
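A direct transcription of this bookkeeping into Python dataclasses (a sketch; the field names are ours):

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Task:
        state: str = "idle"           # "idle", "in-progress", or "completed"
        worker: Optional[str] = None  # worker machine identity (non-idle tasks)

    @dataclass
    class MapTask(Task):
        # Filled in as the task completes: location and size of each of
        # the R intermediate file regions this map task produced.
        regions: List[Tuple[str, int]] = field(default_factory=list)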

27 3.3 Fault Tolerance

Worker Failure: If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state. Completed map tasks are re-executed on a failure because their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
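In sketch form (ours, building on the Task objects above), the master's reaction to a worker timeout looks like this:

    def handle_worker_failure(dead_worker, map_tasks, reduce_tasks):
        for t in map_tasks:
            # Map output lives on the failed machine's local disk, so even
            # completed map tasks must be reset and re-executed elsewhere.
            if t.worker == dead_worker:
                t.state, t.worker = "idle", None
        for t in reduce_tasks:
            # Completed reduce output is in the global file system and
            # survives; only in-progress reduce work is rescheduled.
            if t.worker == dead_worker and t.state == "in-progress":
                t.state, t.worker = "idle", None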

28 3.3 Fault Tolerance

Master Failure: The master writes periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state. Since there is only a single master, its failure is unlikely; the current implementation therefore aborts the MapReduce computation if the master fails.

29 3.4 Backup Tasks

One of the causes that lengthens the total time taken for a MapReduce operation is a straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Solution: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
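As a sketch (ours; the worker-assignment API is invented for illustration), the backup-task mechanism amounts to:

    def schedule_backups(tasks, idle_workers):
        # Called when the operation is close to completion: give every
        # remaining in-progress task a second, redundant execution.
        for task in tasks:
            if task.state == "in-progress" and idle_workers:
                idle_workers.pop().assign(task)   # hypothetical worker API

    def on_execution_finished(task):
        # Whether the primary or the backup finishes first, the task is
        # simply marked completed; the slower copy's output is ignored.
        task.state = "completed"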

30 4. Refinements

Although the basic functionality provided by simply writing Map and Reduce functions is sufficient for most needs, a few extensions have proved useful.

31 4.1 Partitioning Function

The users specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g. "hash(key) mod R").
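A sketch of the default partitioner (ours; crc32 stands in for the library's hash function because Python's built-in hash() is not stable across worker processes):

    from zlib import crc32

    R = 4   # number of reduce tasks / output files requested by the user

    def default_partition(key: str) -> int:
        # "hash(key) mod R": every occurrence of a key, on every map
        # worker, lands in the same one of the R reduce partitions.
        return crc32(key.encode()) % R

    print(default_partition("Apple"), default_partition("Banana"))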

32 4.2 Combiner Function

In the word-count example, each map task produces many pairs of the form (word, 1). All of these counts are sent over the network to a single reduce task and then added together by the Reduce function to produce one number. The user is allowed to specify an optional Combiner function that does partial merging of this data before it is sent over the network. The Combiner function is executed on each machine that performs a map task.
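For word count the combiner typically runs the same merging logic as the reducer, but over one map task's output only; a sketch (ours):

    from collections import defaultdict

    def combine(map_output):
        # Partially merge the (word, 1) pairs of a single map task, so the
        # network carries ("the", 417) instead of 417 separate pairs.
        partial = defaultdict(int)
        for word, count in map_output:
            partial[word] += count
        return list(partial.items())

    print(combine([("the", 1), ("quick", 1), ("the", 1)]))
    # [('the', 2), ('quick', 1)]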

33 4.3 Input and Output Types

The MapReduce library provides support for reading input data in several different formats. For example, "text" mode input treats each line as a key/value pair: the key is the offset in the file and the value is the contents of the line. Another common supported format stores a sequence of key/value pairs sorted by key.
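A sketch of what "text" mode produces (the reader function is our own stand-in, not the library's API):

    def text_reader(path):
        # "text" mode: yield one key/value pair per line, where the key is
        # the byte offset of the line in the file and the value is the line.
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                yield offset, line.decode().rstrip("\n")
                offset += len(line)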

34 4.4 Side-effects

In some cases, users of MapReduce have found it convenient to produce auxiliary files as additional outputs from their map and/or reduce operators. Typically the application writes to a temporary file and atomically renames this file once it has been fully generated.
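The write-then-rename idiom in Python (a sketch; os.replace is atomic when source and destination are on the same file system):

    import os

    def write_atomically(path, data):
        tmp = path + ".tmp"          # temporary file next to the target
        with open(tmp, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # ensure the bytes reach the disk
        os.replace(tmp, path)        # atomic rename: readers never see
                                     # a partially written output file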

35 4.5 Skipping Bad Records

Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records, and sometimes it is acceptable to ignore a few records (e.g. when doing statistical analysis on a large data set). The MapReduce library detects which records cause deterministic crashes and skips these records in order to make forward progress.
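A much-simplified, single-process sketch of the mechanism (ours; the real library uses signal handlers and reports the failing record's sequence number to the master, which then orders the record skipped on re-execution):

    def run_task_skipping_bad(records, user_fn):
        failures = {}                    # record sequence number -> crash count
        while True:
            try:
                for seq, record in enumerate(records):
                    if failures.get(seq, 0) > 1:
                        continue         # crashed more than once: skip it
                    try:
                        user_fn(record)
                    except Exception:
                        failures[seq] = failures.get(seq, 0) + 1
                        raise            # the "worker" dies on this record
                return                   # clean pass: task completed
            except Exception:
                pass                     # simulate re-execution of the task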

36 5. Experience

It has been used across a wide range of domains within Google, including: large-scale machine learning problems; clustering problems for the Google News and Froogle products; extraction of data used to produce reports of popular queries (e.g. Google Zeitgeist); extraction of properties of web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of web pages for localized search); and large-scale graph computations.

37 5. Experience

38 6. Conclusions

The model is easy to use, even for programmers without experience with parallel and distributed systems. A large variety of problems are easily expressible as MapReduce computations. The implementation scales to large clusters comprising thousands of machines.

39 6. Conclusions

Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Network bandwidth is a limited resource; a number of optimizations in this system reduce the amount of data sent across the network. Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.

40 THANKS FOR LISTENING

Source: http://900igr.net/prezentacija/anglijskij-jazyk/mapreduce-simplified-data-processing-on-large-clusters-149850.html
