Week 1 & 2: Spark Notebook Basics

The Spark context (sc) is the object through which a program communicates with the Spark system.

You can have only one Spark context per application, as Spark is designed around a single driver program.

To stop a Spark context, call sc.stop().

The RDD (Resilient Distributed Dataset) is the main data structure in Spark. You can think of it as a list whose elements are stored on different computers, with a handle residing in the driver node.

Once you define something as an RDD, reading it back actually takes longer: the data is no longer available locally, and bringing it back to the driver has a cost.

The simplest way to create an RDD is to take a list and call sc.parallelize.

Collect is the reverse of parallelize: it gathers the elements of the RDD at the head (driver) node.

Because it moves all the data to a single node, collect eliminates the benefits of parallelism.

Map: applies the given operation to each element of the RDD.

Reduce: combines the elements of the RDD into a single value using the given operation.

Usually the recommendation is one worker per core. You can have more than one worker per core, but this is usually unhelpful.

Execution Plans, Lazy Evaluation and Caching: tools for optimizing the time and memory use of your program.

Busy (eager) evaluation: to calculate the sum of squares we would square each of the elements and then sum them up, which requires storing the intermediate results.

Lazy evaluation: postpone computing the squares until the result is needed. There is no need to store intermediate results, and the data is scanned only once rather than twice.

Execution plan: at this point the variable interim does not point to an actual data structure but to an execution plan, also referred to as a dependence graph. The dependence graph defines how the RDDs are computed from each other. The dependence graph associated with an RDD can be printed using the method toDebugString().

Spark uses lazy evaluation to save both time and space.

When the same RDD is needed as input for several computations, it can be better to keep it in memory using cache().


Partitions and Gloming
Related to efficiency and runtime.
When you create an RDD you can define the number of partitions;
the default is the number of executors you specify as part of the Spark context.
You need at least as many partitions as you have executors.
When we filter an RDD, some partitions may end up with more elements than others, which causes some executors to go idle. One way to solve this is to repartition using a new key.



GLOM: In general, Spark does not allow workers to refer to specific elements of the RDD. This keeps the language clean, but can be a major limitation.

Glom() is a command that takes each partition of the RDD and transforms it into a tuple (an immutable list), creating an RDD of tuples, one tuple per partition. Workers can then refer to the elements of the partition by index, but you cannot assign values to elements: the RDD is immutable.

Actions and transformations on RDDs
Chaining: chaining is a way to concatenate transformations and actions in Spark, one after the other in a single expression.

Functions on RDD

Word count is the classical use case for showing how MapReduce works.
An RDD is a distributed immutable array. You cannot operate on an RDD directly, only through transformations and actions.
Transformations transform an RDD into another RDD;
actions output their results to the head node,
so after an action you are back down to just the head node.
RDD operations are added to an execution plan. The plan is executed when a result is needed. Explicit and implicit caching causes intermediate results to be saved.



Collect converts the RDD to a list on the head node.


There are three types of Spark operations: transformations, actions and shuffles. They differ mostly in how much communication they require between the nodes.
The simplest type is RDD-to-RDD transformations, such as map and filter, which require little or no communication between nodes.
Actions require more communication because they generate a result on the head node, e.g. reduce, collect, count, take.
RDD-to-RDD shuffles are also transformations, but ones that require a lot of communication between nodes, e.g. sort, distinct, repartition, sortByKey, reduceByKey, join. Because of this communication they are slower.


Lazy Evaluation

Unlike a regular Python program, map/reduce commands do not always perform any computation when they are executed. Instead, they construct something called an execution plan. Only when a result is needed does the computation start. This approach is also called lazy execution.

The benefit from lazy execution is in minimizing the number of memory accesses. Consider for example the following map/reduce commands:

A=RDD.map(lambda x:x*x).filter(lambda x: x%2==0)
A.reduce(lambda x,y:x+y)

These commands define the following plan. For each number x in the RDD:

  1. Compute the square of x
  2. Filter out x*x whose value is odd.
  3. Sum the elements that were not filtered out.

A naive execution plan is to square all items in the RDD, store the results in a new RDD, then perform a filtering pass, generating a second RDD, and finally perform the summation. Doing this will require iterating through the RDD three times, and creating 2 interim RDDs. As memory access is the bottleneck in this type of computation, the execution plan is slow.

A better execution plan is to perform all three operations on each element of the RDD in sequence, and then move to the next element. This plan is faster because we iterate through the elements of the RDD only once, and because we don't need to save the intermediate results. We need to maintain only one variable: the partial sum, and as that is a single variable, we can use a CPU register.

For more on RDDs and lazy evaluation, see the RDD programming guide in the Spark manual.

Lazy Execution refers to this type of behaviour. The system delays actual computation until the latest possible moment. Instead of computing the content of the RDD, it adds the RDD to the execution plan.

Using Lazy evaluation of a plan has two main advantages relative to immediate execution of each step:

  1. A single pass over the data, rather than multiple passes.
  2. Smaller memory footprint because no intermediate results are saved.

Partitioning and Gloming

  • When an RDD is created, you can specify the number of partitions.
  • The default is the number of workers defined when you set up the SparkContext.

Glom()

  • In general, Spark does not allow the worker to refer to specific elements of the RDD.
  • Keeps the language clean, but can be a major limitation.
  • glom() transforms each partition into a tuple (immutable list) of elements.
  • Creates an RDD of tuples. One tuple per partition.
  • Workers can refer to elements of the partition by index.
  • But you cannot assign values to the elements; the RDD is still immutable.
