Week 1 & 2: Spark Notebook Basics

The Spark context (sc) is the object through which a program communicates with the Spark system.

You can have only one Spark context per application, as Spark is designed around a single driver program.

To stop a Spark context, call sc.stop().

The RDD (Resilient Distributed Dataset) is the main data structure in Spark. You can think of it as a list whose elements are stored on different computers, with a handle residing in the driver node.

Once you define something as an RDD, reading it back actually takes longer: the data is no longer available locally, and bringing it back to the driver has a cost.

The simplest way to create an RDD is to take a list and call sc.parallelize.

Collect is the reverse of parallelize: it gathers the elements of the RDD at the head (driver) node.

Because it moves all the data to a single node, collect eliminates the benefits of parallelism.

Map: applies the given operation to each element of the RDD.

Reduce: combines the elements of the RDD into a single value using the given operation.

Usually the recommendation is one worker per core. You can have more than one worker per core, but this is usually unhelpful.

Execution Plans, Lazy Evaluation and Caching: tools for optimizing the time and memory use of your program.

Busy (eager) evaluation: to calculate the sum of squares we would square each of the elements and then sum them up, which requires storing the intermediate results.

Lazy evaluation: postpone computing the squares until the result is needed. There is no need to store intermediate results, and the data is scanned only once rather than twice.

Execution plan: at this point the variable interim does not point to an actual data structure but to an execution plan, also referred to as a dependence graph. The dependence graph defines how the RDDs are computed from each other. The dependence graph associated with an RDD can be printed using the method toDebugString().

Spark uses lazy evaluation to save both time and space.

When the same RDD is needed as input for several computations, it can be better to keep it in memory using cache().


Partitions and Gloming
Related to efficiency and runtime.
When you create an RDD you can define the number of partitions;
the default is the number of executors you specify as part of the Spark context.
You need at least as many partitions as you have executors.
When we filter an RDD, some partitions may end up with more elements than others, which causes some executors to go idle. One way to solve this is to repartition using a new key.



GLOM: In general, Spark does not allow workers to refer to specific elements of the RDD. This keeps the language clean, but can be a major limitation.

Glom() is a command that takes each partition of the RDD and transforms it into a tuple (an immutable list), creating an RDD of tuples, one tuple per partition. Workers can then refer to the elements of the partition by index, but you cannot assign values to elements: the RDD is immutable.

Actions and transformations on RDDs
Chaining: chaining is a way to concatenate transformations and actions in Spark, one after the other in a single expression.

Functions on RDD

Word count is the classical use case for showing how MapReduce works.
An RDD is a distributed immutable array. You cannot operate on an RDD directly, only through transformations and actions.
Transformations transform an RDD into another RDD;
actions output their results to the head node,
so after an action you are back down to just the head node.
RDD operations are added to an execution plan. The plan is executed when a result is needed. Explicit and implicit caching causes intermediate results to be saved.



Collect converts the RDD to a list on the head node.


There are three types of Spark operations: transformations, actions and shuffles. They differ mostly in how much communication they require between the nodes.
The simplest type is RDD-to-RDD transformations, such as map and filter, which require little or no communication between nodes.
Actions require more communication because they generate a result on the head node, e.g. reduce, collect, count, take.
RDD-to-RDD shuffles are also transformations, but ones that require a lot of communication between nodes, e.g. sort, distinct, repartition, sortByKey, reduceByKey, join. Because of this communication they are slower.


Lazy Evaluation

Unlike a regular Python program, map/reduce commands do not always perform any computation when they are executed. Instead, they construct something called an execution plan. Only when a result is needed does the computation start. This approach is also called lazy execution.

The benefit from lazy execution is in minimizing the number of memory accesses. Consider for example the following map/reduce commands:

A=RDD.map(lambda x:x*x).filter(lambda x: x%2==0)
A.reduce(lambda x,y:x+y)

These commands define the following plan. For each number x in the RDD:

  1. Compute the square of x
  2. Filter out x*x whose value is odd.
  3. Sum the elements that were not filtered out.

A naive execution plan is to square all items in the RDD, store the results in a new RDD, then perform a filtering pass, generating a second RDD, and finally perform the summation. Doing this will require iterating through the RDD three times, and creating 2 interim RDDs. As memory access is the bottleneck in this type of computation, the execution plan is slow.

A better execution plan is to perform all three operations on each element of the RDD in sequence, and then move to the next element. This plan is faster because we iterate through the elements of the RDD only once, and because we don't need to save the intermediate results. We need to maintain only one variable: the partial sum, and as that is a single variable, we can use a CPU register.

For more on RDDs and lazy evaluation, see the RDD programming guide in the Spark manual.

Lazy Execution refers to this type of behaviour. The system delays actual computation until the latest possible moment. Instead of computing the content of the RDD, it adds the RDD to the execution plan.

Using Lazy evaluation of a plan has two main advantages relative to immediate execution of each step:

  1. A single pass over the data, rather than multiple passes.
  2. Smaller memory footprint because no intermediate results are saved.

Partitioning and Gloming

  • When an RDD is created, you can specify the number of partitions.
  • The default is the number of workers defined when you set up the SparkContext.

Glom()

  • In general, Spark does not allow the worker to refer to specific elements of the RDD.
  • Keeps the language clean, but can be a major limitation.
  • glom() transforms each partition into a tuple (immutable list) of elements.
  • Creates an RDD of tuples. One tuple per partition.
  • Workers can refer to elements of the partition by index.
  • But you cannot assign values to the elements; the RDD is still immutable.
