Week 1 & 2: Spark Notebook Basics
The SparkContext is the way of communicating with the Spark system.
You can have only one SparkContext per program, as Spark is designed as a single-user system.
To stop a SparkContext, call sc.stop().
RDD (Resilient Distributed Dataset) is the main data structure in Spark. You can think of it as a list whose elements are stored on different computers, with a component residing on the driver node.
Once you define something as an RDD, it actually takes longer to read, because the data is no longer available locally and bringing it back to the driver costs time.
The simplest way to create an RDD is to take a list and then call sc.parallelize.
Collect is the reverse of parallelize and will collect the elements of the RDD on the head node.
Collect eliminates the benefits of parallelism.
Map - applies the given operation to each element of the RDD.
Reduce - combines the elements of the RDD into a single value using a given operation.
Usually the recommendation is one worker per core. You can have more than one worker per core, but this is usually unhelpful.
Execution Plans, Lazy Evaluation and Caching - how to optimize the time and memory use of your program.
Eager ("busy") evaluation: to calculate the sum of squares, we square each of the elements and then sum them up; this requires storing the intermediate results.
Lazy evaluation: postpone computing the squares until the result is needed. There is no need to store intermediate results, and we scan through the data only once rather than twice.
Execution Plan: At this point the variable interim does not point to an actual data structure but to an execution plan, also referred to as a dependence graph. The dependence graph defines how the RDDs are computed from each other. The dependence graph associated with an RDD can be printed using the method toDebugString.
Spark uses lazy evaluation to save both time and space
Lazy Evaluation
Unlike a regular python program, map/reduce commands do not always perform any computation when they are executed. Instead, they construct something called an execution plan. Only when a result is needed does the computation start. This approach is also called lazy execution.
The benefit of lazy execution is in minimizing the number of memory accesses. Consider for example the following map/reduce commands:
A=RDD.map(lambda x:x*x).filter(lambda x: x%2==0)
A.reduce(lambda x,y:x+y)
The commands define the following plan. For each number x in the RDD:
- Compute the square x*x.
- Filter out the values x*x that are odd.
- Sum the elements that were not filtered out.
A naive execution plan is to square all items in the RDD, store the results in a new RDD, then perform a filtering pass, generating a second RDD, and finally perform the summation. Doing this will require iterating through the RDD three times, and creating 2 interim RDDs. As memory access is the bottleneck in this type of computation, the execution plan is slow.
A better execution plan is to perform all three operations on each element of the RDD in sequence, and then move to the next element. This plan is faster because we iterate through the elements of the RDD only once, and because we don't need to save the intermediate results. We need to maintain only one variable: the partial sum, and as that is a single variable, we can use a CPU register.
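The same single-pass idea can be seen in plain Python using generator expressions, which are also lazily evaluated:

```python
nums = range(1, 11)

# Nothing is computed when these two lines run; they only build a pipeline.
squares = (x * x for x in nums)
evens = (s for s in squares if s % 2 == 0)

# sum() pulls elements through the pipeline one at a time: a single pass,
# with only the running total kept in memory, no intermediate lists.
total = sum(evens)
```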
For more on RDDs and lazy evaluation, see the RDD programming guide in the Spark manual.
Lazy execution refers to this type of behaviour: the system delays actual computation until the latest possible moment. Instead of computing the content of the RDD, it adds the RDD to the execution plan.
Using Lazy evaluation of a plan has two main advantages relative to immediate execution of each step:
- A single pass over the data, rather than multiple passes.
- Smaller memory footprint, because no intermediate results are saved.