Week 1 & 2: Spark Basics
It is important that the result of map and reduce does not depend on the order of the input.
MapReduce forms the basis for both Hadoop and Spark.
Spark is written in Scala, a Java-like language that runs on the Java Virtual Machine.
PySpark is a Python library for programming in Spark. It does not always achieve the same efficiency as Scala.
Spark Context
The first important object we use in Spark is the SparkContext. It connects our program, which runs on the controller node, to the rest of the nodes in the Spark cluster. The other nodes are controlled through it. A notebook can have only one SparkContext.
Resilient Distributed Datasets (RDD)
An RDD is an abstraction for data distributed across the nodes of the cluster: a list whose elements are spread over several computers. It is the main data structure in Spark. Once the data is in RDD form, the elements of the list are manipulated using RDD commands. An RDD can be created from a list on the master or from files, and the files can themselves be distributed across nodes in HDFS. An RDD can be collected back into a local list using the command collect().
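As a rough illustration of the round trip described above, the following PySpark sketch distributes a local list and collects it back. The helper names roundtrip and main and the partition count of 2 are assumptions made for this example; parallelize(), collect() and SparkContext.getOrCreate() are standard PySpark calls.

```python
def roundtrip(sc, local_list, num_partitions=2):
    """Distribute a local list as an RDD, then collect it back to the driver."""
    rdd = sc.parallelize(local_list, num_partitions)  # local list -> RDD
    return rdd.collect()                              # RDD -> local Python list


def main():
    # Imported inside the function so the file still loads where PySpark
    # is not installed (an assumption made for this sketch).
    from pyspark import SparkContext

    # getOrCreate() reuses a context that is already running, which fits
    # the one-context-per-notebook rule.
    sc = SparkContext.getOrCreate()
    try:
        print(roundtrip(sc, [1, 2, 3, 4]))
    finally:
        sc.stop()
```

Calling main() in an environment with PySpark installed should print the original list. Keep in mind that collect() pulls every element onto one machine, so it is only sensible for small RDDs.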
Spark Architecture
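Workers manage the partitions and executors.
Both map() and mapValues() transform the elements inside an RDD. map() is applied to each whole element, while mapValues() works on RDDs of (key, value) pairs and changes only the value, leaving the key untouched.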
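The difference between the two transformations can be sketched as follows. The function name map_vs_map_values and the sample pairs are made up for this example, and sc is assumed to be an existing SparkContext passed in by the caller.

```python
def map_vs_map_values(sc):
    """Apply map() and mapValues() to the same small RDD of (key, value) pairs."""
    pairs = sc.parallelize([("a", 1), ("b", 2)])

    # map() receives the whole element, here the full (key, value) tuple,
    # so it is free to change the key as well.
    swapped = pairs.map(lambda kv: (kv[1], kv[0])).collect()
    # swapped == [(1, "a"), (2, "b")]

    # mapValues() receives only the value; the key is carried over unchanged.
    doubled = pairs.mapValues(lambda v: v * 2).collect()
    # doubled == [("a", 2), ("b", 4)]

    return swapped, doubled
```

With a live context, for instance one obtained from SparkContext.getOrCreate(), calling map_vs_map_values(sc) returns both collected lists.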
Properties of reduce operations
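Reduce operations must not depend on the order:
- Order of operands should not matter (the operation is commutative)
- Order of application of the reduce operator should not matter (the operation is associative)
Multiplication and summation are good choices because both are commutative and associative.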
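One way to get a feel for these two properties is to test a candidate operator on a handful of sample values. The sketch below is plain Python, not Spark; the helper names is_commutative and is_associative and the sample values are assumptions for illustration. Passing on a few samples is evidence, not a proof, but a single failure is enough to rule an operator out.

```python
from itertools import product
import operator


def is_commutative(op, samples):
    """True if op(a, b) == op(b, a) for every pair of samples."""
    return all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))


def is_associative(op, samples):
    """True if (a op b) op c == a op (b op c) for every triple of samples."""
    return all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in product(samples, repeat=3)
    )


samples = [1, 2, 3, 5]
candidates = [
    ("add", operator.add),
    ("mul", operator.mul),
    ("sub", operator.sub),
    ("div", operator.truediv),
]
for name, op in candidates:
    print(name, is_commutative(op, samples), is_associative(op, samples))
# add True True
# mul True True
# sub False False
# div False False
```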
Why must reordering not change the result?
You can think about the reduce operation as a binary tree where the leaves are the elements of the list and the root is the final result. Each triplet of the form (parent, child1, child2) corresponds to a single application of the reduce function.
The order in which the reduce operation is applied is determined at run time and depends on how the RDD is partitioned across the cluster. There are many different orders to apply the reduce operation.
If we want the input RDD to uniquely determine the reduced value, all evaluation orders must yield the same final result. In addition, the order of the elements in the list must not change the result. In particular, reversing the order of the operands in a reduce function must not change the outcome.
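The binary-tree picture can be made concrete with a small plain-Python sketch that evaluates the same list in two different tree shapes. The names chain_reduce and tree_reduce are invented for this example and say nothing about how Spark actually schedules the work. They only show that for + and * the shape of the tree and the order of the list make no difference.

```python
from functools import reduce
import operator


def chain_reduce(op, items):
    """Left-to-right chain: ((a op b) op c) op d."""
    return reduce(op, items)


def tree_reduce(op, items):
    """Balanced binary tree: (a op b) op (c op d)."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return op(tree_reduce(op, items[:mid]), tree_reduce(op, items[mid:]))


data = [1, 2, 3, 4]
print(chain_reduce(operator.add, data), tree_reduce(operator.add, data))  # 10 10
print(chain_reduce(operator.mul, data), tree_reduce(operator.mul, data))  # 24 24
print(chain_reduce(operator.add, data[::-1]))                             # 10
```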
For example, the arithmetic operations multiply * and add + can be used in a reduce, but the operations subtract - and divide / should not.
Doing so will not raise an error, but the result is unpredictable.
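To see why the result is unpredictable, the sketch below imitates a partitioned reduce in plain Python: each partition is reduced locally and the partial results are then reduced again. The function partitioned_reduce, the sample list, and the contiguous partitioning scheme are simplifying assumptions for illustration, not a description of Spark's internals.

```python
from functools import reduce
import operator


def partitioned_reduce(op, items, num_partitions):
    """Reduce each contiguous partition, then reduce the partial results."""
    size = -(-len(items) // num_partitions)  # ceiling division
    partitions = [items[i:i + size] for i in range(0, len(items), size)]
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)


data = [1, 2, 3, 4]
for n in (1, 2, 4):
    sub_result = partitioned_reduce(operator.sub, data, n)
    add_result = partitioned_reduce(operator.add, data, n)
    print(n, sub_result, add_result)
# 1 -8 10
# 2 0 10
# 4 -8 10
```

No call fails, yet subtraction gives -8 or 0 depending only on the number of partitions, while addition gives 10 every time.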
Which of the following orders was executed?
- or










