Week 1 & 2: Spark Notebook Basics
The SparkContext is the way a program communicates with the Spark system. You can have only one active SparkContext per application, since Spark is designed around a single driver process. To stop a SparkContext, call sc.stop().

An RDD (Resilient Distributed Dataset) is the main data structure in Spark. You can think of it as a list whose elements are stored across different computers, with a handle residing in the driver node. Once you define something as an RDD, reading it takes longer, because the data is no longer available locally and bringing it back to the driver has a cost.

The simplest way to create an RDD is to take a list and call sc.parallelize on it. collect is the reverse of parallelize: it gathers the elements of the RDD back into the head node. Because collect brings everything to one machine, it eliminates the benefits of parallelism.

Map- Applies the given operation to each element of the RDD.
Reduce- Combines the elements of the RDD into a single value using a given operation.

Usually the recommendation is one worker ...