
Showing posts from September, 2021

Week 1 & 2: Spark Notebook Basics

The Spark context is the way of communicating with the Spark system. You can have only one Spark context per application, since Spark is designed as a single-user system; to stop a Spark context, call sc.stop().

An RDD (Resilient Distributed Dataset) is the main data structure in Spark. You can think of it as a list whose elements are stored on different computers, with a component residing in the driver node. Once you define something as an RDD, it actually takes longer to read, because the data is no longer available locally and bringing it back has a cost.

The simplest way to create an RDD is to take a list and call sc.parallelize. collect is the reverse of parallelize: it gathers the elements of the RDD on the head node. Note that collect eliminates the benefits of parallelism.

map applies the given operation to each element of the RDD. reduce maps the RDD to a single value using a given operation. Usually the recommendation is one worker ...
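These operations have direct analogues on plain Python lists. A minimal sketch of the same semantics without a cluster (the corresponding PySpark calls are shown in comments; the variable names are my own):

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# In PySpark: rdd = sc.parallelize(data)
# Here a local list stands in for the distributed RDD.
rdd = list(data)

# In PySpark: squares = rdd.map(lambda x: x * x).collect()
squares = [x * x for x in rdd]

# In PySpark: total = rdd.reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, rdd)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

The real PySpark versions behave the same way, except that map runs in parallel on the workers and collect/reduce bring results back to the driver.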

Week 1 & 2: Spark Basics

It is important that the result of map and reduce does not depend on the order of the input. MapReduce forms the basis for both Hadoop and Spark. Spark is programmed in a Java-like language called Scala; PySpark is a Python library for programming Spark, though it does not always achieve the same efficiency as Scala.

Spark Context: The first important object we use in Spark is the Spark context. It is the piece that connects the software sitting on the driver node to all the nodes in a Spark cluster; control of the other nodes is achieved through it. A notebook can have only one Spark context.

Resilient Distributed Datasets (RDD): An abstraction that defines data distributed across the nodes in the cluster; a list whose elements are distributed over several computers. It is the main data structure in Spark. When data is in RDD form, the elements of the list are manipulated using RDD commands. An RDD can be created from a list on the master or from files. Fi...
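Why must the result not depend on input order? Because Spark reduces each partition independently and then combines the partial results, so the grouping of inputs is not guaranteed. A plain-Python sketch of this (the chunk boundaries here are hypothetical, chosen just to mimic partitions):

```python
from functools import reduce

data = [3, 1, 4, 1, 5, 9, 2, 6]
add = lambda a, b: a + b

# Reduce the whole list directly.
direct = reduce(add, data)

# Mimic Spark: reduce each "partition" first, then combine the partials.
partitions = [data[0:3], data[3:6], data[6:8]]
partials = [reduce(add, p) for p in partitions]
combined = reduce(add, partials)

# Addition is associative and commutative, so partitioning does not matter.
print(direct == combined)  # True

# A non-associative operation like subtraction is NOT safe:
print(reduce(lambda a, b: a - b, data))      # one grouping of the input
print(reduce(lambda a, b: a - b, partials))  # a different grouping
```

With subtraction the two groupings give different answers, which is exactly why Spark requires the reduce operation to be associative and commutative.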

Week 1 & 2: The Memory Hierarchy

Something is happening around the world: the amount of data being collected is increasing enormously. You can increase computer memory to process that data (1 GB, 10 GB, 128 GB), but at some point it runs out. Hence we set up compute clusters for parallel data processing, to load, analyze, and model complex data and make predictions using PySpark.

Storage Latency: What is it that makes computation on very large datasets slow? Every computer has two parts: CPU and memory. In big data applications the storage latency (the time taken to read from and write to disk) is much larger than the computational latency. Much of big data optimization tries to arrange CPU and memory (storage) in a way that minimizes latency and cost. With a given amount of money we can buy memory that is fast and small, or large and slow; what we want is memory that is both fast and large on the same budget. That is where cache memory comes into play. Cache hit: the CPU wants to read a memory location wh...
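The cache-hit idea can be sketched with a small simulation: a dict stands in for the fast, small cache and a list for the slow, large main memory (all names and sizes here are hypothetical, chosen only to illustrate hit vs. miss):

```python
# Simulated memory hierarchy: a small dict "cache" in front of a big "RAM" list.
main_memory = [addr * 10 for addr in range(1000)]  # slow, large storage
cache = {}                                         # fast, small storage
hits = misses = 0

def read(addr):
    """Read a memory location, going through the cache first."""
    global hits, misses
    if addr in cache:            # cache hit: the value is already nearby
        hits += 1
    else:                        # cache miss: fetch from slow main memory
        misses += 1
        cache[addr] = main_memory[addr]
    return cache[addr]

# Repeatedly reading the same few locations turns misses into hits.
for _ in range(3):
    for addr in (5, 6, 7):
        read(addr)

print(hits, misses)  # 6 3
```

The first pass over addresses 5, 6, 7 misses every time; the next two passes hit, which is why programs with good locality of reference spend far less time waiting on slow storage.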