Week 1 & 2: The Memory Hierarchy
Something is happening around the world: the amount of data being collected is increasing enormously. You can keep adding memory to a single computer to process that data, 1 GB, 10 GB, 128 GB, but at some point it runs out. Hence we set up compute clusters for parallel data processing.
Goal: load, analyze, and model complex data, and make predictions, using PySpark.
Storage Latency
What is it that makes computation on very large data sets slow? Every computer has two main parts: the CPU and memory. In big data applications the storage latency (the time taken to read from and write to disk) is much larger than the computational latency.
Much of big data optimization is about arranging CPU and memory (storage) so as to minimize latency and cost.
With a given amount of money we can buy memory that is fast and small, or large and slow.
What we want is memory that is both fast and large within the same budget.
That is where cache memory comes into play.
Cache hit: the CPU wants to read a memory location that is present in the cache.
Cache miss: the CPU wants to read a memory location that is not present in the cache.
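The same hit/miss bookkeeping shows up in software caches, which makes it easy to see concretely. A minimal sketch using Python's `functools.lru_cache` (the memoized `fib` function is just an illustration, not from the course):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # A call for an already-computed n is a cache hit (no work done);
    # a first-time n is a cache miss that triggers the real computation.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(10)
info = fib.cache_info()
# info.hits and info.misses count how often the cache saved work.
```

Temporal locality is exactly what makes this cache effective: the same subproblems (`fib(8)`, `fib(7)`, ...) are requested repeatedly in quick succession.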
Temporal locality occurs when a program accesses the same memory location multiple times in quick succession.
Spatial locality occurs when a program reads memory locations that are close to each other.
For caching to be effective we need either temporal locality or spatial locality.
Access locality is the ability of software to make good use of the cache.
Memory is broken into pages, and software that uses the same or neighboring pages repeatedly is said to have good access locality.
A linked list has poor spatial locality, while an indexed array has good spatial locality.
This means that if you scan a two-dimensional array row by row (the order in which it is laid out in memory), the accesses are more local and the scan is faster than scanning column by column.
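A rough illustration in plain Python (a sketch only: Python lists store pointers, so the locality effect is weaker than in C or NumPy arrays, and absolute timings vary by machine):

```python
import timeit

N = 500
matrix = [[1] * N for _ in range(N)]  # each row is one contiguous list

def sum_row_major():
    # Visits elements in the order the rows are laid out: good spatial locality.
    total = 0
    for row in matrix:
        for x in row:
            total += x
    return total

def sum_col_major():
    # Jumps to a different row on every access: poor spatial locality.
    total = 0
    for j in range(N):
        for i in range(N):
            total += matrix[i][j]
    return total

# Both orders compute the same sum; only the access pattern differs.
assert sum_row_major() == sum_col_major() == N * N

# On most machines the row-major scan is measurably faster:
row_t = timeit.timeit(sum_row_major, number=3)
col_t = timeit.timeit(sum_col_major, number=3)
```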
Maximum clock rates have stalled at around 3 GHz, and that is not going to change in the near future.
So instead of relying on a faster single CPU, we move data around between many machines for parallel, faster processing.
The memory hierarchy
Small and fast memory is nearest to the CPU
Large and slow memory is farthest from the CPU
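The hierarchy can be summarized with commonly cited ballpark latencies (orders of magnitude only, not measurements from any specific machine):

```python
# Approximate access latencies in nanoseconds (ballpark figures for
# illustration; exact values vary widely by hardware).
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 5,
    "main memory (RAM)": 100,
    "SSD read": 100_000,               # ~0.1 ms
    "spinning disk seek": 10_000_000,  # ~10 ms
}

# Each level is larger but slower than the one before it,
# usually by one or more orders of magnitude.
levels = list(LATENCY_NS.values())
assert all(a < b for a, b in zip(levels, levels[1:]))
```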
In distributed data processing we need a mechanism for transferring data between computers, preferably one that is hidden from the programmer. In Spark this abstraction is the Resilient Distributed Dataset (RDD).
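A toy model of the idea (not the real PySpark API, just a sketch of how a dataset split into partitions can expose `map` and `reduce` while hiding where each partition lives):

```python
from functools import reduce as _reduce

class ToyRDD:
    """Toy stand-in for Spark's RDD: data is split into partitions and
    transformed partition by partition, as if each sat on a separate machine."""

    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def parallelize(cls, data, num_partitions=4):
        # Round-robin split of the data across partitions.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Each partition could be mapped independently, in parallel.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def reduce(self, fn):
        # Reduce within each partition, then combine the partial results.
        partials = [_reduce(fn, part) for part in self.partitions if part]
        return _reduce(fn, partials)

rdd = ToyRDD.parallelize(list(range(1, 11)))
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
```

The programmer writes only the `map` and `reduce` functions; the partitioning, and in real Spark the shipping of partitions between machines, stays hidden.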
HDFS is the storage abstraction.
MapReduce is the compute abstraction that works well with HDFS.
Hadoop uses disk to distribute data; Spark uses memory to distribute data and is hence faster. Spark uses in-memory processing.
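The classic MapReduce example is word count. A self-contained sketch of the three phases, map, shuffle, and reduce, in plain Python (not the Hadoop API; the input lines are made up for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all counts belonging to the same word together.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster the map tasks run on the machines that already hold the data blocks (in HDFS), and only the shuffled pairs move over the network.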




