Week 1 & 2: The memory hierarchy

Something is happening around the world: the amount of data being collected is increasing enormously. You can keep increasing a computer's memory to process that data (1 GB, 10 GB, 128 GB), but at some point it runs out. Hence we set up compute clusters for parallel data processing.

Load, analyze, and model complex data, and make predictions, using PySpark.


Storage Latency

What is it that makes computation on very large data sets slow? Every computer has two parts: CPU and memory. In big data applications, the storage latency (the amount of time taken to read from and write to the disks) is much larger than the computational latency.

Much of big data optimization is about arranging CPU and memory (storage) in such a way as to minimize latency and cost.

With a given amount of money we can buy memory that is fast and small, or large and slow.

We want memory that is both fast and large on the same budget.

That is where cache memory comes into play.


Cache hit: the CPU wants to read a memory location that is present in the cache.

Cache miss: the CPU wants to read a memory location that is not present in the cache.

Temporal locality occurs when the program accesses the same memory location multiple times in quick succession.

Spatial locality occurs when the program reads memory locations that are close to each other.

For caching to be effective, the program needs to have either temporal locality or spatial locality.

Access locality means the ability of the software to make good use of the cache.

Memory is broken into pages; software that uses the same or neighboring pages repeatedly is said to have good access locality.

A linked list has poor spatial locality; an indexed array has good spatial locality.
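To see the difference locality makes, here is a minimal sketch (the array size is an arbitrary assumption, and absolute timings are machine-dependent) that reads the same NumPy array once in sequential order and once in a random order. Both passes visit every element exactly once and compute the same sum, but the random order defeats the cache:

```python
import time
import numpy as np

n = 2_000_000
a = np.arange(n, dtype=np.int64)

seq_idx = np.arange(n)                  # neighboring locations: good spatial locality
rng = np.random.default_rng(0)
rand_idx = rng.permutation(n)           # scattered locations: poor spatial locality

t0 = time.perf_counter()
s_seq = a[seq_idx].sum()
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
s_rand = a[rand_idx].sum()
t_rand = time.perf_counter() - t0

# Both orders visit every element once, so the sums agree.
assert s_seq == s_rand
print(f"sequential: {t_seq:.4f}s  random: {t_rand:.4f}s")
```

On most machines the random-order pass is noticeably slower, even though it does exactly the same arithmetic.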

NumPy arrays are, by default, stored in row-major order.



This means that scanning an array row by row has better access locality, and is faster, than scanning it column by column.

Traversing a NumPy array column by column takes more time than traversing it row by row. The effect grows with the number of elements in the array, and it is highly variable between runs.

Summing the elements of one column across all rows traverses the array column by column; summing all the elements of a particular row traverses it row by row.
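The row-versus-column effect can be checked with a short sketch (the 3000×3000 size is an arbitrary assumption; absolute timings will vary between machines and between runs):

```python
import time
import numpy as np

a = np.random.rand(3000, 3000)  # row-major (C order) by default

t0 = time.perf_counter()
row_sums = np.array([a[i, :].sum() for i in range(a.shape[0])])  # row by row
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
col_sums = np.array([a[:, j].sum() for j in range(a.shape[1])])  # column by column
t_cols = time.perf_counter() - t0

# Same total either way; only the memory access pattern differs.
assert np.isclose(row_sums.sum(), col_sums.sum())
print(f"row by row: {t_rows:.3f}s  column by column: {t_cols:.3f}s")
```

Because rows are contiguous in memory, each row slice pulls in neighboring cache lines, while each column slice hops across the array with a large stride.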



Memory latency is the time taken from when the CPU issues a read/write until it is completed.
Long tail is when the probability of getting extreme values is much higher than it would be under a normal distribution.
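As a rough illustration, the sketch below (the list size and sample count are arbitrary assumptions) times many individual memory reads and compares the median latency to the 99.9th percentile, which is where a long tail shows up:

```python
import time
import random

data = list(range(1_000_000))
idx = [random.randrange(len(data)) for _ in range(20_000)]

# Time each access individually so slow outliers are visible.
samples = []
for i in idx:
    t0 = time.perf_counter_ns()
    _ = data[i]
    samples.append(time.perf_counter_ns() - t0)

samples.sort()
p50 = samples[len(samples) // 2]          # median latency
p999 = samples[int(len(samples) * 0.999)]  # tail latency
print(f"median: {p50} ns  p99.9: {p999} ns")
```

Typically the tail latency is many times the median: most accesses hit the cache, but the occasional miss (or OS interruption) is far slower, which is exactly the long-tail behavior described above.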
The L3 cache is very important for big data analytical tasks.

The maximum clock rate is currently around 3 GHz, and that is not going to change in the near future.

So instead of scaling up a single computer, we move the data around, distributing it across machines for parallel and faster processing.


The memory hierarchy

Small and fast memory is nearest to the CPU

Large and slow memory is farthest from the CPU


In distributed data processing we need a mechanism for transferring data between computers, preferably one that is hidden from programmers. In Spark, this abstraction is known as the Resilient Distributed Dataset (RDD).

HDFS is the storage abstraction.

MapReduce is the compute abstraction that works well with HDFS.
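To make the abstraction concrete, here is a single-machine sketch of MapReduce-style word counting in plain Python (the sample documents are made up for illustration). The map, shuffle, and reduce phases mirror what the framework distributes across a cluster:

```python
from collections import defaultdict
from functools import reduce

docs = ["big data needs big machines", "spark keeps data in memory"]

# Map phase: each document independently emits (word, 1) pairs.
# Because documents are processed independently, this step parallelizes.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all values by key, so each key's values
# end up together (in a cluster, on the same machine).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: combine the values for each key.
counts = {key: reduce(lambda x, y: x + y, values)
          for key, values in groups.items()}

print(counts)
```

Each phase only needs local information (a document, or one key's values), which is what lets a framework run the same program unchanged on one machine or a thousand.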

Hadoop uses disk to distribute data. Spark uses memory to distribute data, and its in-memory processing makes it faster.

