Week 1 & 2: Preparing for Data Analysis

When you do statistical analysis, it is important to focus on the part of the data that is dense, i.e., where you have many records.

If you do not have a large number of records for a quantity you want to calculate, such as a variance or an expectation, the value you get will be unreliable.
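This can be illustrated with a quick simulation in plain Python (the data here is made up, drawn from a standard normal): the variance estimated from a handful of records swings wildly from sample to sample, while the estimate from many records is stable.

```python
import random
import statistics

random.seed(0)

def variance_estimates(sample_size, trials=200):
    """Estimate the variance of a standard normal `trials` times,
    each time from a fresh sample of `sample_size` records."""
    return [statistics.variance([random.gauss(0, 1) for _ in range(sample_size)])
            for _ in range(trials)]

# Spread (standard deviation) of the variance estimates themselves:
small = statistics.stdev(variance_estimates(5))     # few records per estimate
large = statistics.stdev(variance_estimates(500))   # many records per estimate

# The small-sample estimates fluctuate far more than the large-sample ones.
print(small, large)
```

The true variance is 1 in both cases; what differs is how far individual estimates stray from it.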

DataFrames are more flexible than RDDs: for example, you can add new columns to them.

For Azure HDInsight, the actual HDFS storage is the storage account associated with the cluster at creation time, so you can read a Parquet file from a storage account container directly into the cluster, as below:

parquet_name = 'wasb:///weather/weather'
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = 'PRCP'""" % parquet_name
print(query)
df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)

to_Pandas is no longer correct; use toPandas() to convert a Spark DataFrame to a pandas DataFrame.
We convert a Spark DataFrame to a pandas DataFrame because it is easier to work with; note that it resides entirely in the head node's memory.


Moving and De-Serialization
Data stored on disk is serialized: a flat sequence of bytes. Data in memory is held in data structures, i.e., in de-serialized form. Moving data between disk and memory therefore requires serializing and de-serializing it.
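In Python terms, a stdlib sketch of the round trip: pickle serializes an in-memory data structure into flat bytes suitable for disk, and deserializes those bytes back into a structure (the record contents are made up).

```python
import pickle

# An in-memory (de-serialized) data structure; values are hypothetical.
records = {"station": "USC00044534", "measurements": [0, 120, 35]}

# Serialize: flatten the structure into a byte sequence, as stored on disk.
blob = pickle.dumps(records)

# De-serialize: rebuild the in-memory data structure from the bytes.
restored = pickle.loads(blob)
```

Spark does the same kind of work under the hood when it shuffles data between nodes or reads it from storage.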
