Week 1 & 2: Preparing for Data Analysis
When you do statistical analysis it is important to work with the part of the data that is dense. If you do not have a large number of records for a quantity you want to calculate, such as a variance or an expectation, the value you get will be unreliable.

DataFrames are more flexible than RDDs: for example, you can add new columns to them.

On Azure HDInsight the actual HDFS storage is the storage account associated with the cluster at creation time, so you can read a Parquet file directly from a storage-account container into the cluster as below:

```python
parquet_name = 'wasb:///weather/weather'
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = "PRCP" """ % parquet_name
print(query)

df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)
```

`to_Pandas` is no longer correct; you should use `toPandas()` to convert a Spark DataFrame to a pandas DataFrame. We convert a Spark DataFrame to a pandas dat...
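To see why sparse slices of the data give unreliable statistics, here is a minimal stdlib-only sketch (the distribution, sample sizes, and trial count are illustrative assumptions, not from the course data). It repeatedly estimates the variance of a standard normal from small and from large samples, then compares how widely those estimates scatter:

```python
import random
import statistics

random.seed(0)

def variance_estimates(sample_size, trials=200):
    # For each trial, draw `sample_size` values from N(0, 1) (true
    # variance 1) and compute the sample variance of that draw.
    return [statistics.variance(
                [random.gauss(0, 1) for _ in range(sample_size)])
            for _ in range(trials)]

# Spread (standard deviation) of the variance estimates themselves:
small = statistics.stdev(variance_estimates(10))     # sparse: 10 records
large = statistics.stdev(variance_estimates(1000))   # dense: 1000 records

print(f"spread of estimates with n=10:   {small:.3f}")
print(f"spread of estimates with n=1000: {large:.3f}")
# Estimates from the small samples scatter far more widely around the
# true variance, which is why statistics computed on a sparse part of
# the data cannot be trusted.
```

The same effect applies to any per-group statistic you compute in Spark: groups with few records produce variance and mean estimates with large sampling error.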