Week 1 & 2: Preparing for Data Analysis

When you do statistical analysis, it is important to focus on the part of the data that is dense, i.e., where you have many records.

If you do not have a large number of records for a quantity you want to calculate, such as a variance or an expectation, the value you get will be unreliable.
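This can be illustrated with a quick simulation in plain Python (the data here is made up, drawn from a standard normal): the variance estimated from a handful of records swings wildly from sample to sample, while the estimate from many records is stable.

```python
import random
import statistics

random.seed(0)

def variance_estimates(sample_size, trials=200):
    """Estimate the variance of a standard normal `trials` times,
    each time from a fresh sample of `sample_size` records."""
    return [statistics.variance([random.gauss(0, 1) for _ in range(sample_size)])
            for _ in range(trials)]

# Spread (standard deviation) of the variance estimates themselves:
small = statistics.stdev(variance_estimates(5))     # few records per estimate
large = statistics.stdev(variance_estimates(500))   # many records per estimate

# The small-sample estimates fluctuate far more than the large-sample ones.
print(small, large)
```

The true variance is 1 in both cases; what differs is how far individual estimates stray from it.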

DataFrames are more flexible than RDDs: for example, you can add new columns to them.

For Azure HDInsight, the actual HDFS storage is the storage account associated with the cluster at creation time, so you can read a Parquet file from a storage account container directly into the cluster, as below:

parquet_name = 'wasb:///weather/weather'
query = """SELECT station, measurement, year
FROM parquet.`%s.parquet`
WHERE measurement = 'PRCP'""" % parquet_name
print(query)
df2 = sqlContext.sql(query)
# print('number of rows =', df2.count())
df2.show(5)

to_Pandas is no longer correct; use toPandas() to convert a Spark DataFrame to a pandas DataFrame.
We convert a Spark DataFrame to a pandas DataFrame because it is easier to work with; note that it resides entirely in the head node's memory.


Moving and De-Serialization
Data stored on disk is serialized: a flat sequence of bytes. Data in memory is held in data structures, i.e., in de-serialized form. Moving data between disk and memory therefore requires serializing and de-serializing it.
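In Python terms, a stdlib sketch of the round trip: pickle serializes an in-memory data structure into flat bytes suitable for disk, and deserializes those bytes back into a structure (the record contents are made up).

```python
import pickle

# An in-memory (de-serialized) data structure; values are hypothetical.
records = {"station": "USC00044534", "measurements": [0, 120, 35]}

# Serialize: flatten the structure into a byte sequence, as stored on disk.
blob = pickle.dumps(records)

# De-serialize: rebuild the in-memory data structure from the bytes.
restored = pickle.loads(blob)
```

Spark does the same kind of work under the hood when it shuffles data between nodes or reads it from storage.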
