Posts

Showing posts from October, 2021

Week 1 & 2: Preparing for Data Analysis

It is important when doing statistical analysis to work with the part of the data that is dense. If you do not have a large number of records for a value you want to calculate, such as a variance or an expectation, the result will be unreliable. DataFrames are more flexible than RDDs because you can add new columns to them. For Azure HDInsight the actual HDFS storage is the Storage account associated with the cluster at creation time, so you can read a Parquet file from a Storage account container directly into the cluster:

    parquet_name = 'wasb:///weather/weather'
    query = """SELECT station, measurement, year
               FROM parquet.`%s.parquet`
               WHERE measurement = "PRCP" """ % parquet_name
    print(query)
    df2 = sqlContext.sql(query)
    # print('number of rows =', df2.count())
    df2.show(5)

to_Pandas is no longer correct; use toPandas() to convert a Spark DataFrame to a pandas DataFrame. We convert a Spark DataFrame to a Pandas dat...
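The point about needing a dense part of the data can be illustrated with a small standard-library simulation (hypothetical numbers, not from the post): variance estimated from only a handful of records scatters far more widely around the true value than variance estimated from thousands of records.

```python
import random
import statistics

random.seed(0)
# Synthetic population with true variance of roughly 1.
population = [random.gauss(0, 1) for _ in range(100_000)]

# Re-estimate the variance 200 times from sparse slices (5 records)
# and from dense slices (5000 records) of the same population.
small_estimates = [statistics.variance(random.sample(population, 5))
                   for _ in range(200)]
large_estimates = [statistics.variance(random.sample(population, 5000))
                   for _ in range(200)]

# The spread of the estimates themselves shows how unreliable
# small-sample statistics are.
print("spread of estimates with    5 records:",
      statistics.stdev(small_estimates))
print("spread of estimates with 5000 records:",
      statistics.stdev(large_estimates))
```

The 5-record estimates vary wildly from run to run, while the 5000-record estimates stay close to the true variance, which is why the sparse parts of a dataset give unreliable answers.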

Week 1 & 2: Spark SQL and DataFrames

 DataFrames are a special type of RDD. A DataFrame holds two-dimensional data, like a spreadsheet: it has rows and named columns. The SQL context is like the SparkContext; it sits alongside it and handles DataFrames. df.printSchema() prints the schema of the DataFrame, that is, each column's name and type. A typical workflow is to take an RDD, transform it into a DataFrame, and then save it to a Parquet file.

Variants of join. There are four variants of join, which differ in how they treat keys that appear in one dataset but not the other. join is an inner join, which means that keys that appear in only one dataset are eliminated. leftOuterJoin keeps all keys from the left dataset even if they don't appear in the right dataset; the result of leftOuterJoin in our example will contain the keys John, Jill, Kate. rightOuterJoin keeps all keys from the right dataset even if they don't appear in the left dataset; the result of rightOuterJoin in our example will contain the keys...
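The join variants above can be sketched in plain Python without a Spark cluster. This is a minimal illustration of the semantics on key-value pairs, not the Spark API; the right-hand dataset (with the extra key Mark) is invented for the example, and for simplicity each key is assumed to appear at most once per dataset.

```python
def inner_join(left, right):
    # Spark's join: keep only keys present in BOTH datasets.
    r = dict(right)
    return [(k, (v, r[k])) for k, v in left if k in r]

def left_outer_join(left, right):
    # Keep every left key; missing right-side values become None.
    r = dict(right)
    return [(k, (v, r.get(k))) for k, v in left]

def right_outer_join(left, right):
    # Keep every right key; missing left-side values become None.
    l = dict(left)
    return [(k, (l.get(k), v)) for k, v in right]

# Hypothetical pair datasets in the spirit of the example.
ages   = [("John", 21), ("Jill", 30), ("Kate", 25)]          # left
cities = [("John", "NYC"), ("Kate", "LA"), ("Mark", "SF")]   # right

print(inner_join(ages, cities))        # keys John, Kate only
print(left_outer_join(ages, cities))   # keys John, Jill, Kate
print(right_outer_join(ages, cities))  # keys John, Kate, Mark
```

In PySpark the same calls exist directly on pair RDDs (rdd1.join(rdd2), rdd1.leftOuterJoin(rdd2), rdd1.rightOuterJoin(rdd2)), with the difference that Spark emits every matching value pair when a key is duplicated.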