Week 1 & 2: Spark SQL and DataFrames

DataFrames are a special type of RDD. They hold two-dimensional data, like a spreadsheet: they have rows and they have columns.

SQLContext is analogous to SparkContext: it is built on top of an existing SparkContext and handles DataFrames.

df.printSchema() prints the schema of the DataFrame, i.e. the name and type of each column.

Take an RDD, transform it into a DataFrame, and then save it as a Parquet file.

Variants of join.

There are four variants of join which differ in how they treat keys that appear in one dataset but not the other.

  • join is an inner join, which means that keys that appear in only one dataset are eliminated.
  • leftOuterJoin keeps all keys from the left dataset even if they don't appear in the right dataset. In our example, the result of leftOuterJoin contains the keys John, Jill, Kate.
  • rightOuterJoin keeps all keys from the right dataset even if they don't appear in the left dataset. In our example, the result of rightOuterJoin contains the keys Jill, Grace, John.
  • fullOuterJoin keeps all keys from both datasets. In our example, the result of fullOuterJoin contains the keys Jill, Grace, John, Kate.

In outer joins, if an element appears in only one dataset, the missing side of the (K, (V, W)) pair is represented by None.

Listing the contents of the S3 bucket from the AWS CLI:

C:\Users\Mohanish>aws s3 ls s3://dse-weather
                           PRE 256_STAT/
                           PRE NY.parquet/
                           PRE US_Weather_with_smoothed.parquet/
                           PRE US_stations.parquet/
                           PRE US_weather.parquet/
                           PRE info/
                           PRE weather.parquet/
2016-02-09 09:39:49 1511481989 ALL.csv.gz
2018-04-18 21:18:12        715 ALLBootstrap.sh
2018-04-18 21:18:13        539 MasterBootstrap.sh
2018-04-17 07:55:22        600 PrivateBootstrap.sh
2018-04-18 21:18:14        616 RunFromTerminal.sh
2018-04-19 05:36:22          0 US_Weather_with_smoothed.parquet_$folder$
2018-04-18 21:18:11        421 s3hook.sh


A new interface object called SparkSession was added in Spark 2.0. A SparkSession is initialized using a builder. For example:

spark = SparkSession.builder \
         .master("local") \
         .appName("Word Count") \
         .config("spark.some.config.option", "some-value") \
         .getOrCreate()

Using a SparkSession, a Parquet file is read as follows:

df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
