Week 1 & 2: Spark SQL and DataFrames
DataFrames are a special type of RDD. A DataFrame holds two-dimensional data, like a spreadsheet: it has rows and named, typed columns.
SQLContext is analogous to SparkContext: it is built on top of the SparkContext and is the entry point for working with DataFrames.
df.printSchema() prints the schema of the DataFrame, i.e. the name and type of each column.
Variants of join
There are four variants of join which differ in how they treat keys that appear in one dataset but not the other.
join is an inner join, which means that keys that appear in only one dataset are eliminated.
leftOuterJoin keeps all keys from the left dataset even if they don't appear in the right dataset. The result of leftOuterJoin in our example will contain the keys John, Jill, Kate.
rightOuterJoin keeps all keys from the right dataset even if they don't appear in the left dataset. The result of rightOuterJoin in our example will contain the keys Jill, Grace, John.
fullOuterJoin keeps all keys from both datasets. The result of fullOuterJoin in our example will contain the keys Jill, Grace, John, Kate.
In outer joins, if a key appears in only one dataset, the missing side of the (K, (V, W)) pair is represented by None.
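To make the four variants concrete, here is a plain-Python sketch that mimics Spark's pair-RDD join semantics on small lists (the keys John, Jill, Kate and Jill, Grace, John come from the example in the text; the numeric values are made up for illustration, and the helper function is hypothetical, not a Spark API):

```python
# Datasets as lists of (key, value) pairs, mirroring pair RDDs.
# Left has John, Jill, Kate; right has Jill, Grace, John (values invented).
left = [("John", 1), ("Jill", 2), ("Kate", 3)]
right = [("Jill", 10), ("Grace", 20), ("John", 30)]

def join(left, right, keep_left=False, keep_right=False):
    """Mimic RDD join semantics: inner by default; keep_left/keep_right
    turn it into a left/right/full outer join, pairing missing sides
    with None, as Spark does."""
    lkeys = {k for k, _ in left}
    rkeys = {k for k, _ in right}
    keys = lkeys & rkeys          # inner join keeps only shared keys
    if keep_left:
        keys |= lkeys             # keep every key from the left dataset
    if keep_right:
        keys |= rkeys             # keep every key from the right dataset
    out = []
    for k in keys:
        lvals = [v for key, v in left if key == k] or [None]
        rvals = [v for key, v in right if key == k] or [None]
        for v in lvals:
            for w in rvals:
                out.append((k, (v, w)))
    return out

inner = join(left, right)                         # keys: John, Jill
left_outer = join(left, right, keep_left=True)    # John, Jill, Kate
right_outer = join(left, right, keep_right=True)  # Jill, Grace, John
full_outer = join(left, right, keep_left=True, keep_right=True)
```

Note that in left_outer the pair for Kate is ("Kate", (3, None)): the right side is None because Kate never appears in the right dataset.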
Spark 2.0 added a new entry point called SparkSession. A SparkSession is initialized using a builder. For example:
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Using a SparkSession, a Parquet file is read as follows:
df = spark.read.parquet('python/test_support/sql/parquet_partitioned')