PySpark cheat sheet

1. To get the distinct values in a column of a PySpark dataframe:

jdf.select('State').distinct().show()


Here jdf is the dataframe and 'State' is one of its columns; the command above returns the distinct values in the 'State' column.

Reference: https://www.datasciencemadesimple.com/distinct-value-of-a-column-in-pyspark/


2. Uploading data or a pickle file from Azure Databricks to an Azure Storage account container using the Azure Storage SDK for Python

https://github.com/mohanish12/sparkNotebooks/blob/main/Upload_databricks_AzureStorage.ipynb?short_path=521a0c3
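The notebook linked above uses the Azure Storage SDK; a minimal sketch of the same idea, assuming the azure-storage-blob package, with all account, container, and blob names hypothetical. The SDK import is deferred inside the upload function so the pickling helper works even without the package installed:

```python
import pickle

def serialize(obj):
    # pickle the object to bytes so it can be uploaded as a single blob
    return pickle.dumps(obj)

def upload_pickle(obj, account_url, sas_token, container, blob_name):
    # requires the azure-storage-blob package; deferred import so the rest
    # of this sketch runs without it
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(account_url=account_url, credential=sas_token)
    blob = service.get_blob_client(container=container, blob=blob_name)
    blob.upload_blob(serialize(obj), overwrite=True)

# usage (hypothetical values):
# upload_pickle(my_model, "https://myaccount.blob.core.windows.net",
#               "<sas-token>", "weather", "model.pkl")
```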


3. Writing a dataframe as Parquet to an Azure Storage account container

blob_account_name = ""

blob_container_name = ""

# the SAS token only (the query string, starting with 'sp='), not the full 'https://...' blob URL
blob_sas_token = ""

blob_relative_path = "snwd_NJ3.parquet"

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)

spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)

print('Remote blob path: ' + wasbs_path)

jdf_nj.write.parquet(wasbs_path)

where jdf_nj is the dataframe being written.
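The wasbs path construction above is just string formatting; a small helper (hypothetical function name) makes the container@account layout explicit:

```python
def make_wasbs_path(container, account, relative_path):
    # wasbs URIs take the form wasbs://<container>@<account>.blob.core.windows.net/<path>
    return 'wasbs://%s@%s.blob.core.windows.net/%s' % (container, account, relative_path)

# illustrative values, not a real account
print(make_wasbs_path("weather", "myaccount", "snwd_NJ3.parquet"))
# wasbs://weather@myaccount.blob.core.windows.net/snwd_NJ3.parquet
```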



4. If, while reading a Parquet file with the dataframe .read method, you get the error "AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.'", the Parquet write may not have completed successfully. A successful Parquet write leaves a _SUCCESS marker file in the output directory, whereas a failed write leaves a _started file instead. To work around the error, delete the _started file left by the failed write and that fixes the issue.
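The workaround above can be sketched as a small check: verify the _SUCCESS marker exists before reading, and remove leftover _started files from a failed attempt. The directory layout below is simulated with plain files, not a real Spark output:

```python
import os

def parquet_write_succeeded(output_dir):
    # Spark drops an (empty) _SUCCESS marker file when the write commits
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))

def remove_started_markers(output_dir):
    # delete leftover _started* files from a failed/interrupted write so the
    # reader does not trip over them
    removed = []
    for name in os.listdir(output_dir):
        if name.startswith("_started"):
            os.remove(os.path.join(output_dir, name))
            removed.append(name)
    return removed
```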






