PySpark Cheat Sheet
1. Get the distinct values in a column of a PySpark DataFrame.
jdf.select('State').distinct().show()
Here jdf is the DataFrame and 'State' is one of its columns; the command above returns the distinct values in the State column.
Reference: https://www.datasciencemadesimple.com/distinct-value-of-a-column-in-pyspark/
2. Load data or a pickle file from Azure Databricks to an Azure Storage account container using the Azure Storage SDK for Python.
https://github.com/mohanish12/sparkNotebooks/blob/main/Upload_databricks_AzureStorage.ipynb?short_path=521a0c3
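The notebook linked above has the full walkthrough; as a minimal sketch, the flow is to pickle the object to bytes and hand those bytes to the Azure Storage SDK. The function and parameter names below are illustrative, not taken from the notebook, and the upload step assumes the azure-storage-blob package is installed.

```python
import pickle

def serialize(obj):
    # Pickle a Python object to bytes so it can be uploaded as a single blob.
    return pickle.dumps(obj)

def upload_pickled(obj, connection_string, container, blob_name):
    # Hypothetical helper: upload a pickled object with the Azure Storage SDK.
    # Requires the azure-storage-blob package (pip install azure-storage-blob).
    from azure.storage.blob import BlobServiceClient
    service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = service.get_blob_client(container=container, blob=blob_name)
    blob_client.upload_blob(serialize(obj), overwrite=True)
```

Calling upload_pickled(my_dict, conn_str, "mycontainer", "model.pkl") from a Databricks notebook would then place the pickled object in the container, assuming the connection string grants write access.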
3. Write a DataFrame as Parquet to an Azure Storage account container.
blob_account_name = ""  # storage account name
blob_container_name = ""  # container name
blob_sas_token = ""  # the SAS token only (the query string, e.g. starting with "?sp="), not the full blob URL
blob_relative_path = "snwd_NJ3.parquet"
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)
jdf_nj.write.parquet(wasbs_path)  # write to the blob path configured above, not a local path
where jdf_nj is the DataFrame.
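The path and config-key construction above is plain Python string formatting, so it can be sanity-checked without a Spark session. The account and container names below are placeholders, not real resources:

```python
# Build the wasbs:// URI and the Spark SAS config key from sample names
# ("myaccount" / "mycontainer" are placeholders for illustration only).
blob_account_name = "myaccount"
blob_container_name = "mycontainer"
blob_relative_path = "snwd_NJ3.parquet"

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
    blob_container_name, blob_account_name, blob_relative_path)
conf_key = 'fs.azure.sas.%s.%s.blob.core.windows.net' % (
    blob_container_name, blob_account_name)

print(wasbs_path)  # wasbs://mycontainer@myaccount.blob.core.windows.net/snwd_NJ3.parquet
print(conf_key)    # fs.azure.sas.mycontainer.myaccount.blob.core.windows.net
```

Note that the container name comes first in the wasbs:// authority (container@account), while in the config key it also precedes the account name; swapping the two is an easy mistake that produces an authentication error.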
4. When reading a Parquet file with the DataFrame read method, you may get the error "AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.'" This can happen when the Parquet write did not complete successfully. A successful write leaves a marker file named _SUCCESS inside the Parquet directory; in the failed cases I have seen, the first file was instead named _Started. To work around the error, delete the _Started file and re-run the read.
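Since Spark drops a _SUCCESS marker file into the output directory on a successful write, you can check for it before attempting the read. A minimal sketch, with a temporary directory standing in for the Parquet output folder (the helper name is illustrative):

```python
import pathlib
import tempfile

def parquet_write_succeeded(parquet_dir):
    # True if the Parquet directory contains Spark's _SUCCESS marker file.
    return (pathlib.Path(parquet_dir) / "_SUCCESS").exists()

# Demo: an empty directory looks like an incomplete write until the
# marker file (which Spark creates on success) appears.
with tempfile.TemporaryDirectory() as d:
    print(parquet_write_succeeded(d))          # False: no marker yet
    (pathlib.Path(d) / "_SUCCESS").touch()     # simulate a completed Spark write
    print(parquet_write_succeeded(d))          # True
```

On Databricks the same check can be done against DBFS paths with dbutils.fs.ls, but the idea is identical: treat a directory without _SUCCESS as an incomplete write.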