PySpark Intro
- Create a SparkContext to connect to the Spark cluster
- Create a SparkSession (retrieve it via its static builder method); see the sketch after this list
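A minimal sketch of the two bullets above, assuming a local PySpark install; the app name "intro" is illustrative:

```python
from pyspark.sql import SparkSession

# Static entry point: returns the existing session or creates a new one
spark = SparkSession.builder.appName("intro").getOrCreate()

# The underlying SparkContext, i.e. the connection to the Spark cluster
sc = spark.sparkContext
print(sc.master, spark.version)
```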
- Query tables (see the sketch after this list)
  - spark.catalog.listTables() lists the tables/views registered in the catalog
  - r = spark.sql("select * from foo") returns a Spark DataFrame
  - r.show(), r.count()
  - pd_r = r.toPandas()  # pandas DataFrame
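A sketch of the query workflow above, assuming a table or view named "foo" is already registered in the session's catalog (the name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.catalog.listTables())      # tables/views visible to this session

r = spark.sql("select * from foo")     # returns a Spark DataFrame
r.show()                               # prints the first rows
print(r.count())                       # row count (triggers a job)

pd_r = r.toPandas()                    # collects the result into a pandas DataFrame on the driver
```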
- Load Data (see the sketch after this list)
  - pd_temp = pd.DataFrame(np.random.random(10))
  - spark_temp = spark.createDataFrame(pd_temp)
  - spark_temp.createOrReplaceTempView("tableName")  ## puts it into the session's catalog as a temp view
  - data = spark.read.csv(filepath, header=True)  ## push it into the catalog the same way: data.createOrReplaceTempView(...)
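A sketch of both load paths, assuming pandas and numpy are installed; "flights.csv" and the view names are illustrative. Registering the CSV-backed DataFrame in the catalog works the same way as for the pandas-backed one:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas -> Spark, then register as a temp view so SQL can see it
pd_temp = pd.DataFrame(np.random.random(10))
spark_temp = spark.createDataFrame(pd_temp)
spark_temp.createOrReplaceTempView("temp_table")

# CSV -> Spark DataFrame; header=True uses the first line as column names
data = spark.read.csv("flights.csv", header=True)
data.createOrReplaceTempView("flights")

print(spark.catalog.listTables())      # both temp views now appear
```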
- Modify Columns (see the sketch after this list)
  - df = spark.table("flights")
  - df = df.withColumn("mins", df.hour * 60)  # withColumn returns a new DataFrame, so reassign it
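A minimal sketch, assuming the "flights" view registered above has an hour column (column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("flights")                  # DataFrame backed by the catalog table/view
df = df.withColumn("mins", df.hour * 60)     # returns a new DataFrame with the extra column
df.select("hour", "mins").show(5)
```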