PySpark Intro
- Create a SparkContext to connect to the Spark cluster
- Create a SparkSession (retrieve it via its static builder method); see the sketch after this list
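A minimal sketch of the two bullets above, assuming a local PySpark install; the app name "intro" is illustrative:

```python
from pyspark.sql import SparkSession

# Static entry point: returns the existing session or creates a new one
spark = SparkSession.builder.appName("intro").getOrCreate()

# The underlying SparkContext, i.e. the connection to the Spark cluster
sc = spark.sparkContext
print(sc.master, spark.version)
```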
- Query tables (see the sketch after this list)
  - spark.catalog.listTables() lists the tables/views registered in the catalog
  - r = spark.sql("select * from foo") returns a Spark DataFrame
  - r.show(), r.count()
  - pd_r = r.toPandas()  # pandas DataFrame
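A sketch of the query workflow above, assuming a table or view named "foo" is already registered in the session's catalog (the name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.catalog.listTables())      # tables/views visible to this session

r = spark.sql("select * from foo")     # returns a Spark DataFrame
r.show()                               # prints the first rows
print(r.count())                       # row count (triggers a job)

pd_r = r.toPandas()                    # collects the result into a pandas DataFrame on the driver
```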
- Load Data (see the sketch after this list)
  - pd_temp = pd.DataFrame(np.random.random(10))
  - spark_temp = spark.createDataFrame(pd_temp)
  - spark_temp.createOrReplaceTempView("tableName")  ## puts it into the session's catalog as a temp view
  - data = spark.read.csv(filepath, header=True)  ## push it into the catalog the same way: data.createOrReplaceTempView(...)
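A sketch of both load paths, assuming pandas and numpy are installed; "flights.csv" and the view names are illustrative. Registering the CSV-backed DataFrame in the catalog works the same way as for the pandas-backed one:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas -> Spark, then register as a temp view so SQL can see it
pd_temp = pd.DataFrame(np.random.random(10))
spark_temp = spark.createDataFrame(pd_temp)
spark_temp.createOrReplaceTempView("temp_table")

# CSV -> Spark DataFrame; header=True uses the first line as column names
data = spark.read.csv("flights.csv", header=True)
data.createOrReplaceTempView("flights")

print(spark.catalog.listTables())      # both temp views now appear
```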
- Modify Columns (see the sketch after this list)
  - df = spark.table("flights")
  - df = df.withColumn("mins", df.hour * 60)  # withColumn returns a new DataFrame, so reassign it
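A minimal sketch, assuming the "flights" view registered above has an hour column (column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("flights")                  # DataFrame backed by the catalog table/view
df = df.withColumn("mins", df.hour * 60)     # returns a new DataFrame with the extra column
df.select("hour", "mins").show(5)
```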