
Spark, R, and Databricks Notes

Here are some notes on using R on Databricks and making use of the underlying Spark tooling.

Contents

  • SparkR vs sparklyr
  • sparklyr function types
  • Connect to Snowflake
  • Error messages
  • tidymodels and sparklyr
  • User-defined functions
  • Inserting data to SQL
  • Spark is slow
  • Balanced data
  • Resources

SparkR vs sparklyr

SparkR (https://spark.apache.org/docs/latest/api/R/index.html) and sparklyr (https://spark.posit.co/) are both R packages for interacting with the Spark API. However, SparkR is deprecated in Databricks Runtime 16.0 and above, and sparklyr is recommended instead.

SparkR in Databricks is deprecated in Databricks Runtime 16.0 and above. Databricks recommends using sparklyr instead.

Source: https://docs.databricks.com/aws/en/sparkr/overview

One advantage of using sparklyr is that it is maintained by Posit, interfaces with common tidyverse tools such as dplyr, broom, and DBI, and has excellent documentation.
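
As a minimal sketch of what that looks like on a Databricks cluster (assuming a notebook attached to a running cluster), sparklyr connects to the existing Spark session and dplyr verbs then run against Spark:

library(sparklyr)
library(dplyr)

# In a Databricks notebook, sparklyr attaches to the cluster's Spark session
sc <- spark_connect(method = "databricks")

# Copy a small local data frame to Spark and query it with dplyr verbs
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

iris_tbl %>%
  count(Species) %>%
  collect()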

sparklyr function types

sparklyr provides access to Spark MLlib functionality through three families of functions (a short sketch follows the list):

  • ml_* – machine learning algorithms, e.g. ml_linear_regression
  • ft_* – feature transformers, e.g. ft_string_indexer
  • sdf_* – data frame manipulations, e.g. sdf_random_split
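
A minimal sketch tying the three families together, assuming an existing connection sc (e.g. from the connection example above):

library(sparklyr)

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# sdf_*: split the Spark DataFrame into training and test sets
splits <- sdf_random_split(iris_tbl, training = 0.75, test = 0.25, seed = 42)

# ft_*: turn the string Species column into a numeric index column
training <- ft_string_indexer(splits$training, "Species", "species_idx")

# ml_*: fit a model on the transformed training data
model <- ml_linear_regression(training, Petal_Width ~ Petal_Length + species_idx)

summary(model)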

Connect to Snowflake

Read from Snowflake

library(sparklyr)

# Read Snowflake credentials from a Databricks secret scope
# (replace "NAME" and "VALUE" with your secret scope and key names)
auth <- list(
  source = "snowflake",
  sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfDatabase = "DB_NAME",
  sfSchema = "SCHEMA",
  sfWarehouse = "WAREHOUSE"
)

# Connect to Spark on the Databricks cluster
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)

# Get table based on table name
sf_df <- sparklyr::spark_read_source(
  sc = sc,
  source = "snowflake",
  name = "df", # Name for table in Spark backend
  options = c(auth, dbtable = "YOUR_IMPORTANT_TABLE")
)

# Get table based on SQL query
sql <- "SELECT * FROM YOUR_IMPORTANT_TABLE"
sf_df <- sparklyr::spark_read_source(
  sc = sc, 
  source = "snowflake",
  name = "df", # Name for table in Spark backend
  options = c(auth, query = sql)
)

Inspired by: https://gist.github.com/sllynn/a2474fb4dfd2bde234c3dda6d07d1229
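
Once read in, sf_df is a lazy reference to the table in Spark, so dplyr verbs can be used on it and collect() pulls the result into R. A small sketch (SOME_COLUMN is a hypothetical column name; replace it with one from your table):

library(dplyr)

sf_df %>%
  filter(!is.na(SOME_COLUMN)) %>%
  count(SOME_COLUMN) %>%
  collect()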

Write to Snowflake

library(sparklyr)

# Same Snowflake credentials and connection as in the read example above
auth <- list(
  source = "snowflake",
  sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
  sfDatabase = "DB_NAME",
  sfSchema = "SCHEMA",
  sfWarehouse = "WAREHOUSE"
)
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)

iris_tbl <- sparklyr::copy_to(sc, iris)

sparklyr::spark_write_source(
  x = iris_tbl,
  source = "snowflake",
  mode = "overwrite",
  options = c(auth, dbtable = "NAME_OF_TABLE_YOU_WANT_ON_SNOWFLAKE")
)

Error messages

Run these commands to get simpler error messages and to see the most recent Spark error.

options("sparklyr.simple.errors" = TRUE)
sparklyr::spark_last_error()

tidymodels and sparklyr

Full support for using tidymodels with sparklyr is still a work in progress. In particular, preprocessing with workflows and recipes is challenging.

So the support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or putting those building blocks together in workflows.

https://stackoverflow.com/a/68324650/2468369

Some of this can be replicated with the ml_* and ft_* functions in sparklyr, but the full gamut of recipes steps may not be available.
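
As a hedged sketch of the part that does work, a parsnip model specification with the "spark" engine can be fit directly on a Spark DataFrame (assuming an existing connection sc); recipes-style preprocessing would instead be done with sparklyr's ft_* functions:

library(sparklyr)
library(parsnip)

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# The "spark" engine dispatches to sparklyr's ml_* functions under the hood
# (here, ml_linear_regression())
spec <- linear_reg() %>%
  set_engine("spark")

spark_fit <- fit(spec, Petal_Width ~ Petal_Length + Sepal_Length, data = iris_tbl)
spark_fit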

User-defined functions

User-defined functions (UDFs) are a way to use Spark to run arbitrary R functions. Spark takes the function and splits the work across its workers, running it on each chunk of data. The work can also be split by group, so that each group is computed on separately.

Although this can solve some problems, it is not a panacea. Embarrassingly parallel problems work best with this approach.

More: https://posit.co/blog/databricks-udfs/
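
A minimal sketch with sparklyr's spark_apply(), assuming an existing connection sc; with group_by, each group's rows are handed to the R function as a plain data frame on a worker:

library(sparklyr)

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Run an arbitrary R function once per group; the grouping column is carried
# through to the result. Keep the function self-contained since it runs on
# the workers, not the driver.
spark_apply(
  iris_tbl,
  function(df) {
    data.frame(mean_petal_length = mean(df$Petal_Length))
  },
  group_by = "Species"
)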

Inserting data to SQL

For anyone else who struggles with this issue - it relates to how memory is handled for R dataframes in databricks clusters. To work around it, I have found two options so far:

  1. Convert your df to a partitioned spark dataframe prior to insert (note, you may still need to increase your cluster driver)

spark_df_for_insert <- createDataFrame(r_df, numPartitions=150)

  2. Stop using R dataframes and switch to spark dataframes. This means you will need to change your code and a package like sparklyr will certainly come in handy.

I hope that helps somebody.

Source: https://stackoverflow.com/a/74434523/2468369
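
A rough sparklyr equivalent of option 1 (r_df is a hypothetical local R data frame and the repartition value is illustrative): copy the data to Spark with explicit partitions, then write it out from Spark.

library(sparklyr)

# Copy the local R data frame to Spark, splitting it into many partitions
spark_df_for_insert <- copy_to(sc, r_df, name = "df_for_insert",
                               repartition = 150, overwrite = TRUE)

# Write from Spark to the target table (adjust the sink to your setup)
spark_write_table(spark_df_for_insert, name = "target_table", mode = "append")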

Spark is slow

There are some instances where Spark may appear slow. This is especially true when you call collect() on a Spark DataFrame to bring it into an R data frame.

This operation can be slow because Spark is lazy: it does not actually compute the results of the operations you specify until they are needed. When you convert from Spark to R, Spark must first apply all of the pending operations and then fetch the data from across its partitions, both of which can be expensive.
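
A small illustration of the laziness, assuming an existing connection sc:

library(sparklyr)
library(dplyr)

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Lazy: this only builds a query plan; nothing is computed yet
summary_tbl <- iris_tbl %>%
  group_by(Species) %>%
  summarise(avg_petal_length = mean(Petal_Length, na.rm = TRUE))

# Eager: collect() forces all of the deferred work plus the transfer to R,
# so this is where the time is actually spent
summary_df <- collect(summary_tbl)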

References:

Balanced data

Example: stratified sampling by Species, invoking the underlying DataFrameStatFunctions sampleBy() method to keep 20% of each class:

# Assumes iris_tbl has been copied to Spark (see above)
sparklyr::spark_dataframe(iris_tbl) %>%
  sparklyr::invoke("stat") %>%
  sparklyr::invoke(
    "sampleBy",
    "Species",
    fractions = as.environment(list(
      "setosa" = 0.2,
      "versicolor" = 0.2,
      "virginica" = 0.2
    )),
    seed = 1L
  ) %>%
  sparklyr::sdf_register()

Source: https://stackoverflow.com/a/59725431/2468369

Resources
