Here are some notes on using R on Databricks and making use of the underlying Spark tooling.
Contents
- SparkR vs sparklyr
- sparklyr function types
- Connect to Snowflake
- Error messages
- tidymodels and sparklyr
- User-defined functions
- Inserting data to SQL
- Spark is slow
- Balanced data
- Resources
SparkR (https://spark.apache.org/docs/latest/api/R/index.html) and sparklyr (https://spark.posit.co/) are both R packages that interact with the Spark API, but SparkR is deprecated in Databricks Runtime 16.0 and above, where sparklyr is the recommended option.
SparkR in Databricks is deprecated in Databricks Runtime 16.0 and above. Databricks recommends using sparklyr instead.
Source: https://docs.databricks.com/aws/en/sparkr/overview
One advantage of using sparklyr is that it is maintained by Posit, interfaces with common tidyverse and database tools such as dplyr, broom, and DBI, and has excellent documentation.
sparklyr exposes Spark MLlib functionality through three families of functions (a short sketch combining them follows this list):
- ml_* – machine learning algorithms, e.g. ml_linear_regression
- ft_* – feature transformers, e.g. ft_string_indexer
- sdf_* – Spark data frame manipulations, e.g. sdf_random_split
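A minimal sketch combining all three families, assuming an existing Spark connection sc (the column and model choices are illustrative only):

library(sparklyr)

iris_tbl <- sparklyr::copy_to(sc, iris, overwrite = TRUE)

# ft_*: feature transformer - encode Species as a numeric index
indexed_tbl <- sparklyr::ft_string_indexer(
  iris_tbl,
  input_col = "Species",
  output_col = "species_idx"
)

# sdf_*: data frame manipulation - split into training and test sets
splits <- sparklyr::sdf_random_split(indexed_tbl, training = 0.8, test = 0.2, seed = 42)

# ml_*: machine learning algorithm - fit a linear regression on the training set
fit <- sparklyr::ml_linear_regression(
  splits$training,
  Petal_Length ~ Petal_Width + species_idx
)

summary(fit)

To read data from Snowflake instead of copying a local data frame, connect and use spark_read_source():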
library(sparklyr)
auth <- list(
source = "snowflake",
sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfDatabase = "DB_NAME",
sfSchema = "SCHEMA",
sfWarehouse = "WAREHOUSE"
)
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)
# Get table based on table name
sf_df <- sparklyr::spark_read_source(
sc = sc,
source = "snowflake",
name = "df", # Name for table in Spark backend
options = c(auth, dbtable = "YOUR_IMPORTANT_TABLE")
)
# Get table based on SQL query
sql <- "SELECT * FROM YOUR_IMPORTANT_TABLE"
sf_df <- sparklyr::spark_read_source(
sc = sc,
source = "snowflake",
name = "df", # Name for table in Spark backend
options = c(auth, query = sql)
)
Inspired by: https://gist.github.com/sllynn/a2474fb4dfd2bde234c3dda6d07d1229
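Once read in, sf_df is a regular sparklyr table reference, so it can be queried lazily with dplyr verbs (a small sketch; SOME_COLUMN is a placeholder for a column in your table):

library(dplyr)

# The filter and count are translated to Spark SQL; only the small summary is collected into R
sf_df %>%
  filter(SOME_COLUMN > 100) %>%
  count() %>%
  collect()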
Writing a table back to Snowflake works the same way, using spark_write_source():
library(sparklyr)
auth <- list(
source = "snowflake",
sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfDatabase = "DB_NAME",
sfSchema = "SCHEMA",
sfWarehouse = "WAREHOUSE"
)
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)
iris_tbl <- sparklyr::copy_to(sc, iris)
sparklyr::spark_write_source(
x = iris_tbl,
source = "snowflake",
mode = "overwrite",
options = c(auth, dbtable = "NAME_OF_TABLE_YOU_WANT_ON_SNOWFLAKE")
)
Run these commands to enable simpler error messages and to see the most recent error raised by Spark.
# Simplify sparklyr's error output
options("sparklyr.simple.errors" = TRUE)
# Print the last error from the Spark backend
sparklyr::spark_last_error()
Full support for tidymodels with sparklyr is still being worked on. In particular, preprocessing steps that use workflows and recipes are challenging.
So the support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or putting those building blocks together in workflows.
https://stackoverflow.com/a/68324650/2468369
Some of this can be replicated with the ft_* and ml_* functions in sparklyr, but the full gamut of steps available in recipes may not be.
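For example, a recipes-style preprocessing-plus-model workflow can be approximated with sparklyr's ft_* and ml_* functions inside a Spark ML pipeline. A sketch, assuming an existing connection sc and the iris_tbl reference from earlier (the columns and model are illustrative):

library(sparklyr)

# Build a Spark ML pipeline: index the label, assemble features, then add a model
pipeline <- sparklyr::ml_pipeline(sc) %>%
  sparklyr::ft_string_indexer(input_col = "Species", output_col = "label") %>%
  sparklyr::ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  ) %>%
  sparklyr::ml_logistic_regression()

# Fit the whole pipeline on the Spark table and use it for predictions
fitted <- sparklyr::ml_fit(pipeline, iris_tbl)
predictions <- sparklyr::ml_transform(fitted, iris_tbl)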
User-defined functions (UDFs) are a way to run arbitrary R functions with Spark: Spark takes the function call and splits the work across its workers, and the computation can be partitioned by group so that each group is processed separately.
Although this can solve some problems, it is not a panacea; embarrassingly parallel problems are the best fit for this method.
More: https://posit.co/blog/databricks-udfs/
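In sparklyr the usual entry point for this is spark_apply(), which ships an R function to the workers; the group_by argument runs the function once per group. A sketch, reusing the iris_tbl reference from earlier:

library(sparklyr)

# Fit a separate linear model per Species, one group per worker task
per_group_fit <- sparklyr::spark_apply(
  iris_tbl,
  function(df) {
    # df is an ordinary R data frame holding one group's rows
    model <- lm(Petal_Length ~ Petal_Width, data = df)
    data.frame(slope = coef(model)[["Petal_Width"]])
  },
  group_by = "Species"
)

per_group_fit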
When inserting a large R data frame into a database from Databricks, you can run into driver memory errors. This Stack Overflow answer describes two workarounds:
For anyone else who struggles with this issue - it relates to how memory is handled for R dataframes in databricks clusters. To work around it, I have found two options so far:
- Convert your df to a partitioned spark dataframe prior to insert (note, you may still need to increase your cluster driver)
spark_df_for_insert <- createDataFrame(r_df, numPartitions=150)
- Stop using R dataframes and switch to spark dataframes. This means you will need to change your code and a package like sparklyr will certainly come in handy.
I hope that helps somebody.
Source: https://stackoverflow.com/a/74434523/2468369
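A sparklyr version of the first workaround is to copy the R data frame into Spark with an explicit number of partitions before writing it out. A sketch, where r_df, the partition count, and the table name are placeholders:

library(sparklyr)

# Copy the R data frame into Spark, split across 150 partitions
spark_df_for_insert <- sparklyr::sdf_copy_to(
  sc,
  r_df,
  name = "df_for_insert",
  repartition = 150,
  overwrite = TRUE
)

# Then write it out, e.g. with spark_write_source() as in the Snowflake example above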
There are some instances where Spark may appear slow. This is especially true when you use collect() on a Spark DataFrame to bring it into an R data frame.
This operation can be slow because Spark is lazy: it does not actually execute the operations you specify until the results are needed. When you collect() into R, Spark must first apply those queued operations and then fetch the data from across its partitions, which can be expensive.
References:
- https://stackoverflow.com/questions/39576740/why-is-collect-in-sparkr-so-slow
- https://dsnotes.com/r-read-hdfs/
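A common way to avoid a slow collect() is to do the filtering or aggregation in Spark and collect only the small result. A sketch, using the iris_tbl reference from earlier:

library(dplyr)

# Slow: pulls every row across the partitions into R
# all_rows <- collect(iris_tbl)

# Faster: aggregate in Spark, then collect just the per-group summary
species_means <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
  collect()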
To keep groups balanced when sampling, use Spark's sampleBy() via sparklyr::invoke() to take a stratified sample (here 20% of each Species); unequal fractions can be used to down-sample over-represented groups. Example:
spark_dataframe(iris_tbl) %>%
  # Access the stat functions of the underlying Spark DataFrame
  sparklyr::invoke("stat") %>%
  sparklyr::invoke(
    "sampleBy",
    "Species",  # column to stratify on
    fractions = as.environment(list(  # sampling fraction per group
      "setosa" = 0.2,
      "versicolor" = 0.2,
      "virginica" = 0.2
    )),
    seed = 1L
  ) %>%
  # Register the result as a sparklyr table reference
  sparklyr::sdf_register()
Source: https://stackoverflow.com/a/59725431/2468369
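To check that the sample came out as intended, count the groups in Spark and collect only the summary (balanced_tbl here is a hypothetical name for the result of the sdf_register() call above):

library(dplyr)

balanced_tbl %>%
  count(Species) %>%
  collect()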
- Quick reference cheatsheet https://spark.posit.co/images/homepage/sparklyr.pdf
- Older quick reference sheet https://science.nu/wp-content/uploads/2018/07/r-sparklyr.pdf
- Mastering Spark with R https://therinspark.com/
- User Guides https://spark.posit.co/guides/ (e.g., ML pipelines https://spark.posit.co/guides/pipelines.html)
- Using Spark from R for Performance with Arbitrary Code https://sparkfromr.com/
- Official Databricks documentation
  - https://docs.databricks.com/aws/en/sparkr/
  - The R Developer's Guide to Databricks https://www.databricks.com/sites/default/files/2025-02/developers-guide-R.pdf