Here are some notes on using R on Databricks and making use of the underlying Spark tooling.
Contents
- SparkR vs sparklyr
- sparklyr function types
- Connect to Snowflake
- Error messages
- tidymodels and sparklyr
- User-defined functions
- Inserting data to SQL
- Spark is slow
- Balanced data
- Resources
SparkR (https://spark.apache.org/docs/latest/api/R/index.html) and sparklyr (https://spark.posit.co/) are both R packages that interact with the Spark API, but SparkR is deprecated in Databricks Runtime 16.0 and above, where sparklyr is the recommended option.
SparkR in Databricks is deprecated in Databricks Runtime 16.0 and above. Databricks recommends using sparklyr instead.
Source: https://docs.databricks.com/aws/en/sparkr/overview
One advantage of using sparklyr is that it is maintained by Posit, interfaces with common tidyverse and database tools such as dplyr, broom, and DBI, and has excellent documentation.
sparklyr exposes Spark MLlib functionality through three families of functions (a short sketch combining them follows this list):
- ml_* – machine learning algorithms, e.g. ml_linear_regression
- ft_* – feature transformers, e.g. ft_string_indexer
- sdf_* – Spark data frame manipulations, e.g. sdf_random_split
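A minimal sketch combining all three families, assuming an existing Spark connection sc (the column and model choices are illustrative only):

library(sparklyr)

iris_tbl <- sparklyr::copy_to(sc, iris, overwrite = TRUE)

# ft_*: feature transformer - encode Species as a numeric index
indexed_tbl <- sparklyr::ft_string_indexer(
  iris_tbl,
  input_col = "Species",
  output_col = "species_idx"
)

# sdf_*: data frame manipulation - split into training and test sets
splits <- sparklyr::sdf_random_split(indexed_tbl, training = 0.8, test = 0.2, seed = 42)

# ml_*: machine learning algorithm - fit a linear regression on the training set
fit <- sparklyr::ml_linear_regression(
  splits$training,
  Petal_Length ~ Petal_Width + species_idx
)

summary(fit)

To read data from Snowflake instead of copying a local data frame, connect and use spark_read_source():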
library(sparklyr)
auth <- list(
source = "snowflake",
sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfDatabase = "DB_NAME",
sfSchema = "SCHEMA",
sfWarehouse = "WAREHOUSE"
)
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)
# Get table based on table name
sf_df <- sparklyr::spark_read_source(
sc = sc,
source = "snowflake",
name = "df", # Name for table in Spark backend
options = c(auth, dbtable = "YOUR_IMPORTANT_TABLE")
)
# Get table based on SQL query
sql <- "SELECT * FROM YOUR_IMPORTANT_TABLE"
sf_df <- sparklyr::spark_read_source(
sc = sc,
source = "snowflake",
name = "df", # Name for table in Spark backend
options = c(auth, query = sql)
)
Inspired by: https://gist.github.com/sllynn/a2474fb4dfd2bde234c3dda6d07d1229
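Once read in, sf_df is a regular sparklyr table reference, so it can be queried lazily with dplyr verbs (a small sketch; SOME_COLUMN is a placeholder for a column in your table):

library(dplyr)

# The filter and count are translated to Spark SQL; only the small summary is collected into R
sf_df %>%
  filter(SOME_COLUMN > 100) %>%
  count() %>%
  collect()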
Writing a table back to Snowflake works the same way, using spark_write_source():
library(sparklyr)
auth <- list(
source = "snowflake",
sfUrl = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfUser = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfPassword = dbutils.secrets.get(scope = "NAME", key = "VALUE"),
sfDatabase = "DB_NAME",
sfSchema = "SCHEMA",
sfWarehouse = "WAREHOUSE"
)
sc <- sparklyr::spark_connect(master = "local", method = "databricks", config = auth)
iris_tbl <- sparklyr::copy_to(sc, iris)
sparklyr::spark_write_source(
x = iris_tbl,
source = "snowflake",
mode = "overwrite",
options = c(auth, dbtable = "NAME_OF_TABLE_YOU_WANT_ON_SNOWFLAKE")
)
Run these commands to enable simpler error messages and to see the most recent error raised by Spark.
# Simplify sparklyr's error output
options("sparklyr.simple.errors" = TRUE)
# Print the last error from the Spark backend
sparklyr::spark_last_error()
Full support for tidymodels with sparklyr is still being worked on. In particular, preprocessing steps that use workflows and recipes are challenging.
So the support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or putting those building blocks together in workflows.
https://stackoverflow.com/a/68324650/2468369
Some of this can be replicated with the ft_* and ml_* functions in sparklyr, but the full gamut of steps available in recipes may not be.
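For example, a recipes-style preprocessing-plus-model workflow can be approximated with sparklyr's ft_* and ml_* functions inside a Spark ML pipeline. A sketch, assuming an existing connection sc and the iris_tbl reference from earlier (the columns and model are illustrative):

library(sparklyr)

# Build a Spark ML pipeline: index the label, assemble features, then add a model
pipeline <- sparklyr::ml_pipeline(sc) %>%
  sparklyr::ft_string_indexer(input_col = "Species", output_col = "label") %>%
  sparklyr::ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  ) %>%
  sparklyr::ml_logistic_regression()

# Fit the whole pipeline on the Spark table and use it for predictions
fitted <- sparklyr::ml_fit(pipeline, iris_tbl)
predictions <- sparklyr::ml_transform(fitted, iris_tbl)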
User-defined functions (UDFs) are a way to run arbitrary R functions with Spark: Spark takes the function call and splits the work across its workers, and the computation can be partitioned by group so that each group is processed separately.
Although this can solve some problems, it is not a panacea; embarrassingly parallel problems are the best fit for this method.
More: https://posit.co/blog/databricks-udfs/
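In sparklyr the usual entry point for this is spark_apply(), which ships an R function to the workers; the group_by argument runs the function once per group. A sketch, reusing the iris_tbl reference from earlier:

library(sparklyr)

# Fit a separate linear model per Species, one group per worker task
per_group_fit <- sparklyr::spark_apply(
  iris_tbl,
  function(df) {
    # df is an ordinary R data frame holding one group's rows
    model <- lm(Petal_Length ~ Petal_Width, data = df)
    data.frame(slope = coef(model)[["Petal_Width"]])
  },
  group_by = "Species"
)

per_group_fit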
When inserting a large R data frame into a database from Databricks, you can run into driver memory errors. This Stack Overflow answer describes two workarounds:
For anyone else who struggles with this issue - it relates to how memory is handled for R dataframes in databricks clusters. To work around it, I have found two options so far:
- Convert your df to a partitioned spark dataframe prior to insert (note, you may still need to increase your cluster driver)
spark_df_for_insert <- createDataFrame(r_df, numPartitions=150)
- Stop using R dataframes and switch to spark dataframes. This means you will need to change your code and a package like sparklyr will certainly come in handy.
I hope that helps somebody.
Source: https://stackoverflow.com/a/74434523/2468369
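A sparklyr version of the first workaround is to copy the R data frame into Spark with an explicit number of partitions before writing it out. A sketch, where r_df, the partition count, and the table name are placeholders:

library(sparklyr)

# Copy the R data frame into Spark, split across 150 partitions
spark_df_for_insert <- sparklyr::sdf_copy_to(
  sc,
  r_df,
  name = "df_for_insert",
  repartition = 150,
  overwrite = TRUE
)

# Then write it out, e.g. with spark_write_source() as in the Snowflake example above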
There are some instances where Spark may appear slow. This is especially true when you use collect() on a Spark DataFrame to bring it into an R data frame.
This operation can be slow because Spark is lazy: it does not actually execute the operations you specify until the results are needed. When you collect() into R, Spark must first apply those queued operations and then fetch the data from across its partitions, which can be expensive.
References:
- https://stackoverflow.com/questions/39576740/why-is-collect-in-sparkr-so-slow
- https://dsnotes.com/r-read-hdfs/
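A common way to avoid a slow collect() is to do the filtering or aggregation in Spark and collect only the small result. A sketch, using the iris_tbl reference from earlier:

library(dplyr)

# Slow: pulls every row across the partitions into R
# all_rows <- collect(iris_tbl)

# Faster: aggregate in Spark, then collect just the per-group summary
species_means <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
  collect()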
To keep groups balanced when sampling, use Spark's sampleBy() via sparklyr::invoke() to take a stratified sample (here 20% of each Species); unequal fractions can be used to down-sample over-represented groups. Example:
spark_dataframe(iris_tbl) %>%
  # Access the stat functions of the underlying Spark DataFrame
  sparklyr::invoke("stat") %>%
  sparklyr::invoke(
    "sampleBy",
    "Species",  # column to stratify on
    fractions = as.environment(list(  # sampling fraction per group
      "setosa" = 0.2,
      "versicolor" = 0.2,
      "virginica" = 0.2
    )),
    seed = 1L
  ) %>%
  # Register the result as a sparklyr table reference
  sparklyr::sdf_register()
Source: https://stackoverflow.com/a/59725431/2468369
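To check that the sample came out as intended, count the groups in Spark and collect only the summary (balanced_tbl here is a hypothetical name for the result of the sdf_register() call above):

library(dplyr)

balanced_tbl %>%
  count(Species) %>%
  collect()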
- Quick reference cheatsheet https://spark.posit.co/images/homepage/sparklyr.pdf
- Older quick reference sheet https://science.nu/wp-content/uploads/2018/07/r-sparklyr.pdf
- Mastering Spark with R https://therinspark.com/
- User Guides https://spark.posit.co/guides/ (e.g., ML pipelines https://spark.posit.co/guides/pipelines.html)
- Using Spark from R for Performance with Arbitrary Code https://sparkfromr.com/
- Official Databricks documentation
  - https://docs.databricks.com/aws/en/sparkr/
  - The R Developer's Guide to Databricks https://www.databricks.com/sites/default/files/2025-02/developers-guide-R.pdf