base::all.equal() poisons numeric columns by recycling its <chr> message, which in turn makes otherwise row-wise operations like dplyr::filter() and dplyr::mutate() behave in an unexpected manner (i.e. every row evaluates to FALSE).
This issue arises if just one row has a numeric difference above the tolerance threshold.
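To make the mechanism concrete, here is a minimal base R sketch using the same three value pairs as the reprex further down (the variable names are only for illustration). all.equal() compares the two vectors as wholes and returns a single value, either TRUE or a character message, rather than one result per element.

```r
# Same value pairs as in the example tibble
x <- c(NA, 151, 1/3)
y <- c(NA, 151, 0.333)

res <- all.equal(x, y)

# One value for the whole comparison, not one per element
length(res)        # 1
is.character(res)  # TRUE: a message describing the difference, not FALSE

# isTRUE() turns that message into a single FALSE,
# which a data-frame verb then recycles to every row
isTRUE(res)        # FALSE
```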
For context, I wanted to use all.equal() to filter for rows where both columns have NA values, since the conditions dplyr::near(x, y) and x == y will drop these rows.
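As a quick illustration of why those conditions lose the NA rows, here is a base R sketch. Subsetting via which() stands in for filter(), which likewise keeps only rows where the condition is TRUE. Comparing NA with NA yields NA rather than TRUE.

```r
x <- c(NA, 151, 1/3)
y <- c(NA, 151, 0.333)

eq <- x == y
eq
# NA TRUE FALSE: the NA/NA pair is neither TRUE nor FALSE

# Keeping only positions where the condition is TRUE drops the NA pair
kept <- x[which(eq)]
kept
# 151

# Treating "both missing" as a match has to be spelled out explicitly
both_na <- is.na(x) & is.na(y)
both_na
# TRUE FALSE FALSE
```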
In an interactive/analysis setting, you could avoid this issue by dealing directly with NA values. However, it would not necessarily be immediately obvious WHY you are losing all rows, i.e. it wouldn't be obvious whether the bug comes from dplyr::filter, assertthat::are_equal, base::isTRUE or base::all.equal. Moreover, you might want to retain NA values through some initial data cleaning steps or when programming with dplyr (especially inside another package).
In any case, the workaround I (finally) found is incredibly convoluted, and seems to be at odds with tidyverse principles. On further thought, there might be other functions that exhibit similar behavior, but in this particular case the inconsistency between returning a <lgl> vector and a <chr> value that gets recycled is quite... hidden?
library(magrittr)
# tibble to pass through to are_equal(x,y)
values_df <- tibble::tribble(
~x, ~y,
NA, NA, ## TRUE
151, 151, ## TRUE
1/3, 0.333, ## FALSE
)
# workaround that isn't super convoluted
not_different <- function(x, y) {
  # TRUE where both values are NA, or where they are equal within near()'s tolerance
  (is.na(x) & is.na(y)) | dplyr::near(x, y)
}
values_df %>%
tidylog::filter(not_different(x,y))
#> filter: removed one row (33%), 2 rows remaining
#> # A tibble: 2 × 2
#> x y
#> <dbl> <dbl>
#> 1 NA NA
#> 2 151 151
NOTE: the illustration below uses assertthat::are_equal for convenience, but assertthat::are_equal() is just a wrapper for isTRUE(all.equal(x, y, ...)).
library(magrittr)
# tibble to pass through to assertthat::are_equal(x,y)
values_df <- tibble::tribble(
~x, ~y, ## EXPECT: are_equal(x, y) =
NA, NA, ## TRUE
151, 151, ## TRUE
1/3, 0.333 ## FALSE
)
# are_equal() in dplyr::filter()
## EXPECT: return 2 rows
#> [1] TRUE
## ACTUAL: returns NO rows
values_df %>%
tidylog::filter(assertthat::are_equal(x, y))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 2
#> # … with 2 variables: x <dbl>, y <dbl>
# dplyr::mutate() illustrates the inconsistency
## EXPECT: assertthat::are_equal(x,y) to evaluate x,y comparison row-wise
## ACTUAL: all.equal() recycles <chr> msg across mutated column, leading to isTRUE() returning FALSE
## in a sense, a single difference "poisons" the whole column.
values_df %>%
dplyr::mutate(`map2(are_equal)` = purrr::map2(x, y, ~ assertthat::are_equal(unlist(.x), unlist(.y)) ),
`are_equal` = assertthat::are_equal(x, y),
`all.equal` = base::all.equal(x, y),) %>%
tidyr::unnest(`map2(are_equal)`)
## NOTE: workaround involves purrr::map2, base::unlist, and tidyr::unnest
#> # A tibble: 3 × 5
#> x y `map2(are_equal)` are_equal all.equal
#> <dbl> <dbl> <lgl> <lgl> <chr>
#> 1 NA NA TRUE FALSE Mean relative difference: 0.001
#> 2 151 151 TRUE FALSE Mean relative difference: 0.001
#> 3 0.333 0.333 FALSE FALSE Mean relative difference: 0.001
Created on 2021-10-04 by the reprex package (v2.0.0)
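If the goal is only to get the row-wise result without map2(), unlist() and unnest(), one possible alternative is to call isTRUE(all.equal()) once per pair of elements with base vapply(), which returns a plain logical vector. This is a sketch rather than something from the thread: are_equal_rowwise is a hypothetical helper name, and it assumes x and y have the same length, as two columns of one data frame do.

```r
# Hypothetical helper: one isTRUE(all.equal()) call per pair of elements.
# Assumes x and y have the same length.
are_equal_rowwise <- function(x, y, ...) {
  vapply(
    seq_along(x),
    function(i) isTRUE(all.equal(x[[i]], y[[i]], ...)),
    logical(1)
  )
}

x <- c(NA, 151, 1/3)
y <- c(NA, 151, 0.333)

are_equal_rowwise(x, y)
# TRUE TRUE FALSE

# Extra arguments are passed on to all.equal(), e.g. a looser tolerance
are_equal_rowwise(x, y, tolerance = 0.01)
# TRUE TRUE TRUE
```

Because the result has one element per row, it could be used directly inside filter() or mutate() in place of are_equal(x, y). It is slower than a vectorised condition such as the is.na()/near() helper above, so it is mainly useful when all.equal()'s own comparison rules are wanted.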
Sounds like you got where you wanted to be in the end, but here's a workaround, maybe? Based on what Michael said on Twitter
Created on 2021-10-05 by the reprex package (v2.0.1)