cynthiahqy/all-equal-is-poison.md

Last active October 6, 2021 01:42

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/cynthiahqy/47eb87ca877ce82a4692e5dd8734ebca.js"></script>
Save cynthiahqy/47eb87ca877ce82a4692e5dd8734ebca to your computer and use it in GitHub Desktop.

Download ZIP

all-equal poisons row-wise operations

Raw

all-equal-is-poison.md

ISSUE

base::all.equal() poisons numeric columns by recycling the <chr> message, which in turn makes otherwise row-wise operations like dplyr::filter() and dplyr::mutate() behave in an unexpected manner (i.e. returning all FALSE).

This issue arises if just one row has a numeric difference above the tolerance threshold.

For context, I wanted to use all.equal() to filter for rows where both columns have NA values, since the conditions dplyr::near(x,y) and x == y will drop these rows

In an interactive/analysis setting, you could avoid this issue by dealing directly with NA values. However, it would not necessarily be immediately obvious WHY you are losing all rows -- i.e. it wouldnt be obvious if the bug comes from dplyr::filter, assertthat::are_equal, base::isTRUE or base::all.equal? Moreover, you might want to retain NA values for through some initial data cleaning steps or when programming with dplyr (especially inside another package).

In any case, the workaround I (finally) found is incredibly convoluted, and seems to be at odds with tidyverse principles. Though on further thought, there might be other functions that exhibit similar behavior, but in this particular case the inconsistency between returning a <lgl> vector and <chr> value that gets recycled is quite.. hidden?

SOLUTION

library(magrittr)

# tibble to pass through to are_equal(x,y)
values_df <- tibble::tribble(
  ~x, ~y,
  NA, NA,     ## TRUE
  151, 151,   ## TRUE
  1/3, 0.333,  ## FALSE 
) 

# workaround that isn't super convoluted
not_different <- function(x,y){
  bool <- ((is.na(x) & is.na(y)) | dplyr::near(x,y))
}

values_df %>%
  tidylog::filter(not_different(x,y))
#> filter: removed one row (33%), 2 rows remaining
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2   151   151

REPREX

NOTE: the illustration below uses assertthat::are_equal for convenience, but assertthat::are_equal() is just a wrapper for isTRUE(all.equal(x, y, ...))

library(magrittr)

# tibble to pass through to assertthat::are_equal(x,y)
values_df <- tibble::tribble(
  ~x, ~y,     ## EXPECT: are_equal(x, y) =
  NA, NA,     ## TRUE
  151, 151,   ## TRUE
  1/3, 0.333  ## FALSE 
) 

# are_equal() in dplyr::filter()
## EXPECT: return 2 rows
#> [1] TRUE
## ACTUAL: returns NO rows
values_df %>% 
  tidylog::filter(assertthat::are_equal(x, y))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 2
#> # … with 2 variables: x <dbl>, y <dbl>

# dplyr::mutate() illustrates the inconsistency
## EXPECT: assertthat::are_equal(x,y) to evaluate x,y comparison row-wise
## ACTUAL: all.equal() recycles <chr> msg across mutated column, leading to isTRUE() returning FALSE
## in a sense, a single difference "poisons" the whole column.
values_df %>%
  dplyr::mutate(`map2(are_equal)` = purrr::map2(x, y, ~ assertthat::are_equal(unlist(.x), unlist(.y)) ), 
                `are_equal` = assertthat::are_equal(x, y),
                `all.equal` = base::all.equal(x, y),) %>%
  tidyr::unnest(`map2(are_equal)`)
## NOTE: workaround involves purrr:map2, base::unlist, and tidyr::unnest
#> # A tibble: 3 × 5
#>         x       y `map2(are_equal)` are_equal all.equal                      
#>     <dbl>   <dbl> <lgl>             <lgl>     <chr>                          
#> 1  NA      NA     TRUE              FALSE     Mean relative difference: 0.001
#> 2 151     151     TRUE              FALSE     Mean relative difference: 0.001
#> 3   0.333   0.333 FALSE             FALSE     Mean relative difference: 0.001

^{Created on 2021-10-04 by the reprex package (v2.0.0)}

njtierney commented Oct 5, 2021

Sounds like you got where you wanted to be in the end, but here's a workaround, maybe? Based on what Michael said on Twitter

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)

# tibble to pass through to assertthat::are_equal(x,y)
values_df <- tribble(
  ~x, ~y,     ## EXPECT: are_equal(x, y) =
  NA, NA,     ## TRUE
  151, 151,   ## TRUE
  1/3, 0.333  ## FALSE 
) 

# the goal here is to remove things that aren't the same?

values_df %>% 
  filter((is.na(x) & is.na(y)) | x == y)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2   151   151

# alternatively
values_df %>% 
  # set the tolerance to 0.001
  filter((is.na(x) & is.na(y)) | x == y | near(x,y, tol = 0.001))
#> # A tibble: 3 × 2
#>         x       y
#>     <dbl>   <dbl>
#> 1  NA      NA    
#> 2 151     151    
#> 3   0.333   0.333

^{Created on 2021-10-05 by the reprex package (v2.0.1)}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.1 (2021-08-10)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Perth             
#>  date     2021-10-05                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
#>  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  pillar        1.6.3   2021-09-26 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.7.4   2021-08-25 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.0)
#>  tibble      * 3.1.5   2021-09-30 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.26    2021-09-14 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

huizezhang-sherry commented Oct 5, 2021

Or you can use the rowwise() operation:

library(magrittr)
values_df <- tibble::tribble(
  ~x, ~y,     ## EXPECT: are_equal(x, y) =
  NA, NA,     ## TRUE
  151, 151,   ## TRUE
  1/3, 0.333  ## FALSE 
) 

values_df %>% 
  dplyr::rowwise() %>% 
  tidylog::filter(assertthat::are_equal(x, y))
#> filter: removed one row (33%), 2 rows remaining
#> # A tibble: 2 × 2
#> # Rowwise: 
#>       x     y
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2   151   151

^{Created on 2021-10-05 by the reprex package (v2.0.1)}

Author

cynthiahqy commented Oct 6, 2021

Oooh. Thanks @huizezhang-sherry & @njtierney! rowwise() really is much more elegant than what I came up with! I always wondered when I saw it in the dplyr docs what it does.

Though I wonder what the performance cost might be on larger datasets -- I had a look at the source code and it seems to basically add turn each row into a single group. I assume it then also requires an ungroup() statement afterwards.. and maybe an extra group_by() statement to regroup??

In any case, I've come up with the following based on everyone's input.

library(magrittr)

# tibble to pass through to are_equal(x,y)
values_df <- tibble::tribble(
  ~x, ~y,
  NA, NA,     ## TRUE
  151, 151,   ## TRUE
  1/3, 0.333,  ## FALSE 
) 

# workaround that isn't super convoluted
not_different <- function(x,y){
  bool <- ((is.na(x) & is.na(y)) | dplyr::near(x,y))
}

values_df %>%
  tidylog::filter(not_different(x,y))
#> filter: removed one row (33%), 2 rows remaining
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2   151   151

^{Created on 2021-10-06 by the reprex package (v2.0.0)}

Author

cynthiahqy commented Oct 6, 2021

also for completeness/future reference, the rest of the get NA,NA to return TRUE rabbit hole I went down.

base::identical()
==
dplyr::near()
base::all.equal() / assertthat::are_equal()

library(magrittr)
values_df <- tibble::tribble(
  ~x, ~y,
  NA, NA,      ## TRUE
  151, 151,    ## TRUE
  1/3, 0.333,  ## FOR TRUE: tol=0.001 in all.equal/near/are_equal
  300, 123     ## FALSE
) 

## {tidylog} wraps {dplyr} fncs and generates "removed __ rows" msg

## base::identical() is too strict!----
values_df %>%
  tidylog::filter(base::identical(x, y))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 2
#> # … with 2 variables: x <dbl>, y <dbl>
# notice 151 != 151 because encoding occurs to max decimals of the col.
#   i.e. x = 151.0000000 but x = 151.000
#   which comes from the difference in 1/3 and 0.333
dplyr::glimpse(values_df)
#> Rows: 4
#> Columns: 2
#> $ x <dbl> NA, 151.0000000, 0.3333333, 300.0000000
#> $ y <dbl> NA, 151.000, 0.333, 123.000

## `==` spits NA back! ----
values_df %>%
  tidylog::filter((x == y))
#> filter: removed 3 rows (75%), one row remaining
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   151   151
# notice that NA row is lost since (NA == NA) is NA not TRUE

## dplyr::near() also spits NA back! ----
# function (x, y, tol = .Machine$double.eps^0.5) 
# {
#     abs(x - y) < tol
# }
values_df %>%
  tidylog::filter(dplyr::near(x, y))
#> filter: removed 3 rows (75%), one row remaining
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   151   151
## notice that NA row is lost since (NA == NA) is NA not TRUE
## keeps 1/3 == 0.333 as well with adjusted tolerance
values_df %>%
  tidylog::filter(dplyr::near(x, y, tol=0.001))
#> filter: removed 2 rows (50%), 2 rows remaining
#> # A tibble: 2 × 2
#>         x       y
#>     <dbl>   <dbl>
#> 1 151     151    
#> 2   0.333   0.333

## base::all.equal() is not vectorised? ----
# returns <chr> when not equal so use near/are_equal to ensure filter works
# docs suggest using isTRUE(all.equal(...))
values_df %>%
  tidylog::filter(base::all.equal(x, y))
#> Error: Problem with `filter()` input `..1`.
#> ℹ Input `..1` is `base::all.equal(x, y)`.
#> x Input `..1` must be a logical vector, not a character.
# base::all.equal returns <lgl> with adjusted tolerance -- maybe also just recycling?
values_df %>%
  tidylog::filter(base::all.equal(x, y, tolerance=0.001))
#> Error: Problem with `filter()` input `..1`.
#> ℹ Input `..1` is `base::all.equal(x, y, tolerance = 0.001)`.
#> x Input `..1` must be a logical vector, not a character.

## assertthat::are_equal() ----
# function (x, y, ...) 
# {
#     isTRUE(all.equal(x, y, ...))
# }
values_df %>% 
  tidylog::filter(assertthat::are_equal(x, y))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 2
#> # … with 2 variables: x <dbl>, y <dbl>
## seems to be an issue with pass through of rows to isTRUE()
## since both direct input of NA & assigned NA return TRUE
isTRUE(all.equal(NA, NA))
#> [1] TRUE
a <- NA
b <- NA
isTRUE(all.equal(a, b))
#> [1] TRUE

(equality_df <- values_df %>%
  dplyr::mutate(`map2(are_equal)` = purrr::map2(x, y, ~ assertthat::are_equal(unlist(.x), unlist(.y)) ), 
                `all.equal` = base::all.equal(x, y),
                `are_equal` = assertthat::are_equal(x, y)) %>%
  tidyr::unnest(`map2(are_equal)`) %>%
  dplyr::mutate(`identical` = base::identical(x, y),
                `==` = (x == y),
                `near` = dplyr::near(x, y),
                `isTRUE(all.equal)` = base::isTRUE(all.equal(x,y))))
#> # A tibble: 4 × 9
#>         x       y `map2(are_equal)` all.equal    are_equal identical `==`  near 
#>     <dbl>   <dbl> <lgl>             <chr>        <lgl>     <lgl>     <lgl> <lgl>
#> 1  NA      NA     TRUE              Mean relati… FALSE     FALSE     NA    NA   
#> 2 151     151     TRUE              Mean relati… FALSE     FALSE     TRUE  TRUE 
#> 3   0.333   0.333 FALSE             Mean relati… FALSE     FALSE     FALSE FALSE
#> 4 300     123     FALSE             Mean relati… FALSE     FALSE     FALSE FALSE
#> # … with 1 more variable: isTRUE(all.equal) <lgl>

Aside on filtering using a <lgl> column...

## base::isTRUE doesn't like columns?
## rlang::is_true() also doesn't work?
# should return ONE rows, but returns none
equality_df %>%
  tidylog::filter(base::isTRUE(near))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 9
#> # … with 9 variables: x <dbl>, y <dbl>, map2(are_equal) <lgl>, all.equal <chr>,
#> #   are_equal <lgl>, identical <lgl>, == <lgl>, near <lgl>,
#> #   isTRUE(all.equal) <lgl>

equality_df %>%
  tidylog::filter(rlang::is_true(near))
#> filter: removed all rows (100%)
#> # A tibble: 0 × 9
#> # … with 9 variables: x <dbl>, y <dbl>, map2(are_equal) <lgl>, all.equal <chr>,
#> #   are_equal <lgl>, identical <lgl>, == <lgl>, near <lgl>,
#> #   isTRUE(all.equal) <lgl>

## at least `==` works!
equality_df %>%
  tidylog::filter(near == TRUE)
#> filter: removed 3 rows (75%), one row remaining
#> # A tibble: 1 × 9
#>       x     y `map2(are_equal)` all.equal        are_equal identical `==`  near 
#>   <dbl> <dbl> <lgl>             <chr>            <lgl>     <lgl>     <lgl> <lgl>
#> 1   151   151 TRUE              Mean relative d… FALSE     FALSE     TRUE  TRUE 
#> # … with 1 more variable: isTRUE(all.equal) <lgl>

^{Created on 2021-10-06 by the reprex package (v2.0.0)}

njtierney commented Oct 6, 2021

@cynthiahqy nice!

You should put this into a blog post somewhere, I think a lot of people would find this useful!