How to filter data using assertr? #86

DrAndiLowe · 2019-08-05T17:24:10Z

Hi, quick question: I want a different way to handle errors in my data; instead of halting execution when an error is detected (such as with error_stop and error_warn) or adding a special "assertr_errors" attribute to the data and continuing execution (such as with error_append), I want to filter rows (remove the rows with bad data) and report the errors so they can be displayed at the end of the pipeline. My use case is that I have a huge data.frame in a complex pipeline that typically takes days to run, so I need the pipeline to react dynamically and recover -- removing rows is acceptable -- instead of going crunch-bang after several hours of running. And it would be nice to have some record of which rows were removed, and why. Any ideas on how I could do this with assertr?


DrAndiLowe · 2019-08-07T12:51:19Z

OK, my solution/kludge looks like this (please let me know if there's a better way):

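library(assertr)
library(dplyr) # %>%, bind_rows(), pull()
library(purrr) # map()

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    # Prepend them to the errors that were passed in
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }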
  
  # All `assertr_error` S3 objects have `print` and `summary` methods
  # here; we will call `print` on all of the errors since `print`
  # will give us the complete/unabridged error report
  list_of_errors %>% 
    map(
      function(x) {
        print(x) # Print out the error message
        return(x$error_df) # Get the detailed error information
      }
    ) %>% 
    bind_rows() %>% # Bind together all the detailed error information
    pull(index) %>% # Get the indices of the affected rows
    {.} -> indices
  
  return(data[-indices,]) # Filter out the bad rows and return
}

our.data %>%
  chain_start %>%
  assert(within_bounds(0,Inf), mpg) %>%
  chain_end(error_fun = filter_bad) -> foo

foo

tonyfischetti · 2019-08-13T13:17:25Z

Wow, this would be a really great feature!! Thanks for suggesting it!
Hmm, before I test your solution, I'll think of all the ways I can do this, and test which one is most efficient.

DrAndiLowe · 2019-08-13T15:24:04Z

My solution looks like this now:

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }
  
  # Each `assertr_error` object carries a `message` and an `error_df`;
  # we log the message and collect the `error_df`s, which hold
  # the per-row details
  suppressWarnings(
    list_of_errors %>% 
      purrr::map(
        function(x) {
          message(x$message) # For output logging
          print(x$message) # For output logging
          return(x$error_df) # Get the detailed error information
        }
      ) %>% 
      dplyr::bind_rows() %>% # Bind together all the detailed error information
      {.} -> error_df
  )
  
  error_df %>% 
    dplyr::pull(index) %>% # Get the indices of the affected rows
    {.} -> indices
  
  data %>% 
    dplyr::mutate(rowname = dplyr::row_number()) %>% # Add a temporary positional row index
    tidylog::filter(!(rowname %in% indices)) %>% # Filter out the bad rows and log actions
    dplyr::select(-rowname) %>% # Remove temporary row index
    {.} -> data
  
  attr(data, "data_errors") <- error_df # Set an attribute of the data containing errors found
  return(data) 
}

# Test
# our.data %>%
#   chain_start %>%
#   assert(within_bounds(0, Inf), mpg) %>%
#   chain_end(error_fun = filter_bad) -> our.data
# our.data

Not sure if that's actually better, but that's what I'm using. Basically, if an assertion fails, the row is filtered, processing continues without interruption, I get a message from tidylog telling me that rows were removed, and I get a data.frame attached as an attribute to the data that contains a full description of what failed, and why.
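For anyone reading along, here is a minimal base-R sketch of how the attached record could be read back at the end of a pipeline. The `cleaned` data frame and the `report_removed()` helper are hypothetical; they only mimic what `filter_bad` is described as attaching under the "data_errors" attribute. Whether that attribute survives every later transformation is not something this thread confirms, so it is probably safest to read it right after `chain_end`.

```r
# Hypothetical stand-in for data returned by filter_bad():
# one row (original index 2) was removed, and the reason is
# stored in the "data_errors" attribute.
cleaned <- data.frame(id = c(1L, 3L), mpg = c(21.0, 22.8))
attr(cleaned, "data_errors") <- data.frame(
  verb = "assert",
  predicate = "within_bounds(0, Inf)",
  column = "mpg",
  index = 2L,
  value = "-5",
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise and return the attached error record
report_removed <- function(data) {
  errs <- attr(data, "data_errors")
  if (is.null(errs)) {
    message("No rows were removed")
    return(invisible(NULL))
  }
  message(nrow(errs), " failed check(s) on ",
          length(unique(errs$index)), " row(s)")
  errs
}

report <- report_removed(cleaned)
```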

tonyfischetti · 2019-08-13T19:33:35Z

Quick question...
Do you think the offending rows should be filtered immediately after each step in the chain or all at once at the end?
If I had to choose one, I'd choose the former

DrAndiLowe · 2019-08-14T15:18:49Z

H'mmm, not sure. For my specific use case, it was better to gather the indices of all the offending rows and create a single artefact containing information on what was bad in a single step, but I could have filtered at each step in the chain and built the error data.frame as the chain progressed. The advantage of the strategy I used is that there may be an overlap between different assertions that result in a row being removed; a row might be bad for more than one reason, and it could be good to know this. If you filter immediately after each step, you won't know that a row has more than one problem, and how it is removed will depend on the order of the assertions in the code. Personal preference, but I prefer to know everything that was wrong with the data.
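To make the overlap point concrete, here is a small base-R sketch of the "collect everything, filter once" strategy. It does not use assertr; the `checks` list and the column rules are made up for illustration. Every check runs against the full data, so a row that fails twice shows up twice in the record.

```r
df <- data.frame(mpg = c(21, -1, 30, -4), cyl = c(4, 6, 99, 99))

# Hypothetical checks: each returns TRUE for rows that pass
checks <- list(
  mpg_non_negative = function(d) d$mpg >= 0,
  cyl_plausible    = function(d) d$cyl %in% c(4, 6, 8)
)

# Run every check on the unfiltered data and record each failing row
failures <- do.call(rbind, lapply(names(checks), function(nm) {
  bad <- which(!checks[[nm]](df))
  data.frame(check = rep(nm, length(bad)), index = bad,
             stringsAsFactors = FALSE)
}))

# How many distinct reasons each offending row was flagged for
reasons_per_row <- table(failures$index)

# Filter once, at the end (guarding against the "no failures" case,
# where negative indexing with an empty vector would drop every row)
bad_rows <- unique(failures$index)
cleaned <- if (length(bad_rows) > 0) df[-bad_rows, , drop = FALSE] else df
```

Row 4 fails both checks here, which is exactly the information that would be lost if it had been dropped after the first check.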

tonyfischetti · 2019-08-14T19:53:08Z

That makes total sense. I think it's important that there's an option to do it last.
I think there should be an option for both approaches. As an example, I can see a situation where rows that fail an assertion earlier in the pipeline would cause a downstream assertion to malfunction, so filtering at each step would be helpful, too.

DrAndiLowe · 2019-09-12T17:03:33Z

After further thought, it's best to filter offending rows in each step rather than all at the end. I encountered some situations in which a crash occurred because the input values to an assertion were nonsensical. For example, if one assertion tests whether values satisfy is.numeric and a downstream assertion tests whether these same values are within_bounds, the latter will crash if the values contain something other than a numeric. The crash wouldn't have occurred if those offending rows had been filtered out in the former step. So it seems best to apply filtering immediately. A record of all failing assertions is still possible if required.
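A rough base-R illustration of the step-by-step approach described above, under the assumption that the column arrives as text with a few unparseable entries. The `raw` data, the 0 to 100 bounds, and the `removed` list are all hypothetical. The bounds check only ever sees rows that survived the numeric check, and each step keeps its own record of what it removed.

```r
raw <- data.frame(
  id = 1:4,
  reading = c("12.5", "oops", "-3", "40"),
  stringsAsFactors = FALSE
)

removed <- list() # one entry per step, holding the rows that step dropped

# Step 1: drop rows whose reading cannot be parsed as a number
parsed <- suppressWarnings(as.numeric(raw$reading))
not_numeric <- is.na(parsed)
removed$is_numeric <- raw[not_numeric, , drop = FALSE]
step1 <- raw[!not_numeric, , drop = FALSE]
step1$reading <- as.numeric(step1$reading)

# Step 2: the bounds check is now safe, because only numeric values remain
in_bounds <- step1$reading >= 0 & step1$reading <= 100
removed$within_bounds <- step1[!in_bounds, , drop = FALSE]
step2 <- step1[in_bounds, , drop = FALSE]
```

The trade-off mentioned earlier in the thread still applies: a row dropped in step 1 is never tested in step 2, so `removed` records only the first reason each row failed.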

DrAndiLowe · 2019-09-12T17:05:39Z

I made a bugfix to ensure that "values" are coerced to a consistent type before row binding. My code looks like this now:

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }
  
  if(length(list_of_errors) > 0) cat("\nData error(s) detected!\n", file = stderr())
  
  # Each `assertr_error` object carries a `message` and an `error_df`;
  # we log the message and collect the `error_df`s, which hold
  # the per-row details
  suppressWarnings(
    list_of_errors %>% 
      furrr::future_map(
        function(x) {
          message(x$message) # For output logging
          print(x$message) # For output logging
          x$error_df %>% # Get the error information held in a DF
            dplyr::mutate(value = as.character(value)) %>% # Type consistency for row binding
            return(.) # Get the detailed error information
        }
      ) %>% 
      dplyr::bind_rows() %>% # Bind together all the detailed error information
      {.} -> error_df
  )
  
  error_df %>% 
    dplyr::pull(index) %>% # Get the indices of the affected rows
    unique() %>% 
    {.} -> indices
  
  data %>% 
    dplyr::mutate(rowname = dplyr::row_number()) %>% # Add a temporary positional row index
    tidylog::filter(!(rowname %in% indices)) %>% # Filter out the bad rows and log actions
    dplyr::select(-rowname) %>% # Remove temporary row index
    {.} -> data
  
  attr(data, "data_errors") <- error_df # Set an attribute of the data containing errors found
  return(data) 
}
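The type-consistency fix can be seen in isolation with two toy error tables. This sketch assumes dplyr is installed; `e1` and `e2` are hypothetical stand-ins for the `error_df` of two different assertions, one on a numeric column and one on a character column.

```r
library(dplyr)

# Hypothetical error_df fragments: `value` is numeric in one
# and character in the other
e1 <- data.frame(index = 2L, value = -5, stringsAsFactors = FALSE)
e2 <- data.frame(index = 7L, value = "n/a", stringsAsFactors = FALSE)

# Binding them directly fails because the `value` columns have
# incompatible types; the error is caught here to keep the example running
naive <- tryCatch(
  bind_rows(e1, e2),
  error = function(e) "bind_rows failed"
)

# Coercing `value` to character first gives every fragment the same type
fixed <- bind_rows(lapply(list(e1, e2), function(d) {
  d$value <- as.character(d$value)
  d
}))
```

Coercing to character loses the original type of the offending value, which seems an acceptable price for a single combined record.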

tonyfischetti · 2021-01-25T16:40:52Z

Heads up. I think the latest version of assertr might be able to help you.
Specifically, these functions that are now in the vignette thanks to @krystian8207

defect_report - For a single rule on defective data, it displays a short note that the current assertion was skipped. For chain_end, it sums up all the rules skipped for the defective data.

defect_df_return - For a single rule on defective data, it returns a data.frame with info about the skipped assertion. For chain_end, it returns a data.frame with info on all the rules skipped for the defective data.

Let me know if this works for you. If not, please re-open the issue
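A tentative sketch of how these might be wired up, based on my reading of the assertr vignette rather than on anything confirmed in this thread: I am assuming that `verify` and `assert` accept `obligatory` and `defect_fun` arguments, and that a failed obligatory rule marks the data as defective so later rules are handed to `defect_fun`. Please check the argument names against the installed version. Going by the descriptions above, these handlers report skipped assertions; they do not themselves remove rows.

```r
library(assertr)
library(magrittr)

# Hypothetical defect: mpg is stored as character instead of numeric
bad <- mtcars
bad$mpg <- as.character(bad$mpg)

out <- bad %>%
  # Assumed behaviour: when this obligatory rule fails, the data is
  # marked as defective and the error is appended instead of halting
  verify(is.numeric(mpg), obligatory = TRUE, error_fun = error_append) %>%
  # Assumed behaviour: on defective data this rule is skipped and
  # defect_report prints a short note about the skip
  assert(within_bounds(0, Inf), mpg, defect_fun = defect_report)
```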

DrAndiLowe mentioned this issue Sep 5, 2019

Error in data.frame(verb = verb, redux_fn = NA, predicate = name.of.predicate, : arguments imply differing number of rows #88

Open

tonyfischetti closed this as completed Jan 25, 2021
