How to filter data using assertr? #86

Closed
DrAndiLowe opened this issue Aug 5, 2019 · 9 comments

Comments

@DrAndiLowe

Hi, quick question: I want a different way to handle errors in my data; instead of halting execution when an error is detected (such as with error_stop and error_warn) or adding a special "assertr_errors" attribute to the data and continuing execution (such as with error_append), I want to filter rows (remove the rows with bad data) and report the errors so they can be displayed at the end of the pipeline. My use case is that I have a huge data.frame in a complex pipeline that typically takes days to run, so I need the pipeline to react dynamically and recover -- removing rows is acceptable -- instead of going crunch-bang after several hours of running. And it would be nice to have some record of which rows were removed, and why. Any ideas on how I could do this with assertr?

@DrAndiLowe
Author

OK, my solution/kludge looks like this (please let me know if there's a better way):

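library(assertr)
library(dplyr)  # also provides %>%
library(purrr)

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    # Merge them into the list we were given (`errors` was never defined here)
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }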
  
  # All `assertr_error` S3 objects have `print` and `summary` methods
  # here; we will call `print` on all of the errors since `print`
  # will give us the complete/unabridged error report
  list_of_errors %>% 
    map(
      function(x) {
        print(x) # Print out the error message
        return(x$error_df) # Get the detailed error information
      }
    ) %>% 
    bind_rows() %>% # Bind together all the detailed error information
    pull(index) %>% # Get the indices of the affected rows
    {.} -> indices
  
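  # Keep every row whose position is not in `indices`; unlike `data[-indices,]`,
  # this is safe when `indices` is empty and keeps one-column data as a data.frame
  return(data[setdiff(seq_len(nrow(data)), indices), , drop = FALSE])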
}

our.data %>%
  chain_start %>%
  assert(within_bounds(0,Inf), mpg) %>%
  chain_end(error_fun = filter_bad) -> foo

foo

@tonyfischetti
Owner

Wow, this would be a really great feature! Thanks for suggesting it!
Hmm, before I test your solution, I'll think through the ways I could do this and test which one is the most efficient.

@DrAndiLowe
Author

My solution looks like this now:

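library(magrittr)  # provides %>%; other packages are called via `pkg::`

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    # Merge them into the list we were given (`errors` was never defined here)
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }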
  
  # Log the message of each `assertr_error` object and collect its
  # `error_df`, which holds the detailed per-row error information
  suppressWarnings(
    list_of_errors %>% 
      purrr::map(
        function(x) {
          message(x$message) # For output logging
          print(x$message) # For output logging
          return(x$error_df) # Get the detailed error information
        }
      ) %>% 
      dplyr::bind_rows() %>% # Bind together all the detailed error information
      {.} -> error_df
  )
  
  error_df %>% 
    dplyr::pull(index) %>% # Get the indices of the affected rows
    {.} -> indices
  
  data %>% 
    tibble::rowid_to_column(var = "rowname") %>% # Add a temporary positional row index (row names may not be row numbers)
    tidylog::filter(!(rowname %in% indices)) %>% # Filter out the bad rows and log actions
    dplyr::select(-rowname) %>% # Remove temporary row index
    {.} -> data
  
  attr(data, "data_errors") <- error_df # Set an attribute of the data containing errors found
  return(data) 
}

# Test
# our.data %>%
#   chain_start %>%
#   assert(within_bounds(0, Inf), mpg) %>%
#   chain_end(error_fun = filter_bad) -> our.data
# our.data

Not sure if that's actually better, but that's what I'm using. Basically, if an assertion fails, the row is filtered, processing continues without interruption, I get a message from tidylog telling me that rows were removed, and I get a data.frame attached as an attribute to the data that contains a full description of what failed, and why.

@tonyfischetti
Owner

Quick question...
Do you think the offending rows should be filtered immediately after each step in the chain, or all at once at the end?
If I had to choose one, I'd choose the former.

@DrAndiLowe
Author

H'mmm, not sure. For my specific use case, it was better

  • to gather the indices of all the offending rows and
  • create a single artefact containing information on what was bad

in a single step, but I could have filtered at each step in the chain and built the error data.frame as the chain progressed. The advantage of the strategy I used is that there may be an overlap between different assertions that result in a row being removed; a row might be bad for more than one reason, and it could be good to know this. If you filter immediately after each step, you won't know that a row has more than one problem, and how it is removed will depend on the order of the assertions in the code. Personal preference, but I prefer to know everything that was wrong with the data.
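
The "gather everything, then filter once" strategy described above can be sketched in base R as follows. This is only an illustration of the idea, not assertr's API: the example data, the `checks` list, and the check names are all hypothetical.

```r
# Hypothetical example data: row 2 breaks one rule, row 4 breaks two.
cars <- data.frame(
  mpg = c(21.0, -5.0, 18.7, -1.0),
  cyl = c(6, 4, 8, 99)
)

# Hypothetical named checks; each returns TRUE for rows that pass.
checks <- list(
  mpg_non_negative = function(d) d$mpg >= 0,
  cyl_plausible    = function(d) d$cyl %in% c(4, 6, 8)
)

# Evaluate every check on the full, unfiltered data and record each failure.
failures <- do.call(rbind, lapply(names(checks), function(name) {
  bad <- which(!checks[[name]](cars))
  data.frame(index = bad, check = rep(name, length(bad)))
}))

# One row per offending index, listing every reason it failed.
reasons <- aggregate(check ~ index, data = failures,
                     FUN = function(x) paste(x, collapse = ", "))

# Drop the offending rows in a single step at the end and attach the record.
keep <- setdiff(seq_len(nrow(cars)), failures$index)
clean <- cars[keep, , drop = FALSE]
attr(clean, "data_errors") <- reasons
```

Because every check sees the unfiltered data, row 4 is reported once with both reasons, regardless of the order in which the checks are listed.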

@tonyfischetti
Owner

That makes total sense. I think it's important that there's an option to do it last.
I think there should be an option for both approaches. As an example, I can see a situation where rows that fail an assertion earlier in the pipeline would cause a downstream assertion to malfunction, so filtering at each step would be helpful, too.

@DrAndiLowe
Author

After further thought, it's best to filter offending rows in each step rather than all at the end. I encountered some situations in which a crash occurred because the input values to an assertion were nonsensical. For example, in a situation in which values are tested to see if they satisfy is.numeric and a downstream assertion tests to see if these same values are within_bounds, a crash will occur in the latter if values contains something other than a numeric. The crash wouldn't have occurred if those offending rows had been filtered in the former step. So it seems best to apply filtering immediately. A record of all failing assertions is still possible if required.
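
As a rough base-R sketch of filtering at each step (again not assertr's API; `check_bounds`, `filter_step`, and the example data are hypothetical stand-ins), the first step removes the unparseable value so that the strict bounds check in the second step never sees it, while a log of every removal is still built up along the way:

```r
# Hypothetical raw input: the column arrives as text, and row 3 is not a number.
raw <- data.frame(mpg = c("21.0", "-5", "n/a", "18.7"), stringsAsFactors = FALSE)
raw$row_id <- seq_len(nrow(raw))  # remember the original row positions

# Hypothetical stand-in for a bounds check that refuses non-numeric input.
check_bounds <- function(x, lower, upper) {
  if (!is.numeric(x)) stop("bounds can only be checked on numerics")
  x >= lower & x <= upper
}

# Apply one check, drop the failing rows immediately, and append to the log.
filter_step <- function(state, name, passes) {
  ok <- passes(state$data)
  removed <- state$data$row_id[!ok]
  state$log <- rbind(state$log,
                     data.frame(index = removed, check = rep(name, length(removed))))
  state$data <- state$data[ok, , drop = FALSE]
  state
}

state <- list(data = raw, log = NULL)

# Step 1: keep only rows whose value parses as a number, then convert the column.
state <- filter_step(state, "is_number",
                     function(d) !is.na(suppressWarnings(as.numeric(d$mpg))))
state$data$mpg <- as.numeric(state$data$mpg)

# Step 2: the bounds check now only ever sees numeric input.
state <- filter_step(state, "within_bounds",
                     function(d) check_bounds(d$mpg, 0, Inf))
```

Calling `check_bounds(raw$mpg, 0, Inf)` on the unfiltered column would stop with an error. The trade-off is the one discussed earlier in the thread: a row removed in step 1 is never tested by step 2, so the log holds only the first reason each row failed.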

@DrAndiLowe
Author

I made a bugfix to ensure that "values" are coerced to a consistent type before row binding. My code looks like this now:

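library(magrittr)  # provides %>%; other packages are called via `pkg::`

filter_bad <- function(list_of_errors, data = NULL, ...){
  # We are checking to see if there are any errors that
  # are still attached to the data.frame
  if(!is.null(data) && !is.null(attr(data, "assertr_errors"))) {
    # Merge them into the list we were given (`errors` was never defined here)
    list_of_errors <- append(attr(data, "assertr_errors"), list_of_errors)
  }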
  
  if(length(list_of_errors) > 0) cat("\nData error(s) detected!\n", file = stderr())
  
  # Log the message of each `assertr_error` object and collect its
  # `error_df`, which holds the detailed per-row error information
  suppressWarnings(
    list_of_errors %>% 
      furrr::future_map(
        function(x) {
          message(x$message) # For output logging
          print(x$message) # For output logging
          x$error_df %>% # Get the error information held in a DF
            dplyr::mutate(value = as.character(value)) %>% # Type consistency for row binding
            return(.) # Get the detailed error information
        }
      ) %>% 
      dplyr::bind_rows() %>% # Bind together all the detailed error information
      {.} -> error_df
  )
  
  error_df %>% 
    dplyr::pull(index) %>% # Get the indices of the affected rows
    unique() %>% 
    {.} -> indices
  
  data %>% 
    tibble::rowid_to_column(var = "rowname") %>% # Add a temporary positional row index (row names may not be row numbers)
    tidylog::filter(!(rowname %in% indices)) %>% # Filter out the bad rows and log actions
    dplyr::select(-rowname) %>% # Remove temporary row index
    {.} -> data
  
  attr(data, "data_errors") <- error_df # Set an attribute of the data containing errors found
  return(data) 
}

@tonyfischetti
Owner

Heads up: I think the latest version of assertr might be able to help you.
Specifically, these functions, which are now in the vignette thanks to @krystian8207:

defect_report - For single rule and defective data it displays short info about skipping the current assertion. For chain_end sums up all skipped rules for defective data.

defect_df_return - For single rule and defective data it returns info data.frame about skipping current assertion. For chain_end returns all skipped rules info data.frame for defective data.

Let me know if this works for you. If not, please re-open the issue
