Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing Map for error columns and their values #85

Conversation

benedeki
Copy link
Contributor

@benedeki benedeki commented Mar 14, 2023

  • Relatively big overwrite to use map instead of errCol and sequence of raw values
  • ErrorMessageSubmitJustErrorValue class created to offer the ability to submit errors without source column but with error value
  • Some other changes to improve the code

After the presentation, discussions with others and more thoughts, I came to the idea to combine the errCol (which would be better as sequence anyway) and rawValues into one column of map data type. That resulted in a need to do some more exercise in the supporting types, but I believe to for the better.

Also some general code improvements were added as I stumbled upon stuff during the refactoring.

…e of raw values

* `ErrorMessageSubmitJustErrorValue` class created to offer the ability to submit errors without source column but with error value
errorMessageSubmit.additionInfo.column
)
}
}

object EvaluateViaUdf {
type ErrorMessageFunction[T] = (ErrType, ErrCode, ErrMsg, ErrCol, RawValues, AdditionalInfo) => T //TODO needed?
type ErrorMessageFunction[T] = (ErrType, ErrCode, ErrMsg, ErrColsAndValues, AdditionalInfo) => T //TODO needed?
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please a new line after line 43

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue here it's really not needed.

val additionInfo = "additionInfo"
val errCols = "errCols"
val errValues = "errValues"
}

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a new line at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a newline at the end (after line 58)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above 😉

ColumnOrValue.withOption(additionalInfo)
)
}
}
Copy link
Contributor

@TebaleloS TebaleloS Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line after line 44

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's there. File has 45 lines. .editorconfig file is actually set to ensure it's always there.

ColumnOrValue.withValue(errType),
ColumnOrValue.withValue(errCode),
ColumnOrValue.withValue(errMessage),
errColName,
ColumnOrValue.withOption(additionalInfo)
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a new line after line 39

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do that, if it improves reading. But to be honest I probably forget more often 😊

val isColumn: Boolean = false
val isValue: Boolean = true
val getColumnName: Option[String] = None
val columnNames: Set[String] = Set.empty
val getValue: Option[T] = None
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔁

//mapping is missing, should be part of AdditionalInfo, as being very specific

type ColumnTransformer = String => Column

//This is to ensure some level of type-safety
final case class ErrorColumn(column: Column) extends AnyVal
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔁

lit(None.orNull).cast(dataType)
}


}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a new line at the end and remove unnecessary new line (either line 70 or 71)

Copy link
Contributor

@TebaleloS TebaleloS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else look cool to me except for minor stuff about the unnecessary new lines and missing new lines at the end of every add file.

@@ -18,18 +18,30 @@ package za.co.absa.spark.commons.errorhandling.partials

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.struct
import za.co.absa.spark.commons.errorhandling.ErrorMessageSubmit
import za.co.absa.spark.commons.errorhandling.{ErrorMessage, ErrorMessageSubmit}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import za.co.absa.spark.commons.errorhandling.{ErrorMessage, ErrorMessageSubmit}
import za.co.absa.spark.commons.errorhandling.ErrorMessageSubmit

Removed import not used in code.

Copy link
Contributor

@miroslavpojer miroslavpojer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • pulled
  • build
  • code review

Copy link

@jirifilip jirifilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, David, nice code. I attached my comments but don't take them too seriously, I don't have that much experience in Scala 😃

* @param additionInfo - Sequence of Mappings i.e Mapping Table Column -> Equivalent Mapped Dataset column
* @param errColsAndValues - The names of the columns where the error occurred and their raw values (which are the
* potential culprits of the error)
* @param additionInfo - any optional additional information in the form of a JSON string
*/
case class ErrorMessage(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if it wouldn't be better to use the types directly instead of type aliases. These type aliases might add some overhead for the users. I'm a bit stupid so I would be constantly forgetting what type ErrType really is 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason why I used type aliases was that I wasn't sure we are settled on the types. This way it is rather easy to change.

Also the AdditionalInfo actually is planned to be more then alias eventually (#90)

I am happy to discuss the other aliases once the thing is settled (aliases should be easy to remove) 👍

dataFrame.withColumn(errorColName, reMap(array_union(deMap(col(errorColName)), colToUnion)))
}

protected def doTheAggregation(dataFrame: DataFrame, errCols: Column*): DataFrame = {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if we should use the word "aggregation" here and also in ErrorHandling trait. So I searched the dictionary for the definition of the word and it fits here but when I first started reading the code, I thought that there would be some df.groupBy(...).agg(...), i.e. some squishing of rows and then some calculation 😄 But if I understood correctly, it's more for combining the errors into one column (or actually, I guess it depends on the implementation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Actually you inspired me, as it's a little an internal function it can be little more eloquent. Hopefully it will improve the understanding. 😉

* `ErrorMessageSubmitWithoutColumn` changed from `case class` to `class` to allow inheritance
* some PR comments addressed
@benedeki
Copy link
Contributor Author

Moving the the main PR, general direction approved in direct discussions.

@benedeki benedeki merged commit 8e747e7 into feature/83-create-a-spike-for-error-handling Mar 16, 2023
@benedeki benedeki deleted the feature/83-create-a-spike-for-error-handling-map-errcol branch March 16, 2023 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants