-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing Map for error columns and their values #85
Introducing Map for error columns and their values #85
Conversation
…e of raw values * `ErrorMessageSubmitJustErrorValue` class created to offer the ability to submit errors without source column but with error value
errorMessageSubmit.additionInfo.column | ||
) | ||
} | ||
} | ||
|
||
object EvaluateViaUdf { | ||
type ErrorMessageFunction[T] = (ErrType, ErrCode, ErrMsg, ErrCol, RawValues, AdditionalInfo) => T //TODO needed? | ||
type ErrorMessageFunction[T] = (ErrType, ErrCode, ErrMsg, ErrColsAndValues, AdditionalInfo) => T //TODO needed? | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please a new line after line 43
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue here it's really not needed.
val additionInfo = "additionInfo" | ||
val errCols = "errCols" | ||
val errValues = "errValues" | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a new line at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a newline at the end (after line 58)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above 😉
ColumnOrValue.withOption(additionalInfo) | ||
) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line after line 44
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's there. File has 45 lines. .editorconfig
file is actually set to ensure it's always there.
ColumnOrValue.withValue(errType), | ||
ColumnOrValue.withValue(errCode), | ||
ColumnOrValue.withValue(errMessage), | ||
errColName, | ||
ColumnOrValue.withOption(additionalInfo) | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a new line after line 39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do that, if it improves reading. But to be honest I probably forget more often 😊
val isColumn: Boolean = false | ||
val isValue: Boolean = true | ||
val getColumnName: Option[String] = None | ||
val columnNames: Set[String] = Set.empty | ||
val getValue: Option[T] = None | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔁
//mapping is missing, should be part of AdditionalInfo, as being very specific | ||
|
||
type ColumnTransformer = String => Column | ||
|
||
//This is to ensure some level of type-safety | ||
final case class ErrorColumn(column: Column) extends AnyVal | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔁
lit(None.orNull).cast(dataType) | ||
} | ||
|
||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a new line at the end and remove unnecessary new line (either line 70 or 71)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything else look cool to me except for minor stuff about the unnecessary new lines and missing new lines at the end of every add file.
@@ -18,18 +18,30 @@ package za.co.absa.spark.commons.errorhandling.partials | |||
|
|||
import org.apache.spark.sql.Column | |||
import org.apache.spark.sql.functions.struct | |||
import za.co.absa.spark.commons.errorhandling.ErrorMessageSubmit | |||
import za.co.absa.spark.commons.errorhandling.{ErrorMessage, ErrorMessageSubmit} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import za.co.absa.spark.commons.errorhandling.{ErrorMessage, ErrorMessageSubmit} | |
import za.co.absa.spark.commons.errorhandling.ErrorMessageSubmit |
Removed import not used in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- pulled
- build
- code review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, David, nice code. I attached my comments but don't take them too seriously, I don't have that much experience in Scala 😃
* @param additionInfo - Sequence of Mappings i.e Mapping Table Column -> Equivalent Mapped Dataset column | ||
* @param errColsAndValues - The names of the columns where the error occurred and their raw values (which are the | ||
* potential culprits of the error) | ||
* @param additionInfo - any optional additional information in the form of a JSON string | ||
*/ | ||
case class ErrorMessage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking if it wouldn't be better to use the types directly instead of type aliases. These type aliases might add some overhead for the users. I'm a bit stupid so I would be constantly forgetting what type ErrType
really is 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason why I used type aliases was that I wasn't sure we are settled on the types. This way it is rather easy to change.
Also the AdditionalInfo
actually is planned to be more then alias eventually (#90)
I am happy to discuss the other aliases once the thing is settled (aliases should be easy to remove) 👍
dataFrame.withColumn(errorColName, reMap(array_union(deMap(col(errorColName)), colToUnion))) | ||
} | ||
|
||
protected def doTheAggregation(dataFrame: DataFrame, errCols: Column*): DataFrame = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking if we should use the word "aggregation" here and also in ErrorHandling
trait. So I searched the dictionary for the definition of the word and it fits here but when I first started reading the code, I thought that there would be some df.groupBy(...).agg(...)
, i.e. some squishing of rows and then some calculation 😄 But if I understood correctly, it's more for combining the errors into one column (or actually, I guess it depends on the implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Actually you inspired me, as it's a little an internal function it can be little more eloquent. Hopefully it will improve the understanding. 😉
* `ErrorMessageSubmitWithoutColumn` changed from `case class` to `class` to allow inheritance * some PR comments addressed
Moving the the main PR, general direction approved in direct discussions. |
map
instead of errCol and sequence of raw valuesErrorMessageSubmitJustErrorValue
class created to offer the ability to submit errors without source column but with error valueAfter the presentation, discussions with others and more thoughts, I came to the idea to combine the
errCol
(which would be better as sequence anyway) andrawValues
into one column ofmap
data type. That resulted in a need to do some more exercise in the supporting types, but I believe to for the better.Also some general code improvements were added as I stumbled upon stuff during the refactoring.