-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/add data checks #179
base: master
Are you sure you want to change the base?
Fix/add data checks #179
Conversation
…x/add_data_checks
…add_data_checks
…add_data_checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks pretty good, my only main comment is to not duplicate the logging. I think having the warnings/errors in the check functions and then raise the exception in the mapdata is good.
I would also consider moving the notebooks into the documentation examples because it probably would be useful for people to be able to review this.
Geopandas make_valid could be used before checking for validity. Or maybe as a configuration option?
@AngRodrigues let me know what changes you want to make and i can do the rest.
if not geodata.geometry.is_valid.all(): | ||
logger.error(f"Invalid geometries found in datatype {datatype_name}. Please fix them before proceeding.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worthwhile trying to make the geometry valid?
https://geopandas.org/en/v1.0.1/docs/reference/api/geopandas.GeoSeries.make_valid.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure exactly what make_valid does, so it could potentially make a change we don't want but maybe as a config option
if geodata[id_column].isnull().any(): | ||
error_msg = ( | ||
f"Datatype {geodata_name}: Column '{id_column}' " | ||
f"(config key: '{id_config_key}') contains non-numeric or NaN values. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, we could make the error message even better by providing the strings (or some of) for the values that couldn't be coerced
# else: | ||
# warning_msg = ( | ||
# f"Datatype {datatype_name.upper()}: Optional column '{column_name}' " | ||
# f"(config key: '{config_key}') is missing from the data. " | ||
# ) | ||
####### this might be taking it a bit too far | ||
|
||
# logger.info(warning_msg) | ||
# warnings.append(warning_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keep that as an info log, but not as a warning. Good to have the extra information avaliable if we need it
elif datatype == Datatype.FAULT: | ||
validity_check, message = check_fault_fields_validity(mapdata = self) | ||
if validity_check: | ||
logger.error(f"Datatype FAULT - data validation failed: {message}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is too much duplication in the error messages. The validity checker prints out nice error messages. I think it would be enough to just raise the exception Datatype FAULT - data validation failed
"bounding_box": bounding_box, | ||
"working_projection": working_projection, # this may be removed when fix is added for https://github.com/Loop3D/map2loop/issues/103 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could/should we calculate the bounding box from the extent of the shapefiles?
Description
added
data_checks
, a series of functions that check the main datatypes in map2loop: Geology, Structure, Faults and Folds. Most of the code was refactored to avoid code repetition between the different functions, but checks that are valid for one single datatype are kept in the same functionsAlso changed the config dictionary data checks into this file, and the functionality of not allowing invalid args into project (from issue 162), as it made sense to have these fixes together. Please check PR #171 for more info on this.
Fixes #162
Fixes #138
Type of change
How Has This Been Tested?
I am attaching 3 notebooks that show some of the common warnings and errors.
Geology - https://gist.github.com/AngRodrigues/a273380a082d024669699d13fa99a0f8
Structure - https://gist.github.com/AngRodrigues/7e7c5b9bb178cb89f1f815c3392db417
Fault - https://gist.github.com/AngRodrigues/6db7a0eca5421b90f4c79ae6021f69f6
Also checked this with the notebooks from source data and Loop server data and it seems to work well.
Branch:
fix/add_data_checks