-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Allow overrides for some file structure detection decisions #33630
[ML] Allow overrides for some file structure detection decisions #33630
Conversation
This change modifies the file structure detection functionality such that some of the decisions can be overridden with user supplied values. The fields that can be overridden are: - charset - format - has_header_row - column_names - delimiter - quote - should_trim_fields - grok_pattern - timestamp_field - timestamp_format If an override makes finding the file structure impossible then the endpoint will return an exception.
Pinging @elastic/ml-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments, nothing major
FORMAT.getPreferredName() + " is " + FileStructure.Format.DELIMITED, validationException); | ||
} | ||
if (shouldTrimFields != null) { | ||
validationException = addValidationError(SHOULD_TRIM_FIELDS.getPreferredName() + " may only be specified if " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Excluding the first word the error message is duplicated and could be extracted.
@@ -116,8 +131,11 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String | |||
} | |||
|
|||
if (isHeaderInFile) { | |||
String quote = String.valueOf(csvPreference.getQuoteChar()); | |||
String twoQuotes = quote + quote; | |||
String optQuote = quote.replaceAll("([\\\\|()\\[\\]{}^$*?])", "\\\\$1") + "?"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like you are escaping the set of regex special characters but some are missing. .
, +
, :
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think :
is OK as it's only special within contexts that won't arise if the other characters are escaped. But I'll add .
and +
and remove the duplication with the following line.
@@ -39,6 +39,17 @@ protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient | |||
FindFileStructureAction.Request request = new FindFileStructureAction.Request(); | |||
request.setLinesToSample(restRequest.paramAsInt(FindFileStructureAction.Request.LINES_TO_SAMPLE.getPreferredName(), | |||
FileStructureFinderManager.DEFAULT_IDEAL_SAMPLE_LINE_COUNT)); | |||
request.setCharset(restRequest.param(FindFileStructureAction.Request.CHARSET.getPreferredName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xpack.ml.find_file_structure.json should be updated with the new params and perhaps a yml test if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
) This change modifies the file structure detection functionality such that some of the decisions can be overridden with user supplied values. The fields that can be overridden are: - charset - format - has_header_row - column_names - delimiter - quote - should_trim_fields - grok_pattern - timestamp_field - timestamp_format If an override makes finding the file structure impossible then the endpoint will return an exception.
* master: (24 commits) Only notify ready global checkpoint listeners (elastic#33690) Don't count hits via the collector if the hit count can be computed from index stats. (elastic#33701) Expose retries for CCR fetch failures (elastic#33694) Test fix - Graph vertices could appear in different orders based on map insertion sequence (elastic#33709) Structured audit logging (elastic#31931) Core: Add DateFormatter interface for java time parsing (elastic#33467) [CCR] Check whether the rejected execution exception has the shutdown flag set (elastic#33703) Mute ClusterDisruptionIT#testSendingShardFailure Revert "Mute FullClusterRestartSettingsUpgradeIT" Adjust BWC version on settings upgrade test (elastic#33650) [ML] Allow overrides for some file structure detection decisions (elastic#33630) Adapt skip version for doc_values format deprecation [TEST] wait for no initializing shards [Docs] Minor fix in `has_child` javadoc comment (elastic#33674) Mute FullClusterRestartSettingsUpgradeIT [Kerberos] Add realm name & UPN to user metadata (elastic#33338) [TESTS] Disable specific locales for RestrictedTrustManagerTest (elastic#33299) SQL: Return functions in JDBC driver metadata (elastic#33672) SCRIPTING: Move terms_set Context to its Own Class (elastic#33602) AwaitsFix testRestoreMinmal ...
This change modifies the file structure detection functionality
such that some of the decisions can be overridden with user
supplied values.
The fields that can be overridden are:
If an override makes finding the file structure impossible then
the endpoint will return an exception.