-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-33278 Fix //version race condition in parquetType.ecl #19440
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Jack Del Vecchio <[email protected]>
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-33278 Jirabot Action Result: |
OUTPUT(qStringResult, NAMED('QStringTest'), OVERWRITE), | ||
OUTPUT(utf8Result, NAMED('UTF8Test'), OVERWRITE), | ||
OUTPUT(unicodeResult, NAMED('UnicodeTest'), OVERWRITE) | ||
SEQUENTIAL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you reorganise the write file then read it back parallel for each type from previous version to this parallel write and sequential read? Is the ParquetIO.Read() not thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was because I was seeing some weird behavior with the SEQUENTIAL call. Before I explicitly ordered the Parquet.Write actions to come before the OUTPUTs, the DeleteExternalFile actions would occur first despite coming after the OUTPUTs in the list of SEQUENTIAL arguments. This meant the cleanup of the files happened before they were written and read from leaving behind every file that was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Read is thread safe, but the reason for the sequential read is because the result order needs to be the same to match the key file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. Fair enough.
This Morning I found 2 more errors related to parquetType test:
557. parquetTypes(compressionType='GZip') Error: 0: parquet: IOError: Couldn't deserialize thrift: TProtocolException: Invalid data
558. parquetTypes(compressionType='Brotli') Error: 0: parquet: IOError: Couldn't deserialize thrift: TProtocolException: Invalid data
What do you think about are those errors related to versions race condition as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so. I think that would be the result of reading from a file while something is in the middle of writing a record batch to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good to merge.
import ^ as root; | ||
compressionType := #IFDEFINED(root.compressionType, 'UNCOMPRESSED'); | ||
|
||
IMPORT Std; | ||
IMPORT Parquet; | ||
|
||
dropzoneDirectory := Std.File.GetDefaultDropZone(); | ||
dropzoneDirectory := Std.File.GetDefaultDropZone() + '/regress/parquet/' + WORKUNIT + '-'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just pondering about is this name (dropzoneDirectory) still correct or not, because now it contains not only the default DZ path but a file path and prefix as well. I feel it is a bit misleading, but hopefully not a big deal.
Type of change:
Checklist:
Smoketest:
Testing: