HPCC-33278 Fix //version race condition in parquetType.ecl #19440

jackdelv · 2025-01-21T19:19:06Z

Add workunit id to filename to avoid multiple threads reading/writing to the same file
Clean up temporary files
Explicitly order actions to allow files to be deleted last

Signed-off-by: Jack Del Vecchio <[email protected]>

github-actions · 2025-01-21T19:19:28Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-33278

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

AttilaVamos · 2025-01-22T12:02:10Z

testing/regress/ecl/parquetTypes.ecl

-    OUTPUT(qStringResult, NAMED('QStringTest'), OVERWRITE),
-    OUTPUT(utf8Result, NAMED('UTF8Test'), OVERWRITE),
-    OUTPUT(unicodeResult, NAMED('UnicodeTest'), OVERWRITE)
+SEQUENTIAL(


Why did you reorganise this? The previous version wrote each type's file and read it back in parallel; this version writes in parallel and reads sequentially. Is ParquetIO.Read() not thread safe?

It was because I was seeing some weird behavior with the SEQUENTIAL call. Before I explicitly ordered the Parquet.Write actions to come before the OUTPUTs, the DeleteExternalFile actions would run first, even though they came after the OUTPUTs in the list of SEQUENTIAL arguments. That meant the cleanup happened before the files were written and read, so every file the test used was left behind.

The Read is thread safe. The reads are sequential because the results need to come out in the same order every time to match the key file.
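
A minimal sketch of the ordering described above, for illustration only and not taken from the PR. The record layout, the data, and the `Example.parquet` name are hypothetical. It also assumes that the `/regress/parquet/` directory already exists under the drop zone and that `'.'` is an acceptable location argument for `Std.File.DeleteExternalFile` on a single-node system.

```ecl
IMPORT Std;
IMPORT Parquet;

// Hypothetical layout and data, for illustration only.
exampleRecord := RECORD
    UNSIGNED4 id;
    STRING10 name;
END;
exampleData := DATASET([{1, 'one'}, {2, 'two'}], exampleRecord);

// Per-workunit prefix, as in the diff, so concurrent //version runs
// never read or write the same file.
filePrefix := Std.File.GetDefaultDropZone() + '/regress/parquet/' + WORKUNIT + '-';
exampleFile := filePrefix + 'Example.parquet';

SEQUENTIAL(
    // 1. Write first (TRUE requests overwrite of an existing file).
    ParquetIO.Write(exampleData, exampleFile, TRUE),
    // 2. Read back and output. With several files, keeping the OUTPUTs in a
    //    fixed order keeps the results aligned with the key file.
    OUTPUT(ParquetIO.Read(exampleRecord, exampleFile), NAMED('ExampleTest')),
    // 3. Clean up last. The '.' location is an assumption.
    Std.File.DeleteExternalFile('.', exampleFile)
);
```

With more than one file, the writes could be grouped so that all of them complete before the first OUTPUT, and all deletes placed after the last OUTPUT.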

OK, I see. Fair enough.

This morning I found 2 more errors related to the parquetTypes test:

557. parquetTypes(compressionType='GZip') Error: 0: parquet: IOError: Couldn't deserialize thrift: TProtocolException: Invalid data
558. parquetTypes(compressionType='Brotli') Error: 0: parquet: IOError: Couldn't deserialize thrift: TProtocolException: Invalid data

Do you think those errors are related to the //version race condition as well?

I believe so. I think that would be the result of reading from a file while something is in the middle of writing a record batch to it.

AttilaVamos

I think it is good to merge.

AttilaVamos · 2025-01-22T14:34:49Z

testing/regress/ecl/parquetTypes.ecl

 import ^ as root;
 compressionType := #IFDEFINED(root.compressionType, 'UNCOMPRESSED');

 IMPORT Std;
 IMPORT Parquet;

-dropzoneDirectory := Std.File.GetDefaultDropZone();
+dropzoneDirectory := Std.File.GetDefaultDropZone() + '/regress/parquet/' + WORKUNIT + '-';


I am wondering whether this name (dropzoneDirectory) is still accurate, because it now contains not only the default DZ path but also a file path and prefix. I feel it is a bit misleading, but hopefully not a big deal.

HPCC-33278 Fix //version race condition in parquetType.ecl

b2c2127

Signed-off-by: Jack Del Vecchio <[email protected]>

jackdelv requested a review from AttilaVamos January 21, 2025 19:19

AttilaVamos reviewed Jan 22, 2025

jackdelv requested a review from AttilaVamos January 29, 2025 16:53

AttilaVamos approved these changes Jan 30, 2025
