Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[native] Add support for ORC reader #23037

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

wypb
Copy link
Contributor

@wypb wypb commented Jun 20, 2024

Description

We have recently merged the PR for reading ORC statistics and implementing OrcReader based on DwrfReader on the velox side. Now it is time to add support for ORC reader it in Prestissimo.

@wypb wypb requested a review from a team as a code owner June 20, 2024 08:26
@wypb wypb force-pushed the orc_reader branch 3 times, most recently from 55a8d5b to 7325337 Compare June 21, 2024 01:52
@tdcmeehan tdcmeehan self-assigned this Jun 23, 2024
@wypb wypb changed the title [native] Add support for ORC reader and add orc native tests [native] Add support for ORC reader Jun 25, 2024
@wypb
Copy link
Contributor Author

wypb commented Jun 25, 2024

Hi @majetideepak @aditi-pandit could you please help review this PR? Thanks!

@majetideepak
Copy link
Collaborator

@wypb can you add some end-to-end tests? Thanks!

@aditi-pandit
Copy link
Contributor

@wypb : Would be great to use ORC with the QueryRunners (https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java) in an e2e test. The test should highlight differences of ORC wrt Parquet, demonstrate filter pushdown as well. Using ORC with Hive and as a format with Iceberg is perfect.

@wypb
Copy link
Contributor Author

wypb commented Jun 26, 2024

Hi @majetideepak @aditi-pandit I added TPCH tests for ORC, including the Iceberg data source. The TPCDS test for ORC is not added because some types of Velox's ORC reader currently do not implement fast path, which will cause exceptions when reading data.

Caused by: java.lang.RuntimeException: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:777)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:745)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 30 more
Caused by: VeloxRuntimeError: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
	at Unknown.# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
	at Unknown.# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
	at Unknown.# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
	at Unknown.# 5  _ZN8facebook5velox4dwio6common21SelectiveColumnReader7addNullIiEEvv(Unknown Source)
	at Unknown.# 6  _ZN8facebook5velox4dwio6common15ExtractToReader7addNullIiEEvi(Unknown Source)
	at Unknown.# 7  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE7addNullEv(Unknown Source)
	at Unknown.# 8  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE19filterPassedForNullEv(Unknown Source)
	at Unknown.# 9  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE11processNullERb(Unknown Source)
	at Unknown.# 10 _ZN8facebook5velox4dwrf12RleDecoderV2ILb0EE15readWithVisitorILb1ENS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS6_15ExtractToReaderELb1EEEEEvPKmT0_(Unknown Source)
	at Unknown.# 11 _ZN8facebook5velox4dwio6common21SelectiveColumnReader17decodeWithVisitorINS0_4dwrf12RleDecoderV2ILb0EEENS2_29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EEEEEvPNS2_10IntDecoderIXsrT_9kIsSignedEEERT0_(Unknown Source)
	at Unknown.# 12 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader15readWithVisitorINS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS5_15ExtractToReaderELb1EEEEEvN5folly5RangeIPKiEET_(Unknown Source)
	at Unknown.# 13 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader10readHelperINS0_6common10AlwaysTrueELb1ENS0_4dwio6common15ExtractToReaderEEEvPNS4_6FilterEN5folly5RangeIPKiEET1_(Unknown Source)
	at Unknown.# 14 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader13processFilterILb1ENS0_4dwio6common15ExtractToReaderEEEvPNS0_6common6FilterEN5folly5RangeIPKiEET0_(Unknown Source)
	at Unknown.# 15 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader4readEiN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 16 _ZN8facebook5velox4dwio6common12ColumnLoader12loadInternalEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 17 _ZN8facebook5velox12VectorLoader4loadEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 18 _ZN8facebook5velox12VectorLoader12loadInternalERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 19 _ZN8facebook5velox12VectorLoader4loadERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 20 _ZNK8facebook5velox10LazyVector18loadVectorInternalEv(Unknown Source)
	at Unknown.# 21 _ZNK8facebook5velox10LazyVector18loadedVectorSharedEv(Unknown Source)
	at Unknown.# 22 _ZNK8facebook5velox10LazyVector12loadedVectorEv(Unknown Source)
	at Unknown.# 23 _ZN8facebook5velox10serializer6presto17PrestoVectorSerde22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 24 _ZN8facebook5velox17VectorStreamGroup22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 25 _ZN8facebook5velox4exec17PartitionedOutput16estimateRowSizesEv(Unknown Source)
	at Unknown.# 26 _ZN8facebook5velox4exec17PartitionedOutput8addInputESt10shared_ptrINS0_9RowVectorEE(Unknown Source)
	at Unknown.# 27 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
	at Unknown.# 28 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
	at Unknown.# 29 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
	at Unknown.# 30 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 31 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 32 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
	at Unknown.# 33 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
	at Unknown.# 34 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
	at Unknown.# 35 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
	at Unknown.# 36 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
	at Unknown.# 37 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
	at Unknown.# 38 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 39 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 40 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
	at Unknown.# 41 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
	at Unknown.# 42 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
	at Unknown.# 43 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
	at Unknown.# 44 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
	at Unknown.# 45 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
	at Unknown.# 46 0x00000000000c2b23(Unknown Source)
	at Unknown.# 47 start_thread(Unknown Source)
	at Unknown.# 48 clone(Unknown Source)

@aditi-pandit
Copy link
Contributor

@wypb : Your code looks fine. When I search for ORC in the presto-native-execution directory I also see the following usage.

https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestWriter.java#L71 needs a fix as well

Please can you check about it.

@wypb wypb force-pushed the orc_reader branch 2 times, most recently from 0d3570c to 9615017 Compare June 28, 2024 09:32
@wypb
Copy link
Contributor Author

wypb commented Jun 28, 2024

Good catch, thank you @aditi-pandit I've fixed it.

@wypb
Copy link
Contributor Author

wypb commented Jun 28, 2024

@aditi-pandit I looked at the code again and found that this should not be removed. testCreateTableWithUnsupportedFormats is used to test the Velox ORC writer, and Velox currently does not support ORC writing.

@@ -73,6 +73,7 @@ private static void createTpcdsCallCenter(QueryRunner queryRunner, Session sessi
if (!queryRunner.tableExists(session, "call_center")) {
switch (storageFormat) {
case "PARQUET":
case "ORC":
Copy link
Contributor

@aditi-pandit aditi-pandit Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per https://orc.apache.org/docs/types.html ORC supports DATE type. The DWRF reader doesn't support DATE as a first-class and so we coerced all those columns to VARCHAR in tests. Do you have a plan for those ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DWRF does not support the DATE type, but Velox queries ORC's DATE type using SelectiveIntegerDirectColumnReader. My test shows that the DATE type data can be read correctly. So I don't think it is necessary to convert the DATE type to VARCHAR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently added a parameter to createAllTables to not do the DATE -> VARCHAR casting for TPCH tables https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/NativeQueryRunnerUtils.java#L69. You can use it in your tests.

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Jul 2, 2024

Hi @majetideepak @aditi-pandit I added TPCH tests for ORC, including the Iceberg data source. The TPCDS test for ORC is not added because some types of Velox's ORC reader currently do not implement fast path, which will cause exceptions when reading data.

Caused by: java.lang.RuntimeException: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:777)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:745)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 30 more
Caused by: VeloxRuntimeError: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
	at Unknown.# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
	at Unknown.# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
	at Unknown.# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
	at Unknown.# 5  _ZN8facebook5velox4dwio6common21SelectiveColumnReader7addNullIiEEvv(Unknown Source)
	at Unknown.# 6  _ZN8facebook5velox4dwio6common15ExtractToReader7addNullIiEEvi(Unknown Source)
	at Unknown.# 7  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE7addNullEv(Unknown Source)
	at Unknown.# 8  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE19filterPassedForNullEv(Unknown Source)
	at Unknown.# 9  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE11processNullERb(Unknown Source)
	at Unknown.# 10 _ZN8facebook5velox4dwrf12RleDecoderV2ILb0EE15readWithVisitorILb1ENS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS6_15ExtractToReaderELb1EEEEEvPKmT0_(Unknown Source)
	at Unknown.# 11 _ZN8facebook5velox4dwio6common21SelectiveColumnReader17decodeWithVisitorINS0_4dwrf12RleDecoderV2ILb0EEENS2_29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EEEEEvPNS2_10IntDecoderIXsrT_9kIsSignedEEERT0_(Unknown Source)
	at Unknown.# 12 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader15readWithVisitorINS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS5_15ExtractToReaderELb1EEEEEvN5folly5RangeIPKiEET_(Unknown Source)
	at Unknown.# 13 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader10readHelperINS0_6common10AlwaysTrueELb1ENS0_4dwio6common15ExtractToReaderEEEvPNS4_6FilterEN5folly5RangeIPKiEET1_(Unknown Source)
	at Unknown.# 14 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader13processFilterILb1ENS0_4dwio6common15ExtractToReaderEEEvPNS0_6common6FilterEN5folly5RangeIPKiEET0_(Unknown Source)
	at Unknown.# 15 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader4readEiN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 16 _ZN8facebook5velox4dwio6common12ColumnLoader12loadInternalEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 17 _ZN8facebook5velox12VectorLoader4loadEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 18 _ZN8facebook5velox12VectorLoader12loadInternalERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 19 _ZN8facebook5velox12VectorLoader4loadERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 20 _ZNK8facebook5velox10LazyVector18loadVectorInternalEv(Unknown Source)
	at Unknown.# 21 _ZNK8facebook5velox10LazyVector18loadedVectorSharedEv(Unknown Source)
	at Unknown.# 22 _ZNK8facebook5velox10LazyVector12loadedVectorEv(Unknown Source)
	at Unknown.# 23 _ZN8facebook5velox10serializer6presto17PrestoVectorSerde22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 24 _ZN8facebook5velox17VectorStreamGroup22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 25 _ZN8facebook5velox4exec17PartitionedOutput16estimateRowSizesEv(Unknown Source)
	at Unknown.# 26 _ZN8facebook5velox4exec17PartitionedOutput8addInputESt10shared_ptrINS0_9RowVectorEE(Unknown Source)
	at Unknown.# 27 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
	at Unknown.# 28 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
	at Unknown.# 29 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
	at Unknown.# 30 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 31 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 32 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
	at Unknown.# 33 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
	at Unknown.# 34 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
	at Unknown.# 35 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
	at Unknown.# 36 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
	at Unknown.# 37 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
	at Unknown.# 38 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 39 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 40 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
	at Unknown.# 41 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
	at Unknown.# 42 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
	at Unknown.# 43 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
	at Unknown.# 44 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
	at Unknown.# 45 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
	at Unknown.# 46 0x00000000000c2b23(Unknown Source)
	at Unknown.# 47 start_thread(Unknown Source)
	at Unknown.# 48 clone(Unknown Source)

@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles. That seems odd. Did you look deeper into what TPC-H is doing different ? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But wonder if there is anything else ? Would be great to see which particular column here is problematic.

@tdcmeehan
Copy link
Contributor

@wypb is this PR still being worked on?

@wypb
Copy link
Contributor Author

wypb commented Aug 26, 2024

Hi @tdcmeehan sorry for the late reply.

Yes, I'm still keeping an eye on this. I've been working on a few Velox PRs lately, so I haven't had time to work on this yet. I'll update this PR later this week.

@tdcmeehan
Copy link
Contributor

Let's add ORC as a supported file format in Supported Use Cases (we can also mention that Parquet is a supported format).

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wypb. Have a question:

As per https://orc.apache.org/docs/types.html ORC supports DATE type. The DWRF reader doesn't support DATE as a first-class and so we coerced all those columns to VARCHAR in tests. Do you have a plan for those ?

@Override
protected ExpectedQueryRunner createExpectedQueryRunner() throws Exception
{
this.storageFormat = "ORC";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a member variable and the same value used in all methods, you can initialize it at the class level outside the methods and use this.storageFormat each place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already refactored, thank you.

@steveburnett
Copy link
Contributor

Should this be mentioned in the doc, maybe in Supported Use Cases or Presto C++ Features?

@wypb
Copy link
Contributor Author

wypb commented Sep 4, 2024

Hi @tdcmeehan, @aditi-pandit sorry for the late reply.

@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles. That seems odd. Did you look deeper into what TPC-H is doing different ? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But wonder if there is anything else ? Would be great to see which particular column here is problematic.

I was also curious about this question before, but I didn't check the reason. Today I checked why most of the TPCDS queries failed, while all the TPCH queries passed. I debugged the code and found that the integer fields of the TPCDS table (such as the cs_sold_date_sk field of the catalog_sales table) may be NULL, and Velox does not implement the fastpath logic for integer fields encoded as RLEv2 in ORC. These two reasons combined cause most of the TPCDS queries to fail. The TPCH table fields will not be NULL, so this exception will not be triggered.

For related code, see SelectiveColumnReader::prepareNulls
https://github.com/facebookincubator/velox/blob/main/velox/dwio/common/SelectiveColumnReader.cpp#L103-L129

void SelectiveColumnReader::prepareNulls(
    RowSet rows,
    bool hasNulls,
    int32_t extraRows) {
  if (!hasNulls) {
    anyNulls_ = false;
    return;
  }
  initReturnReaderNulls(rows);
  if (returnReaderNulls_) {
    // No need for null flags if fast path.
    return;
  }
  auto numRows = rows.size() + extraRows;
  if (resultNulls_ && resultNulls_->unique() &&
      resultNulls_->capacity() >= bits::nbytes(numRows) + simd::kPadding) {
    resultNulls_->setSize(bits::nbytes(numRows));
  } else {
    resultNulls_ = AlignedBuffer::allocate<bool>(
        numRows + (simd::kPadding * 8), &memoryPool_);
    rawResultNulls_ = resultNulls_->asMutable<uint64_t>();
  }
  anyNulls_ = false;
  // Clear whole capacity because future uses could hit uncleared data between
  // capacity() and 'numBytes'.
  simd::memset(rawResultNulls_, bits::kNotNullByte, resultNulls_->capacity());
}

For the TPCH table, hasNulls is false, so there is no need to initialize rawResultNulls_, and SelectiveColumnReader#addNull() will not be called later (there is a VELOX_DCHECK(rawResultNulls_ && rawValues_) in it, which causes the query of the TPCDS table to report an exception); for the TPCDS table, hasNulls is true, and then SelectiveColumnReader::initReturnReaderNulls is executed, returnReaderNulls_ = true is calculated, and then it returns. In SelectiveColumnReader::prepareNulls, rawResultNulls_ will not be initialized, which causes an exception in the subsequent call to SelectiveColumnReader#addNull().

void SelectiveColumnReader::initReturnReaderNulls(RowSet rows) {
  if (useBulkPath() && !scanSpec_->hasFilter()) {
    anyNulls_ = nullsInReadRange_ != nullptr;
    bool isDense = rows.back() == rows.size() - 1;
    returnReaderNulls_ = anyNulls_ && isDense;
  } else {
    returnReaderNulls_ = false;
  }
}

If we modify the implementation of SelectiveIntegerDirectColumnReader#hasBulkPath() to the following logic, the TPCDS query will also succeed.

  bool hasBulkPath() const override {
    return format == DwrfFormat::kOrc && version == RleVersion_2 ? false : true;
  }

@wypb
Copy link
Contributor Author

wypb commented Sep 4, 2024

As per https://orc.apache.org/docs/types.html ORC supports DATE type. The DWRF reader doesn't support DATE as a first-class and so we coerced all those columns to VARCHAR in tests. Do you have a plan for those ?

DWRF does not support the DATE type, but Velox queries ORC's DATE type using SelectiveIntegerDirectColumnReader. My test shows that the DATE type data can be read correctly. So I don't think it is necessary to convert the DATE type to VARCHAR.

@wypb wypb force-pushed the orc_reader branch 2 times, most recently from 1df275c to c6542f0 Compare February 10, 2025 05:46
@wypb
Copy link
Contributor Author

wypb commented Feb 10, 2025

@aditi-pandit sorry for the late reply. I have synchronized the latest branch code and all the tests are passed.

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Feb 10, 2025

Thanks @wypb. Will work on this PR review and getting it merged :)

In your description you have:

"NOTE: Because Presto uses RLEv2 encoding to write ORC files, and some types of Velox ORC readers do not implement fast path readers, which will cause exceptions when Velox reads ORC, so end-to-end tests for TPCDS in ORC are not added here. Once Velox implements fast path readers for ORC RLEv2 encoding, we need to add ORC tests."

So we disabled fast-path reading for ORC types in linked PR. Though we should still follow up on the fast-path reading. Do we have an issue for this ? Would you create one ?

@aditi-pandit
Copy link
Contributor

@wypb : Had one more question about this change.. facebookincubator/velox#10939 was submit for fixing TPC-DS ORC tests. Though doesn't seem like TPC-DS tests are added in this PR. Could you add them ? Then this PR looks good for submission.

steveburnett
steveburnett previously approved these changes Feb 10, 2025
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local docs build, looks good. Thanks!

@wypb
Copy link
Contributor Author

wypb commented Feb 11, 2025

@aditi-pandit I added the orc tpcds test. The fast path problem I encountered before has been solved in facebookincubator/velox#10939. However, I have encountered a new problem now. When reading ORC Decimal with a decimal filter, the VELOX_CHECK(!scanSpec_->filter()); in velox
SelectiveDecimalColumnReader will fail. I will go to Velox to solve this problem first.

Caused by: VeloxRuntimeError: !scanSpec_->filter()  Split [Hive: file:/data/home/wypb/data/code/apache/presto/presto-native-execution/target/velox_data/iceberg_data/HIVE/tpcds/item/data/a26a5cf6-ecb3-465b-9e60-f4987d5e6b3b.orc 0 - 131054] Task 20250211_030619_00139_748k5.7.0.0.0 Operator: TableScan[3] 0
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
	at Unknown.# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
	at Unknown.# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
	at Unknown.# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
	at Unknown.# 5  _ZN8facebook5velox4dwrf28SelectiveDecimalColumnReaderIlE4readElRKN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 6  _ZN8facebook5velox4dwio6common31SelectiveStructColumnReaderBase4readElRKN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 7  _ZN8facebook5velox4dwio6common31SelectiveStructColumnReaderBase4nextEmRSt10shared_ptrINS0_10BaseVectorEEPKNS2_8MutationE(Unknown Source)
	at Unknown.# 8  _ZN8facebook5velox4dwrf13DwrfRowReader8readNextEmPKNS0_4dwio6common8MutationERSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 9  _ZN8facebook5velox4dwrf13DwrfRowReader4nextEmRSt10shared_ptrINS0_10BaseVectorEEPKNS0_4dwio6common8MutationE(Unknown Source)
	at Unknown.# 10 _ZN8facebook5velox9connector4hive7iceberg18IcebergSplitReader4nextEmRSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 11 _ZN8facebook5velox9connector4hive14HiveDataSource4nextEmRN5folly10SemiFutureINS4_4UnitEEE(Unknown Source)
	at Unknown.# 12 _ZN8facebook5velox4exec9TableScan9getOutputEv(Unknown Source)
	at Unknown.# 13 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE3_clEv(Unknown Source)
	at Unknown.# 14 _ZN8facebook5velox4exec6Driver21withDeltaCpuWallTimerIZNS2_11runInternalERSt10shared_ptrIS2_ERS4_INS1_13BlockingStateEERS4_INS0_9RowVectorEEEUlvE3_EEvPNS1_8OperatorEMNS1_13OperatorStatsENS0_13CpuWallTimingEOT_(Unknown Source)
	at Unknown.# 15 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
	at Unknown.# 16 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
	at Unknown.# 17 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
	at Unknown.# 18 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 19 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 20 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
	at Unknown.# 21 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
	at Unknown.# 22 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
	at Unknown.# 23 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
	at Unknown.# 24 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
	at Unknown.# 25 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
	at Unknown.# 26 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 27 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 28 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
	at Unknown.# 29 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
	at Unknown.# 30 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
	at Unknown.# 31 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
	at Unknown.# 32 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
	at Unknown.# 33 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
	at Unknown.# 34 0x00000000000c2b23(Unknown Source)
	at Unknown.# 35 start_thread(Unknown Source)
	at Unknown.# 36 clone(Unknown Source)

@wypb
Copy link
Contributor Author

wypb commented Feb 11, 2025

@aditi-pandit I wanted to solve the error reported above today, and then I found that facebookincubator/velox#11067 has already solved it. If I want to add all the tests of TPCDS, I have to wait until that PR is solved. Or should we not add the tpcds test first? Do you have any suggestions?

@aditi-pandit
Copy link
Contributor

@wypb : Yes, facebookincubator/velox#11067 should fix your issue.

Lets add the test to this PR so that its complete. Do we know how many queries fail ? We can disable them until the Velox PR is checked in which should be quite soon.

@wypb
Copy link
Contributor Author

wypb commented Feb 12, 2025

@aditi-pandit 12 out of 102 SQL statements failed due to the above problems. I added the other SQL statements that could run successfully to the test.

Copy link

linux-foundation-easycla bot commented Feb 12, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@wypb wypb force-pushed the orc_reader branch 8 times, most recently from d0b75b9 to 0d7b926 Compare February 12, 2025 09:42
@aditi-pandit
Copy link
Contributor

@wypb : Thanks for iterating so quickly. Your PR is looking good. Please can you merge your commits into a single one. We can then submit.

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ORC Reader support and TPCDS iceberg query updates should belong to separate commits/PRs.

@majetideepak
Copy link
Collaborator

majetideepak commented Feb 12, 2025

doDeletes for example is independent of ORC Reader Support?

@majetideepak
Copy link
Collaborator

The motivation to keep TPCDS test updates minimal is that if due to any reason we need to revert a test, the ORC Reader support will not be impacted.

@majetideepak
Copy link
Collaborator

Also, Native tables seem to be using the same test for Parquet and ORC. Why not do the same for Iceberg?

@aditi-pandit
Copy link
Contributor

Also, Native tables seem to be using the same test for Parquet and ORC. Why not do the same for Iceberg?

@majetideepak : Didn't follow the question entirely... Can you give more specifics ?

@aditi-pandit
Copy link
Contributor

The motivation to keep TPCDS test updates minimal is that if due to any reason we need to revert a test, the ORC Reader support will not be impacted.

@majetideepak : Am a bit conflicted on this. The ORC Reader commit should submit the functional tests that establish the feature I feel. This work takes care of that.

If there are specific tests that show issues (like the current Decimal one), then we should disable and fix them individually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants