
Failure when using optimized Parquet writer: ArrayIndexOutOfBoundsException: Index 128901 out of bounds for length 1 #5518

Closed
findepi opened this issue Oct 12, 2020 · 5 comments · Fixed by #9245
Labels: bug (Something isn't working)

Comments

@findepi
Member

findepi commented Oct 12, 2020

SET SESSION hive.compression_codec = 'SNAPPY';
SET SESSION hive.parquet_optimized_writer_enabled = true;
CREATE TABLE store_sales_sf1_doubles WITH(format='PARQUET') AS
SELECT
    ss_sold_date_sk,
    ss_sold_time_sk,
    ss_item_sk,
    ss_customer_sk,
    ss_cdemo_sk,
    ss_hdemo_sk,
    ss_addr_sk,
    ss_store_sk,
    ss_promo_sk,
    ss_ticket_number,
    ss_quantity,
    CAST(ss_wholesale_cost AS double) ss_wholesale_cost,
    CAST(ss_list_price AS double) ss_list_price,
    CAST(ss_sales_price AS double) ss_sales_price,
    CAST(ss_ext_discount_amt AS double) ss_ext_discount_amt,
    CAST(ss_ext_sales_price AS double) ss_ext_sales_price,
    CAST(ss_ext_wholesale_cost AS double) ss_ext_wholesale_cost,
    CAST(ss_ext_list_price AS double) ss_ext_list_price,
    CAST(ss_ext_tax AS double) ss_ext_tax,
    CAST(ss_coupon_amt AS double) ss_coupon_amt,
    CAST(ss_net_paid AS double) ss_net_paid,
    CAST(ss_net_paid_inc_tax AS double) ss_net_paid_inc_tax,
    CAST(ss_net_profit AS double) ss_net_profit
FROM tpcds.sf1.store_sales;
Query 20201012_090634_00017_ud8nu, FAILED, 1 node
http://localhost:8080/ui/query.html?20201012_090634_00017_ud8nu
Splits: 38 total, 20 done (52.63%)
CPU Time: 74.6s total, 38.6K rows/s,     0B/s, 91% active
Per Node: 1.1 parallelism, 43.8K rows/s,     0B/s
Parallelism: 1.1
Peak Memory: 0B
1:06 [2.88M rows, 0B] [43.8K rows/s, 0B/s]

Query 20201012_090634_00017_ud8nu failed: Index 128901 out of bounds for length 1
java.lang.ArrayIndexOutOfBoundsException: Index 128901 out of bounds for length 1
	at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainLongDictionaryValuesWriter.fallBackDictionaryEncodedData(DictionaryValuesWriter.java:397)
	at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.fallBackAllValuesTo(DictionaryValuesWriter.java:130)
	at org.apache.parquet.column.values.fallback.FallbackValuesWriter.fallBack(FallbackValuesWriter.java:153)
	at org.apache.parquet.column.values.fallback.FallbackValuesWriter.checkFallback(FallbackValuesWriter.java:147)
	at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeLong(FallbackValuesWriter.java:181)
	at io.prestosql.parquet.writer.valuewriter.BigintValueWriter.write(BigintValueWriter.java:40)
	at io.prestosql.parquet.writer.PrimitiveColumnWriter.writeBlock(PrimitiveColumnWriter.java:127)
	at io.prestosql.parquet.writer.ParquetWriter.writeChunk(ParquetWriter.java:157)
	at io.prestosql.parquet.writer.ParquetWriter.write(ParquetWriter.java:147)
	at io.prestosql.plugin.hive.parquet.ParquetFileWriter.appendRows(ParquetFileWriter.java:110)
	at io.prestosql.plugin.hive.HiveWriter.append(HiveWriter.java:79)
	at io.prestosql.plugin.hive.HivePageSink.writePage(HivePageSink.java:318)
	at io.prestosql.plugin.hive.HivePageSink.doAppend(HivePageSink.java:270)
	at io.prestosql.plugin.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:256)
	at io.prestosql.plugin.hive.authentication.HdfsAuthentication.lambda$doAs$0(HdfsAuthentication.java:24)
	at io.prestosql.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
	at io.prestosql.plugin.hive.authentication.HdfsAuthentication.doAs(HdfsAuthentication.java:23)
	at io.prestosql.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:101)
	at io.prestosql.plugin.hive.HivePageSink.appendPage(HivePageSink.java:256)
	at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:69)
	at io.prestosql.operator.TableWriterOperator.addInput(TableWriterOperator.java:257)
	at io.prestosql.operator.Driver.processInternal(Driver.java:384)
	at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
	at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
	at io.prestosql.operator.Driver.processFor(Driver.java:276)
	at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
	at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.prestosql.$gen.Presto_unknown____20201012_085851_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
@rtjarvis

Anyone looking to solve this?

@SamWheating
Member

I believe that this is fixed by this PR:
apache/parquet-java#910

Which should be included in Parquet 1.13.

@polaris6
Member

polaris6 commented Aug 26, 2021

> I believe that this is fixed by this PR:
> apache/parquet-mr#910
>
> Which should be included in Parquet 1.13.

@SamWheating Looks like it's not the same problem; I still get an error after testing:

java.lang.ArrayIndexOutOfBoundsException: Index 37404 out of bounds for length 1
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.fallBackDictionaryEncodedData(DictionaryValuesWriter.java:298)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.fallBackAllValuesTo(DictionaryValuesWriter.java:130)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.fallBack(FallbackValuesWriter.java:155)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.checkFallback(FallbackValuesWriter.java:149)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:173)
at io.trino.parquet.writer.valuewriter.CharValueWriter.write(CharValueWriter.java:45)
at io.trino.parquet.writer.PrimitiveColumnWriter.writeBlock(PrimitiveColumnWriter.java:124)
at io.trino.parquet.writer.ParquetWriter.writeChunk(ParquetWriter.java:157)
at io.trino.parquet.writer.ParquetWriter.write(ParquetWriter.java:147)
at io.trino.plugin.hive.parquet.ParquetFileWriter.appendRows(ParquetFileWriter.java:110)
at io.trino.plugin.iceberg.IcebergPageSink.writePage(IcebergPageSink.java:262)
at io.trino.plugin.iceberg.IcebergPageSink.doAppend(IcebergPageSink.java:215)
at io.trino.plugin.iceberg.IcebergPageSink.lambda$appendPage$0(IcebergPageSink.java:153)
at io.trino.plugin.hive.authentication.HdfsAuthentication.lambda$doAs$0(HdfsAuthentication.java:26)
at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
at io.trino.plugin.hive.authentication.HdfsAuthentication.doAs(HdfsAuthentication.java:25)
at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:103)
at io.trino.plugin.iceberg.IcebergPageSink.appendPage(IcebergPageSink.java:153)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:69)
at io.trino.operator.TableWriterOperator.addInput(TableWriterOperator.java:263)
at io.trino.operator.Driver.processInternal(Driver.java:392)
at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
at io.trino.operator.Driver.tryWithLock(Driver.java:683)
at io.trino.operator.Driver.processFor(Driver.java:284)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
at io.trino.$gen.Trino_360____20210825_111942_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

@joshthoward
Member

Adding a link to the Apache Jira tracking the issue https://issues.apache.org/jira/browse/PARQUET-1852.

@alexjo2144
Member

Here's a quick write up of the problem for future reference:

The optimized writer uses the Parquet library's FallbackValuesWriter. This writer holds two ValuesWriter instances: a primary one and a fallback.

The value writer interface also has two reset methods: reset(), called when a page is finished, and resetDictionary(), called after the dictionary page has been written. The out-of-bounds error happened because reset() only resets the state of the current ValuesWriter, so if the fallback has been triggered, the primary writer is not reset. The FallbackValuesWriter does not switch back to the primary ValuesWriter until resetDictionary() is called. So reset() needs to be called again after resetDictionary() in the case where the fallback was triggered.
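To make the lifecycle concrete, here is a simplified sketch of the reset problem described above. These classes are hypothetical stand-ins, not the real org.apache.parquet API: each "writer" just buffers values so the stale state left in the primary writer is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackResetSketch {
    // Stand-in for a ValuesWriter: just buffers values.
    static class PlainWriter {
        final List<Long> buffer = new ArrayList<>();
        void write(long v) { buffer.add(v); }
        void reset() { buffer.clear(); }
    }

    // Mirrors the FallbackValuesWriter shape: a primary (dictionary)
    // writer plus a fallback, with delegation to whichever is current.
    static class FallbackWriter {
        final PlainWriter primary = new PlainWriter();
        final PlainWriter fallback = new PlainWriter();
        PlainWriter current = primary;

        void write(long v) { current.write(v); }
        void fallBack() { current = fallback; }       // dictionary too large mid-page
        void reset() { current.reset(); }             // page finished: resets ONLY current
        void resetDictionary() { current = primary; } // dictionary page written: switch back
    }

    // Lifecycle without the fix: the primary writer keeps stale values
    // after resetDictionary() switches back to it.
    static int staleValuesWithoutFix() {
        FallbackWriter w = new FallbackWriter();
        w.write(1); w.write(2);   // buffered in the primary writer
        w.fallBack();             // fallback triggered
        w.reset();                // clears only the fallback writer
        w.resetDictionary();      // back to primary, which still holds 2 values
        return w.primary.buffer.size();
    }

    // Lifecycle with the fix: call reset() again after resetDictionary().
    static int staleValuesWithFix() {
        FallbackWriter w = new FallbackWriter();
        w.write(1); w.write(2);
        w.fallBack();
        w.reset();
        w.resetDictionary();
        w.reset();                // the fix: primary writer is cleared too
        return w.primary.buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("stale values without fix: " + staleValuesWithoutFix()); // 2
        System.out.println("stale values with fix:    " + staleValuesWithFix());    // 0
    }
}
```

In the real writer, those stale buffered values are what produce the `ArrayIndexOutOfBoundsException` when fallBackDictionaryEncodedData later replays a buffer whose indices no longer match the dictionary.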
