Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] "arithmetic_exception: long overflow" while running sort query on date field #16709

Closed
NamrataNerli opened this issue Nov 22, 2024 · 16 comments
Assignees
Labels
bug Something isn't working Search Search query, autocomplete ...etc v2.19.0 Issues and PRs related to version 2.19.0

Comments

@NamrataNerli
Copy link

NamrataNerli commented Nov 22, 2024

Describe the bug

We are getting "arithmetic_exception: long overflow" intermittently when we sort the results on date field.
This is the query we are using:

{
   "query":{
      "bool":{
         "adjust_pure_negative":true,
         "must":[
            {
               "match":{
                  "id":"7273617yygs2t7812ygsy"
               }
            },
            {
               "match":{
                  "scope":"2020"
               }
            },
            {
               "bool":{
                  "adjust_pure_negative":true,
                  "must_not":[
                     {
                        "match":{
                           "status":"COMPLETED"
                        }
                     }
                  ],
                  "boost":1
               }
            }
         ],
         "boost":1
      }
   },
   "_source":{
      "excludes":[
         
      ],
      "includes":[
         "dueDate",
         "scope",
         "id",
         "status"
      ]
   },
   "sort":[
      {
         "dueDate":{
            "order":"asc"
         }
      }
   ]
}

This is error message :
Screenshot 2024-11-22 at 1 05 49 PM

Related component

Search

To Reproduce

We are using this date format on dueDate field.
Screenshot 2024-11-22 at 1 26 43 PM

This is query

{
   "query":{
      "bool":{
         "adjust_pure_negative":true,
         "must":[
            {
               "match":{
                  "id": "7273617yygs2t7812ygsy"
               }
            },
            {
               "match":{
                  "scope": "2020"
               }
            },
            {
               "bool":{
                  "adjust_pure_negative":true,
                  "must_not":[
                     {
                        "match":{
                           "status": "COMPLETED"
                        }
                     }
                  ],
                  "boost":1
               }
            }
         ],
         "boost":1
      }
   },
   "_source":{
      "excludes":[
         
      ],
      "includes":[
         "dueDate",
         "scope",
         "id",
         "status"
      ]
   },
   "sort":[
      {
         "dueDate":{
            "order":"asc"
         }
      }
   ]
}

Note: Some of the documents/records that are inserted on the index do have value for dueDate field and some of them do.

We tried changing the date format to this yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_optional_time, but still we were getting arithmetic_exception.

Expected behavior

We want to sort on date field without having any issues.

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@NamrataNerli NamrataNerli added bug Something isn't working untriaged labels Nov 22, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Nov 22, 2024
@prudhvigodithi
Copy link
Member

prudhvigodithi commented Nov 25, 2024

[Triage]
Hey @NamrataNerli I was trying to re-produce the error but couldn't, please check the following sample and let me know if I'm missing anything, I was able to get the output without the arithmetic exception.

Can you please share some sample data/commands to re-produce this ?

curl -X PUT "localhost:9200/test_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "scope": {
        "type": "keyword"
      },
      "status": {
        "type": "keyword"
      },
      "dueDate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_optional_time"
      }
    }
  }
}'


curl -X POST "localhost:9200/test_index/_doc/1" -H 'Content-Type: application/json' -d'
{
  "id": "7273617yygs2t7812ygsy",
  "scope": "2020",
  "status": "IN_PROGRESS",
  "dueDate": "2024-11-22 13:00:00"
}'

curl -X POST "localhost:9200/test_index/_doc/2" -H 'Content-Type: application/json' -d'
{
  "id": "7273617yygs2t7812ygsy",
  "scope": "2020",
  "status": "IN_PROGRESS"
}'


curl -X POST "localhost:9200/test_index/_doc/3" -H 'Content-Type: application/json' -d'
{
  "id": "7273617yygs2t7812ygsy",
  "scope": "2020",
  "status": "IN_PROGRESS",
  "dueDate": "2024-11-23"
}'


curl -X POST "localhost:9200/test_index/_doc/4" -H 'Content-Type: application/json' -d'
{
  "id": "7273617yygs2t7812ygsy",
  "scope": "2020",
  "status": "IN_PROGRESS",
  "dueDate": null
}'

curl -X POST "localhost:9200/test_index/_doc/5" -H 'Content-Type: application/json' -d'
{
  "id": "7273617yygs2t7812ygsy",
  "scope": "2020",
  "status": "IN_PROGRESS"
}'

curl -X GET "localhost:9200/test_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "id": "7273617yygs2t7812ygsy"
          }
        },
        {
          "match": {
            "scope": "2020"
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "status": "COMPLETED"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "_source": {
    "includes": [
      "dueDate",
      "scope",
      "id",
      "status"
    ]
  },
  "sort": [
    {
      "dueDate": {
        "order": "asc"
      }
    }
  ]
}'

Output

{
 "took": 4,
 "timed_out": false,
 "_shards": {
   "total": 1,
   "successful": 1,
   "skipped": 0,
   "failed": 0
 },
 "hits": {
   "total": {
     "value": 5,
     "relation": "eq"
   },
   "max_score": null,
   "hits": [
     {
       "_index": "test_index",
       "_id": "1",
       "_score": null,
       "_source": {
         "scope": "2020",
         "dueDate": "2024-11-22 13:00:00",
         "id": "7273617yygs2t7812ygsy",
         "status": "IN_PROGRESS"
       },
       "sort": [
         1732280400000
       ]
     },
     {
       "_index": "test_index",
       "_id": "3",
       "_score": null,
       "_source": {
         "scope": "2020",
         "dueDate": "2024-11-23",
         "id": "7273617yygs2t7812ygsy",
         "status": "IN_PROGRESS"
       },
       "sort": [
         1732320000000
       ]
     },
     {
       "_index": "test_index",
       "_id": "2",
       "_score": null,
       "_source": {
         "scope": "2020",
         "id": "7273617yygs2t7812ygsy",
         "status": "IN_PROGRESS"
       },
       "sort": [
         9223372036854775807
       ]
     },
     {
       "_index": "test_index",
       "_id": "4",
       "_score": null,
       "_source": {
         "scope": "2020",
         "dueDate": null,
         "id": "7273617yygs2t7812ygsy",
         "status": "IN_PROGRESS"
       },
       "sort": [
         9223372036854775807
       ]
     },
     {
       "_index": "test_index",
       "_id": "5",
       "_score": null,
       "_source": {
         "scope": "2020",
         "id": "7273617yygs2t7812ygsy",
         "status": "IN_PROGRESS"
       },
       "sort": [
         9223372036854775807
       ]
     }
   ]
 }
}

Thank you
@msfroh @getsaurabh02

@NamrataNerli
Copy link
Author

NamrataNerli commented Nov 25, 2024

Hi @prudhvigodithi ,
We are seeing this issue intermittently on large dataset. On our cluster we have 14 indices & they have same mapping which is this

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "scope": {
        "type": "keyword"
      },
      "status": {
        "type": "keyword"
      },
      "dueDate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||date_optional_time"
      }
    }
  }
}

When we execute the query against all these 14 indices we are getting arithmetic exceptions . This is how we are executing the query.

GET test-index-*/_search
{
   "query":{
      "bool":{
         "adjust_pure_negative":true,
         "must":[
            {
               "match":{
                  "id": "7273617yygs2t7812ygsy"
               }
            },
            {
               "match":{
                  "scope": "2020"
               }
            },
            {
               "bool":{
                  "adjust_pure_negative":true,
                  "must_not":[
                     {
                        "match":{
                           "status": "COMPLETED"
                        }
                     }
                  ],
                  "boost":1
               }
            }
         ],
         "boost":1
      }
   },
   "_source":{
      "excludes":[
         
      ],
      "includes":[
         "dueDate",
         "scope",
         "id",
         "status"
      ]
   },
   "sort":[
      {
         "dueDate":{
            "order":"asc"
         }
      }
   ]
}

This is the data distribution on the indices
Screenshot 2024-11-25 at 2 15 19 PM

When we tried to execute the same query against smaller dataset(i.e. against 10% of the above mentioned data volume) , we did not face any issues.

We are currently using OS version 2.11 & these are sample document

Document with dueDate value

{
    "scope": "2020",
    "status": "COMPLETED",
    "dueDate" : "2024-01-27T13:12:50.980",
    "id" :  "7273617yygs2t7812ygsy"
}

Document without dueDate value

{
    "scope": "2020",
    "status": "COMPLETED",
    "id" :  "7273617yygs2t7812ygsy"
}

@msfroh
Copy link
Collaborator

msfroh commented Nov 25, 2024

We are currently using OS version 2.11 & these are sample document

This may have been fixed by #12676, which was merged for OpenSearch 2.13.

Are you able to try upgrading to 2.13 or later and testing?

@NamrataNerli
Copy link
Author

Hi @msfroh ,

We upgraded the OS version to 2.13 but still we are seeing the same issue

@noeg-1
Copy link

noeg-1 commented Dec 4, 2024

Hello, everyone.

I replicated the issue by creating an index with different combinations of date formats for the dueDate field.

In my environment, the error was triggered when custom date formats were used (i.e., yyyy-MM-dd HH:mm:ss, yyyy-MM-dd), even when using each one of them alone.

When using "strict_date_optional_time" alone or paired with "epoch_millis" the issue did not show up.

# Reproduce the error ----------------------------------------

PUT my-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "domainAttributes": {
        "properties": {
          "dueDate": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||date_optional_time"
          },
          "firmId": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Ingested from 30 to 50 records without dueDate in order to reproduce the error.

POST _bulk
{"index":{"_index":"my-index"}}
{"domainAttributes":{"firmId":"12345678910111213"}}


# The issue is intermittent. Some searches are completely successful and others are not

GET my-index/_search
{
  "query" :
  {
    "match": {
      "domainAttributes.firmId": "12345678910111213"
      }
  },
  "sort":[
      {
         "domainAttributes.dueDate":{
            "order":"asc"
         }
      }
   ]
}


# Avoid the error ----------------------------------------------

# Possible workaround 1
# It succeeded every time it was tested. Tested with 3,500 records without dueDate, and 1 with a dueDate value (i.e., 2022-03-31T15:40:58.324)

GET my-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "domainAttributes.firmId": "12345678910111213"
          }
        },
        {
          "exists": {
            "field": "domainAttributes.dueDate"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "domainAttributes.dueDate": {
        "order": "asc",
        "missing": "_last"
      }
    }
  ]
}


# Possible workaround 2
# Having only strict_date_optional_time seems to avoid the error (Tested with 200 records)
# Tested further with a common date (i.e., 2022-03-31T15:40:58.324) having 100 records with dueDate, and up to 3,000 records without dueDate. Searches succeeded every time.

PUT my-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "domainAttributes": {
        "properties": {
          "dueDate": {
            "type": "date",
            "format": "strict_date_optional_time"
          },
          "firmId": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Dec 4, 2024

Thanks @NamrataNerli and @noeg-1, following is the stack trace of the error.

root_cause": [
      {
        "type": "arithmetic_exception",
        "reason": "arithmetic_exception: long overflow",
        "stack_trace": "NotSerializableExceptionWrapper[arithmetic_exception: long overflow]\n\tat java.lang.Math.addExact(Math.java:903)\n\tat java.time.Instant.toEpochMilli(Instant.java:1237)\n\tat org.opensearch.index.mapper.DateFieldMapper$Resolution$1.convert(DateFieldMapper.java:121)\n\tat org.opensearch.index.mapper.DateFieldMapper$DateFieldType.parseToLong(DateFieldMapper.java:546)\n\tat org.opensearch.index.mapper.DateFieldMapper$DateFieldType.isFieldWithinQuery(DateFieldMapper.java:584)\n\tat org.opensearch.search.sort.FieldSortBuilder.isBottomSortShardDisjoint(FieldSortBuilder.java:486)\n\tat org.opensearch.search.internal.ShardSearchRequest$RequestRewritable.rewrite(ShardSearchRequest.java:550)\n\tat org.opensearch.search.internal.ShardSearchRequest$RequestRewritable.rewrite(ShardSearchRequest.java:532)\n\tat org.opensearch.index.query.Rewriteable.rewrite(Rewriteable.java:85)\n\tat 

@bowenlan-amzn

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Dec 5, 2024

After multiple tries finally I was able to reproduce the error, I have created a small error_hunter.sh bash script for other to re-produce this arithmetic_exception. Added some comments to the script to understand the flow.

#!/bin/bash

OS_URL="http://localhost:9200"
INDEX_NAME="my-index"
BULK_FILE="bulk_data.json"
SHARD_COUNT=6

# 1. Delete existing index if it exists
echo "Deleting existing index..."
curl -X DELETE "$OS_URL/$INDEX_NAME"
sleep 2

# 2. Create index with multiple shards
echo "Creating index with $SHARD_COUNT shards..."
curl -X PUT "$OS_URL/$INDEX_NAME" -H "Content-Type: application/json" -d '{
  "settings": {
    "number_of_shards": '$SHARD_COUNT',
    "index.max_result_window": 50000
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "domainAttributes": {
        "properties": {
          "dueDate": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||date_optional_time"
          },
          "firmId": {
            "type": "keyword"
          }
        }
      }
    }
  }
}'

sleep 2

# 3. TEst Data
echo "Generating distributed test data..."
> "$BULK_FILE"

# First add the records without dates (3500 docs)
for i in $(seq 1 3500); do
    routing_key=$((RANDOM % SHARD_COUNT))  # Distribute across shards
    echo '{"index":{"_index":"'$INDEX_NAME'","routing":"'$routing_key'"}}' >> "$BULK_FILE"
    echo '{"domainAttributes":{"firmId":"12345678910111213"}}' >> "$BULK_FILE"
done

# Add records with dates
for i in $(seq 1 50); do
    routing_key=$((RANDOM % SHARD_COUNT))
    date="2022-03-$((i % 30 + 1))T15:40:58.324"
    echo '{"index":{"_index":"'$INDEX_NAME'","routing":"'$routing_key'"}}' >> "$BULK_FILE"
    echo '{"domainAttributes":{"firmId":"12345678910111213","dueDate":"'$date'"}}' >> "$BULK_FILE"
done

# 4. Bulk Index the data
echo "Indexing data..."
curl -X POST "$OS_URL/_bulk" -H "Content-Type: application/json" --data-binary @"$BULK_FILE"

sleep 5

# 5. Function to perform search tests
run_search_test() {
    echo "Running search iteration $1..."
    response=$(curl -s -X GET "$OS_URL/$INDEX_NAME/_search?allow_partial_search_results=false&error_trace=true" -H "Content-Type: application/json" -d '{
      "query": {
        "match": {
          "domainAttributes.firmId": "12345678910111213"
        }
      },
      "sort": [
        {
          "domainAttributes.dueDate": {
            "order": "asc"
          }
        }
      ]
    }')
    
    if echo "$response" | grep -q "arithmetic_exception"; then
        echo "Found arithmetic exception!"
        echo "$response" | grep -A 2 "arithmetic_exception"
        return 0
    fi
    return 1
}

# 6. Run test iterations (running the query 1000 times)
echo "Running search tests..."
found_error=false
for i in $(seq 1 1000); do
    if run_search_test $i; then
        found_error=true
        break
    fi
    sleep 0.1
done

if [ "$found_error" = false ]; then
    echo "No error found after 100 attempts"
fi

# Clean up
rm "$BULK_FILE"
echo "Test completed."

I can see this error across the 2.x line (2.18, 2.17, 2.15, 2,13) but not on main branch. I will debug further to find the root cause and will create a PR.

Thank you.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Dec 6, 2024

There is a similar issue #5713 is past related to the same topic.
@dblock @soltmar

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Dec 6, 2024

I have a draft PR #16796 while we discuss on more on this, @noeg-1 and @NamrataNerli can you please try to clone my fork, run the cluster and see you can re-produce the error? With the fix part of the PR I dont see the long overflow error anymore.

  • git clone https://github.com/prudhvigodithi/OpenSearch.git -b 2.x
  • With jdk21 installed on local, run ./gradlew clean run -PnumNodes=6, this will bring up a cluster and can be accessed http://localhost:9200.

Thanks.

@noeg-1
Copy link

noeg-1 commented Dec 6, 2024

Hello, @prudhvigodithi. I have cloned the fork, run the cluster, and tried to trigger the Arithmetic Exception. The exception was not triggered in any of the tests I ran. I tested with the provided conditions, ingesting increasing amounts of documents with missing dueDate values, up to 20,000 documents.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Dec 8, 2024

Did some deep dive and looks like the issue is coming from joda time, tried adding the flow below at high level.

ShardSearchRequest ---> setBottomSortValues(SearchSortValuesAndFormats) --->SearchSortValuesAndFormats constructor---> For each sort value: --->DocValueFormat.format(rawValue)---> DocValueFormat.DateTime.format(long)-->Joda-Time processes Long.MAX_VALUE--->Overflow occurs--->Returns "+292278994-08-17T07:12:55"

From the logs

[2024-12-06T15:14:02,402][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() 1647099658324
[2024-12-06T15:14:02,402][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() 2022-03-12 15:40:58
[2024-12-06T15:14:03,398][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() 9223372036854775807
[2024-12-06T15:14:03,398][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() +292278994-08-17 07:12:55
[2024-12-06T15:14:03,399][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() 9223372036854775807
[2024-12-06T15:14:03,399][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() +292278994-08-17 07:12:55
[2024-12-06T15:14:03,399][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() 9223372036854775807
[2024-12-06T15:14:03,399][INFO ][o.o.s.i.ShardSearchRequest] [runTask-0] request.getBottomSortValues() +292278994-08-17 07:12:55

The reason from beginning I was not able to reproduce the error on main branch is the joda time is deprecated on main and was not backported to 2.x as its a breaking change. Here are some related PR's #9350 , #9355, #2768.

Since we dont want to backport this to 2.x, I will add a generic solution on main which can clamp to a valid range when called toEpochMilli().

Also thanks a lot @noeg-1 for the testing. I have more generic solution to my fork (#16796) can you please test one more time if the results are as expected (from my end with the above script #16709 (comment) I dont see the error anymore). Once testing is done I will create a PR a main branch and then will backport it to 2.x.

@reta

@reta
Copy link
Collaborator

reta commented Dec 8, 2024

@reta

@msfroh tagging you here since you've been looking into similar (or same?) issue before, may be missing some edge cases? Thanks @prudhvigodithi for looking into it.

@noeg-1
Copy link

noeg-1 commented Dec 9, 2024

Hello, everyone. @prudhvigodithi, I have tested again, using the same conditions shared for reproducing the error. Tried again with up to 20,000 documents with missing values for dueDate. From my end, the Arithmetic Exception was not triggered in any of the tests

@prudhvigodithi
Copy link
Member

Thanks, I have the created a PR for main branch #16812 and we can backport to 2.x.
@bowenlan-amzn @msfroh

@prudhvigodithi prudhvigodithi moved this from Todo to In Progress in Performance Roadmap Dec 9, 2024
@bowenlan-amzn
Copy link
Member

import org.joda.time.DateTime;

public class Main {
    public static void main(String[] args) {
        DateTime dateTime = new DateTime(Long.MAX_VALUE);
        System.out.println(dateTime.toString());
    }
}

output 292278994-08-17T07:12:55.807Z

@prudhvigodithi prudhvigodithi added the v2.19.0 Issues and PRs related to version 2.19.0 label Dec 10, 2024
@prudhvigodithi
Copy link
Member

Closing this issue since the overflow prevention is added when handling date values, related PR #16812.
Thanks

@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Dec 27, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Dec 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search Search query, autocomplete ...etc v2.19.0 Issues and PRs related to version 2.19.0
Projects
Status: Done
Status: Done
Development

No branches or pull requests

6 participants