Double sorting with aggregation not working #5120

Open
djklim87 opened this issue Jun 13, 2024 · 4 comments

djklim87 commented Jun 13, 2024

Describe the bug

When we run a terms aggregation sorted by two fields, we see two bugs:

  • The requested sort order is not applied
  • Identical calls return different results

Steps to reproduce (if applicable)

  1. Download the dataset and index config from https://dev2.manticoresearch.com/index-settings-and-data.zip
  2. Run Quickwit in Docker (image quickwit/quickwit:0.8.1)
  3. Create the index (the config is provided in the attached archive):
export HOST='http://localhost:7280'

curl -s -XPOST "${HOST}/api/v1/indexes" \
    --header "content-type: application/yaml" \
    --data-binary @./index-config.yaml
  4. Upload the data (the dataset is fairly large, so we split it into chunks):
split -l 10000 ./data.jsonl ./data_splitted.

echo "Starting loading"
for f in ./data_splitted.*; do
    echo "Upload chunk $f"
    curl -s -XPOST "${HOST}/api/v1/hn_small/ingest?commit=force" --data-binary "@$f"
    rm "$f"
done
echo "Finished"
  5. Perform the query:
curl --location "${HOST}/api/v1/hn_small/search" \
--header 'Content-Type: application/json' \
--data '{"query":"*","max_hits":0,"aggs":{"comment_ranking_avg":{"terms":{"field":"comment_ranking","size":20,"order":{"avg_field":"desc","_key":"desc"}},"aggs":{"avg_field":{"avg":{"field":"author_comment_count"}}}}}}'
  6. The results come back with the wrong sort order:
{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 6665,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 928.0 # Should be 2nd
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0 # Should be 1st
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 730.0
                },
.....                
  7. If you repeat the request several times, it can return different results for the same query:
{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 9610,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 64.0
                    },
                    "doc_count": 1,
                    "key": 1305.0
                },
                {
                    "avg_field": {
                        "value": 117.0
                    },
                    "doc_count": 1,
                    "key": 1296.0
                },
                {
                    "avg_field": {
                        "value": 40.0
                    },
                    "doc_count": 1,
                    "key": 1289.0
                },
                {
                    "avg_field": {
                        "value": 87.0
                    },
                    "doc_count": 1,
                    "key": 1287.0
                },
......

PS: Sometimes it returns results without grouping, as shown below. In that case, you should reindex your dataset.

"buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0
                },
                {
                    "avg_field": {
                        "value": 3080.0
                    },
                    "doc_count": 1,
                    "key": 980.0
                },
                {
                    "avg_field": {
                        "value": 3077.0
                    },
                    "doc_count": 1,
                    "key": 1176.0
                },

So overall, a single query can produce three different results.
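
One way to check for this non-determinism is to repeat the request and count the distinct bucket orderings that come back. The following is a minimal sketch, not part of Quickwit: `distinct_bucket_orders` and `make_fetch` are hypothetical helper names, and the URL layout is taken from the curl commands in this report.

```python
import json
import urllib.request


def make_fetch(host, index, body):
    """Return a zero-argument callable that POSTs `body` to the search endpoint."""
    url = f"{host}/api/v1/{index}/search"
    data = json.dumps(body).encode("utf-8")

    def fetch():
        req = urllib.request.Request(
            url,
            data=data,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    return fetch


def distinct_bucket_orders(fetch, runs=5, agg_name="comment_ranking_avg"):
    """Call fetch() `runs` times and collect the distinct sequences of bucket keys."""
    seen = set()
    for _ in range(runs):
        buckets = fetch()["aggregations"][agg_name]["buckets"]
        seen.add(tuple(b["key"] for b in buckets))
    return seen
```

If the returned set contains more than one entry, the same query produced more than one ordering. The `fetch` argument is injected so the check can also be run against recorded responses.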

PS: The Elasticsearch-compatible endpoint shows the same behaviour.

Expected behavior
It should return results like the following:

{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 6665,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 928.0
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 730.0
                },
..... 
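
The expected ordering amounts to a two-key sort: `avg_field` descending, with ties broken by `_key` descending. As a minimal sketch of that rule (the helper name `sort_buckets` is hypothetical, and the bucket data is copied from the first response in this report):

```python
# Buckets in the order the server returned them (first response above).
returned = [
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 928.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 961.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 730.0},
]


def sort_buckets(buckets):
    """Order by avg_field descending, then by key descending."""
    return sorted(
        buckets,
        key=lambda b: (b["avg_field"]["value"], b["key"]),
        reverse=True,
    )


expected_keys = [b["key"] for b in sort_buckets(returned)]
print(expected_keys)  # [961.0, 928.0, 730.0]
```

Applying this client-side only reorders the buckets that were already returned. It cannot recover buckets that the server cut off at `size`.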

Configuration:
Please provide:

  1. Output of quickwit --version
Quickwit 0.8.1 (aarch64-unknown-linux-gnu 2024-03-29T14:09:41Z e6c5396)
  2. The index_config.yaml
(Provided in the attached archive)
@djklim87 djklim87 added the bug Something isn't working label Jun 13, 2024
@PSeitz PSeitz self-assigned this Jun 13, 2024

PSeitz commented Jun 13, 2024

{
  "query": "*",
  "max_hits": 0,
  "aggs": {
    "comment_ranking_avg": {
      "terms": {
        "field": "comment_ranking",
        "size": 20,
        "order": {
          "avg_field": "desc",
          "_key": "desc"
        }
      },
      "aggs": {
        "avg_field": {
          "avg": {
            "field": "author_comment_count"
          }
        }
      }
    }
  }
}

This is not the correct way to define the order. It should be:

"order": [ { "avg_field": "desc" }, { "_key":"desc" } ] 

However, this is not supported yet. Currently, only sorting by a single field is supported.

djklim87 (Author) commented

The suggested order syntax does not work either, since it is not implemented yet:

curl --location 'http://127.0.0.1:7280/api/v1/hn_small/search' \
--header 'Content-Type: application/json' \
--data '{
    "query": "*",
    "max_hits": 0,
    "aggs": {
        "comment_ranking_avg": {
            "terms": {
                "field": "comment_ranking",
                "size": 20,
                "order": [
                    {
                        "avg_field": "desc"
                    },
                    {
                        "_key": "desc"
                    }
                ]
            },
            "aggs": {
                "avg_field": {
                    "avg": {
                        "field": "author_comment_count"
                    }
                }
            }
        }
    }
}'
{
    "message": "invalid aggregation request: invalid type: sequence, expected a map at line 1 column 180"
}

With an order on a single key, it works fine and returns the same results on every call.

You should probably note somewhere in the docs that only one sort argument is currently supported.
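
For reference, a request restricted to a single order key can be built as below. This is a minimal sketch, not an official client: `single_order_request` and its parameter names are hypothetical, and the field and aggregation names come from the query in this issue.

```python
import json


def single_order_request(field, metric_name, metric_field, size=20):
    """Build a terms aggregation ordered by exactly one sub-aggregation."""
    return {
        "query": "*",
        "max_hits": 0,
        "aggs": {
            "comment_ranking_avg": {
                "terms": {
                    "field": field,
                    "size": size,
                    # A map with a single entry: the only form reported to work here.
                    "order": {metric_name: "desc"},
                },
                "aggs": {metric_name: {"avg": {"field": metric_field}}},
            }
        },
    }


body = single_order_request("comment_ranking", "avg_field", "author_comment_count")
print(json.dumps(body, indent=2))
```

With this form, the relative order of buckets that share the same average is not controlled by the request.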


fmassot commented Jul 14, 2024

@PSeitz will this issue be closed by the merged PR #5121?


PSeitz commented Jul 15, 2024

There's also quickwit-oss/tantivy#2451

But it only covers error handling; it does not implement ordering by multiple fields.

@fmassot fmassot removed this from Quickwit 0.9 Sep 25, 2024