
Consider increasing the dimension limit for vector fields. #40492

Closed
jtibshirani opened this issue Mar 26, 2019 · 14 comments · Fixed by #40597
Labels
>enhancement · :Search Foundations/Mapping · Team:Search Foundations · team-discuss

Comments

@jtibshirani
Contributor

The dense_vector and sparse_vector fields place a hard limit of 500 on the number of dimensions per vector. However, many common pretrained text embeddings like BERT, ELMo, and Universal Sentence Encoder produce higher-dimensional vectors, typically ranging from 512 to 1024 dimensions.

Currently, users must truncate their vectors or perform an additional dimensionality-reduction step. Perhaps we could make the dimension limit configurable, or at least increase it to a larger value?
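For anyone hitting this today, here is a minimal sketch of the workaround described above, assuming the embeddings are already available as a NumPy array; the 768-dim input and array contents are illustrative and not tied to any particular Elasticsearch API.

```python
# Minimal sketch of the truncation / dimensionality-reduction workaround.
# The 768-dim input is illustrative (e.g. BERT-base output); nothing here
# touches Elasticsearch itself.
import numpy as np
from sklearn.decomposition import PCA

MAX_DIMS = 500  # current dense_vector / sparse_vector limit

embeddings = np.random.rand(1000, 768).astype("float32")

# Option 1: naive truncation -- cheap, but discards whatever information
# lives in the dropped components.
truncated = embeddings[:, :MAX_DIMS]

# Option 2: PCA down to the limit -- keeps the directions of highest variance.
reduced = PCA(n_components=MAX_DIMS).fit_transform(embeddings)

print(truncated.shape, reduced.shape)  # (1000, 500) (1000, 500)
```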

@jtibshirani jtibshirani added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Mar 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@mayya-sharipova
Contributor

Thanks @jtibshirani

I don't see a problem with increasing the number of dimensions to 1024.

@jpountz do you see any problems with BinaryDocValuesField having a value of 1024 × 4 = 4096 bytes (in the case of dense vectors), or 1024 × 6 = 6144 bytes (in the case of sparse vectors)?
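For reference, a back-of-the-envelope check of those byte counts, assuming 4-byte float values plus a 2-byte dimension index per sparse entry, which is what the 6-bytes-per-dimension figure implies:

```python
# Reproduces the arithmetic above: dense vectors store one 4-byte float per
# dimension; sparse vectors additionally store a 2-byte dimension index.
FLOAT_BYTES = 4
DIM_INDEX_BYTES = 2

def dense_payload(dims: int) -> int:
    return dims * FLOAT_BYTES

def sparse_payload(dims: int) -> int:
    return dims * (FLOAT_BYTES + DIM_INDEX_BYTES)

print(dense_payload(1024))   # 4096 bytes
print(sparse_payload(1024))  # 6144 bytes
```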

@jpountz
Contributor

jpountz commented Mar 28, 2019

This sounds good to me. I care more about the fact that there is a reasonable limit than about the actual value of the limit.

@SthPhoenix

Hi, @mayya-sharipova!
Sorry for writing in an old issue, but is it possible to increase the limit once more? It looks like there already exist neural networks suitable for search tasks that exceed the 1024-dimension limit.

For example, mobilenet_v2 produces 1280-dimensional vectors, as pointed out by @etudor in issue SthPhoenix/elastik-nearest-neighbors-extended#4
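For anyone wondering where the 1280 comes from, here is a minimal sketch using the stock Keras MobileNetV2 as a feature extractor (no classification head, global average pooling); the random input image is only there to show the output shape.

```python
import numpy as np
import tensorflow as tf

# MobileNetV2 without its classifier head; global average pooling collapses
# the final 7x7x1280 feature map into a single 1280-dim vector per image.
model = tf.keras.applications.MobileNetV2(
    include_top=False,
    pooling="avg",
    weights="imagenet",
)

image = np.random.rand(1, 224, 224, 3).astype("float32")  # placeholder input
features = model.predict(image)
print(features.shape)  # (1, 1280)
```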

@mayya-sharipova
Contributor

@SthPhoenix what would be a reasonable dimension limit?

@SthPhoenix

Actually I'm not sure; 1280 dims is the largest embedding I've seen so far among common models.
I think 2048 might be sufficient for a while, but if there are no technical limitations, 4096 should be overkill for a long time.
Such large vectors would heavily impact performance and memory footprint, but I think people who need this should know what they are doing.

@SthPhoenix

Thanks! Hope this will be enough :)

@gabrieldevopsai

Hello guys, is it possible to increase the limit to 3072 dims?

@mayya-sharipova
Contributor

@gabrielcustodio We have not encountered models or use cases that require more than 2048 dims. Can you please describe your use case, or the models that need such a big number of dims?

@gabrieldevopsai

I used this model on pt-BR: https://allennlp.org/elmo

The output is actually 3 layers with 1024 dims each, i.e. 3 × 1024 = 3072.

Source: flairNLP/flair#886

I load this model using the flair library and then extract the embeddings.
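A minimal sketch of that extraction path with flair; the 'pt' model identifier for the Portuguese ELMo is an assumption based on flair's ELMoEmbeddings options, so substitute whatever model you actually load.

```python
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# 'pt' (Portuguese ELMo) is assumed here; flair wraps the allennlp ELMo models.
embedding = ELMoEmbeddings("pt")

sentence = Sentence("uma frase de exemplo")
embedding.embed(sentence)

# flair concatenates the three 1024-dim ELMo layers per token: 3 x 1024 = 3072.
for token in sentence:
    print(token.embedding.shape)  # torch.Size([3072])
```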

@kajca

kajca commented Sep 25, 2020

Flair stacked embeddings (forward, backward, glove/flair) would produce vectors of more than 4096 dimensions.
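Sketch of such a stacked configuration; the per-model sizes are assumptions based on the standard flair models (news-forward and news-backward are 2048 dims each, GloVe is 100), so the concatenation already comes to 2048 + 2048 + 100 = 4196 dims.

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

# Concatenating forward + backward Flair LMs with GloVe word vectors.
stacked = StackedEmbeddings([
    FlairEmbeddings("news-forward"),   # 2048 dims
    FlairEmbeddings("news-backward"),  # 2048 dims
    WordEmbeddings("glove"),           # 100 dims
])

sentence = Sentence("stacked embeddings get large quickly")
stacked.embed(sentence)

for token in sentence:
    print(token.embedding.shape)  # torch.Size([4196])
```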

@caleb-artifact

> @gabrielcustodio We have not encountered models or use cases that require more than 2048 dims. Can you please describe your use case, or the models that need such a big number of dims?

OpenAI / GPT-3 uses the following embedding sizes:
Ada - 1024 dimensions
Babbage - 2048 dimensions
Curie - 4096 dimensions
Davinci - 12288 dimensions

Getting support for at least Curie's embeddings would probably be a good idea. I understand that Davinci is extremely large, but this technical limitation will probably force us to look for alternatives to Elastic Cloud if 4096 dimensions cannot be supported.
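For context, a minimal sketch of pulling one of these embeddings with the openai client as it existed around the time of this comment; the Embedding.create call and engine name are from the legacy v0 Python client, and newer clients use a different interface.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Legacy first-generation embedding endpoint; Curie returns 4096 dims,
# Davinci would return 12288.
response = openai.Embedding.create(
    input="a short document to embed",
    engine="text-similarity-curie-001",
)
vector = response["data"][0]["embedding"]
print(len(vector))  # 4096
```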

@kvara

kvara commented Apr 21, 2022

It's 2022; maybe it's time to increase the number of dimensions?

@mayya-sharipova
Contributor

mayya-sharipova commented May 19, 2022

Also adding references to the Lucene discussions about increasing dims and why it may not be a good idea:

@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024