Added DynamicEmbedding RFC #446

Open
wants to merge 5 commits into
base: master

divyashreepathihalli

Added DynamicEmbedding RFC

DynamicEmbedding(
input_dim=5,
output_dim=2,
input_length=5,
Member

Can we support the inputs with dynamic shapes?

The input to the layer can be dynamic. But if you are asking whether input_dim, which is the same as the vocabulary size, can be dynamic: it cannot.

Member

Excellent, thank you for your answer! I would like to know what input_dim means. From my understanding, input_dim should be less than or equal to the vocabulary size, which stays fixed while training is running. Is that right?

input_dim should be the vocabulary size.

Member

@rhdong rhdong May 23, 2023

Thank you for the clarification! If input_dim and the vocabulary size are not dynamic, some critical scenarios may not be supported. Some industrial dynamic embedding scenarios require algorithm engineers to encode features as uint64_t, whose possible range is [0, std::numeric_limits<uint64_t>::max()]. That means input_dim and the vocabulary size should not have to be set, because the key space is practically unbounded.

Member

@rhdong rhdong May 25, 2023

> @rhdong, I would like to clarify that for the layer initialization input_dim is the vocabulary size (I tried to keep it consistent with the [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding#:~:text=.Embedding(-,input_dim%2C,-output_dim%2C%0A%C2%A0%20%C2%A0%20embeddings_initializer)). The input to the layer can be of any dynamic shape.

Hi @divyashreepathihalli, thank you for your clarification. I understand now. Regarding "The input to the layer can be of any dynamic shape": that totally makes sense. But I'm afraid that the input_dim setting would limit the feature encoding space. In the dynamic embedding context (compared to the original static embedding in current TensorFlow), input_dim would have to be std::numeric_limits<uint64_t>::max(). I will try to explain it in a Google doc. Before that, you could refer to the TFRA API design, where only embedding_size needs to be configured (similar to output_dim): https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py#L117


@MoFHeka MoFHeka May 25, 2023

> @rhdong, I would like to clarify that for the layer initialization input_dim is the vocabulary size (I tried to keep it consistent with the [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding#:~:text=.Embedding(-,input_dim%2C,-output_dim%2C%0A%C2%A0%20%C2%A0%20embeddings_initializer)). The input to the layer can be of any dynamic shape.

I think @divyashreepathihalli may have slightly misunderstood what dynamic embedding means. For example, consider a training feature that is both large-scale and sparse, such as USER_ID. If we apply the vocabulary method to USER_ID, every USER_ID is mapped into a range of vocabulary size, which compresses the information. Since the vocabulary size is fixed, this is still a static embedding. Dynamic embedding means that all inputs are handled without conflicts through a hashmap. The size of a dynamic embedding is not fixed and cannot be predicted, because the set of USER_IDs grows as the business grows.
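
To make the distinction concrete, here is a minimal pure-Python sketch of a hashmap-backed table that grows on demand. The class name `DynamicEmbeddingTable`, its methods, and the initializer range are hypothetical and only for illustration; this is neither the RFC's API nor TFRA's.

```python
import random


class DynamicEmbeddingTable:
    """Toy hash-map-backed embedding table (hypothetical, for illustration)."""

    def __init__(self, embedding_size, seed=0):
        self.embedding_size = embedding_size
        self._rng = random.Random(seed)
        self._table = {}  # raw key (e.g. a 64-bit USER_ID) -> embedding vector

    def lookup(self, keys):
        """Return one vector per key, creating a row for each unseen key."""
        vectors = []
        for key in keys:
            if key not in self._table:
                # No vocabulary size is configured: the table simply grows.
                self._table[key] = [
                    self._rng.uniform(-0.05, 0.05)
                    for _ in range(self.embedding_size)
                ]
            vectors.append(self._table[key])
        return vectors

    def __len__(self):
        return len(self._table)


table = DynamicEmbeddingTable(embedding_size=2)
first = table.lookup([42, 2**64 - 1])  # any uint64 value is a valid key
second = table.lookup([42, 7])         # key 42 reuses its existing row
print(len(table))                      # number of distinct keys seen so far
```

Only the embedding size is fixed here; the number of rows follows the data, which is the property a fixed `input_dim` cannot provide.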


@thorneliu thorneliu Jun 2, 2023

Besides the USER_ID example from @MoFHeka: in our recommender system, we use user-item crossed features to improve the accuracy and relevance of our recommendations. Combining multiple features into a unique identifier gives a more comprehensive representation of the relationship between users and items, which results in better recommendations. Hashing with tf.sparse.cross_hashed or xxhash generates a sparse key in the range [0, std::numeric_limits<uint64_t>::max()]. For such a large-scale, sparse feature, a dynamic size is mandatory.
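
As a rough illustration of such a crossed key, the sketch below hashes a (user, item) pair to an unsigned 64-bit integer. It uses `hashlib.blake2b` from the Python standard library purely as a stand-in; it does not reproduce the hash values of tf.sparse.cross_hashed or xxhash, and the helper name `cross_key` and the sample IDs are made up.

```python
import hashlib

UINT64_MAX = 2**64 - 1


def cross_key(user_id, item_id):
    """Hash a (user, item) pair to an unsigned 64-bit key (illustrative helper)."""
    # The separator keeps ("ab", "c") and ("a", "bc") from producing the same payload.
    payload = f"{user_id}\x1f{item_id}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


key = cross_key("user_123", "item_456")

# A fixed-size vocabulary would force the key into a bounded range, for
# example by taking it modulo the table size, which introduces collisions.
bucket = key % 1_000_000
print(key, bucket)
```

The full key can land anywhere in [0, 2**64 - 1], so no practical `input_dim` covers the space without folding distinct crosses into the same row.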

@rhdong @MoFHeka thank you for the clarification. I tried to read up further. If I understand correctly, you are looking for a dynamic vocabulary size and a dynamic embedding matrix as well, correct? One that would keep growing?

As of now, our scope of work is limited to maintaining a fixed-size vocabulary and a fixed embedding size, and updating the vocabulary based on the inputs received by the layer and on eviction policies. The embedding values will be remapped whenever the vocabulary is updated based on input patterns (most frequently seen input, TTL, etc.). If an input key is not in the vocabulary, it is mapped to a default value. However, we keep track of these keys and add them to the vocabulary when the updates are done in the callback (new keys are added by kicking out old keys according to the specified eviction policies).
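
The behavior described above can be sketched roughly as follows. This is a toy under stated assumptions, not the RFC implementation: the class name `FixedSizeVocab`, the reserved default index 0, and the least-frequently-seen eviction rule are all choices made for this example.

```python
from collections import Counter


class FixedSizeVocab:
    """Toy fixed-capacity vocabulary with frequency-based eviction."""

    OOV_INDEX = 0  # assumed default slot for keys not in the vocabulary

    def __init__(self, capacity):
        self.capacity = capacity
        self.key_to_index = {}            # key -> index in 1..capacity
        self.in_vocab_counts = Counter()  # hit counts for admitted keys
        self.pending_counts = Counter()   # unseen keys tracked between updates

    def lookup(self, keys):
        indices = []
        for key in keys:
            if key in self.key_to_index:
                self.in_vocab_counts[key] += 1
                indices.append(self.key_to_index[key])
            else:
                self.pending_counts[key] += 1
                indices.append(self.OOV_INDEX)
        return indices

    def update(self):
        """Admit pending keys; return the indices whose key changed.

        The returned indices are the embedding rows that would have to be
        remapped or re-initialized after the vocabulary update.
        """
        remapped = []
        for key, count in self.pending_counts.most_common():
            if len(self.key_to_index) < self.capacity:
                index = len(self.key_to_index) + 1
            else:
                victim = min(self.key_to_index,
                             key=lambda k: self.in_vocab_counts[k])
                if self.in_vocab_counts[victim] >= count:
                    continue  # candidate is not seen more often than the victim
                index = self.key_to_index.pop(victim)
                del self.in_vocab_counts[victim]
            self.key_to_index[key] = index
            self.in_vocab_counts[key] = count
            remapped.append(index)
        self.pending_counts.clear()
        return remapped


vocab = FixedSizeVocab(capacity=2)
before = vocab.lookup(["a", "a", "b"])           # nothing admitted yet
vocab.update()                                   # called from the callback
after = vocab.lookup(["a", "a", "c", "c", "c"])  # "c" is still unknown here
evicted_rows = vocab.update()                    # "c" replaces the rarer "b"
```

Note that the capacity never changes: a new key only enters by taking over the row of an evicted key, which is the point where this design differs from a growing hashtable.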

Member

@rhdong rhdong Jun 5, 2023

@divyashreepathihalli It's my pleasure. Considering the practical dynamic embedding scenarios we have encountered, a hashtable-based dynamic vocabulary size would be a fundamental requirement. I guess one advantage of your current design is that there is no need to modify tf.optimizer, which makes sense. But in addition to the considerations we discussed above, I'm also a little worried that it will introduce a data consistency issue, caused by decoupling the embedding indexing from the embedding lookup, especially when eviction is involved. Applying atomic or lock mechanisms to the ID and the embedding is challenging when they are handled in two separate ops.
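
One possible interleaving behind this concern, written out as a deterministic toy: the keys, values, and three-step ordering below are invented for illustration and are only one reading of how the two separate ops could race.

```python
# Step 1 (indexing op): a worker resolves key -> index.
key_to_index = {"user_1": 0, "user_2": 1}
embeddings = [[0.1, 0.1], [0.2, 0.2]]  # row i belongs to the key mapped to i
index = key_to_index["user_2"]

# Step 2 (concurrent vocabulary update): "user_2" is evicted and its slot
# is handed to a new key, whose row is re-initialized.
del key_to_index["user_2"]
key_to_index["user_3"] = 1
embeddings[1] = [0.0, 0.0]

# Step 3 (lookup op): the worker gathers with the index from step 1.
stale_vector = embeddings[index]

# The worker asked for "user_2" but received the row now owned by "user_3".
print(index, stale_vector)
```

With a single hashtable op that maps key to vector, steps 1 and 3 collapse into one operation, so there is no window in which the index can go stale.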


The image below illustrates the workflow when the parameter server
strategy is used. PSS supports asynchronous training. Each worker
will have a copy of the vocabulary, which will be consistent across
Member

@rhdong rhdong May 18, 2023

Hi @divyashreepathihalli, may I have your confirmation here? Does this mean that each worker holds a full copy of the vocabulary that maps keys to indices, while the actual embedding vectors are stored on the parameter servers in a dense format (for example, as a tf.Variable)? Am I correct? Thank you so much!

That is correct. Each worker should have a copy of the vocabulary (the vocab -> index mapping). The embedding variable will be split across distributed servers.

Member

@rhdong rhdong May 23, 2023

Hi @divyashreepathihalli, thank you for your comment! If each worker holds a full copy of the key-index mapping, there has to be some upper limit on the vocabulary size. To the best of my knowledge, vocabularies in some industrial scenarios can reach tens or hundreds of billions of entries, which makes the memory consumption on GPU/TPU unbearably large. One practical solution is to store the key-value pairs in an abstract hashtable in a distributed way, as TFRA does. Hope it's helpful. Thanks!
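
A back-of-the-envelope estimate of that per-worker cost, assuming 8-byte keys and 8-byte indices and ignoring any hashtable overhead (all three are assumptions for this sketch; real tables need more):

```python
def vocab_copy_bytes(num_keys, key_bytes=8, index_bytes=8):
    """Lower bound on the memory for one full key -> index copy."""
    return num_keys * (key_bytes + index_bytes)


for num_keys in (10**10, 10**11):  # tens and hundreds of billions of keys
    gigabytes = vocab_copy_bytes(num_keys) / 10**9
    print(f"{num_keys:.0e} keys -> at least {gigabytes:,.0f} GB per worker")
```

Even this lower bound is 160 GB per worker at ten billion keys, before a single embedding vector is stored, which is more than the memory of a typical single accelerator.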

I agree with you. The proposed design would be the initial implementation and the distributed KV server would definitely be the way to go going forward.

@Mr-Nineteen

Dynamic embedding is a very important feature for us.

When training the ranking models that support scenarios such as search, recommendation, and advertising, we encountered the following problems:

  1. For feature selection in ranking models for e-commerce search and promotion, the industry currently mostly relies on large-scale discrete IDs. ID features (product IDs, user IDs, brand IDs, etc.) are large and sparse, and the native TF framework is not suitable for them.
  2. A TensorFlow variable has a fixed size, so new IDs cannot be added dynamically without restarting training.

The main reason we need this feature:

  1. It supports dynamic scale-out of dynamic embedding features at the TB level.

In this design approach, the DynamicEmbedding layer is composed of two
layers: the DynamicLookup layer and the Embedding layer. The
DynamicLookup layer is responsible for the following tasks:
* Maintaining a vocabulary table using an eviction policy that is
Member

@Lifann Lifann Jun 5, 2023

We are currently using parameters totaling about 1E13 bytes in production. Will it be very expensive to maintain the vocabulary and indexes for parameters that large?

updated based on input pattern.
* Performing vocabulary lookup for the given input and returning
integer indexes.
* The index is then passed to the Embedding layer, which looks
Member

@Lifann Lifann Jun 5, 2023

In many cases, we don't know exactly how many keys a feature has, since the properties of videos, images, commodities, video games, etc. are always changing. Presetting a vocab/index range may lead to wasted storage or feature conflicts.

updates corresponding to evolving input patterns and vocabulary changes.
### Goal
* Works across accelerators (GPU / TPU)
* Works with Parameter server strategy (asynchroous distributed training)

@pjannaty pjannaty Jun 30, 2023

Typo: asynchronous

8 participants