
A question on vectorizer using word2vec #5

Open

bdqnghi opened this issue Sep 27, 2017 · 1 comment


bdqnghi commented Sep 27, 2017

Hey, thanks for this awesome implementation. This is exactly what I'm looking for, since the details of the paper are not trivial to understand.

In the vectorizer part, you adopt the word2vec technique to train the embeddings for the AST, which is great. But I don't understand the intuition behind this; is there any reference?

In word2vec, the embedding matrix serves as a lookup table: the input is a one-hot encoded vector, and multiplying the one-hot input by the embedding matrix effectively just selects the matrix row corresponding to the "1" in the input.
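
Just to be concrete about what I mean by the one-hot lookup, here is a tiny numpy sketch (purely illustrative, not from your code):

```python
import numpy as np

# Illustrative only: the one-hot matmul and the direct row lookup are equivalent.
vocab_size, embed_dim = 5, 3
embedding = np.random.rand(vocab_size, embed_dim)  # embedding matrix

token_index = 2
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0

by_matmul = one_hot @ embedding      # multiply one-hot vector by the matrix...
by_lookup = embedding[token_index]   # ...same as indexing the corresponding row

assert np.allclose(by_matmul, by_lookup)
```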

But this case seems different: after learning the embeddings, you save them along with NODE_MAP (the dictionary that stores the index of each token in your implementation) into the pickle. How can we know that the index of a vector in the embedding table matches the index in NODE_MAP?
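
For reference, the usage I would expect after loading the pickle is roughly the following; the file name and pickle layout here are just my guesses, not taken from your repository:

```python
import pickle

# Hypothetical sketch of the lookup the question is about: the row index into
# the embedding table is assumed to come from NODE_MAP. File name and pickle
# layout are illustrative.
with open("vectorizer_output.pkl", "rb") as f:
    embeddings, NODE_MAP = pickle.load(f)

node_type = "FunctionDef"
vector = embeddings[NODE_MAP[node_type]]  # only valid if row i was trained for token i
```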

@lolongcovas

Yes, the original paper follows Building Program Vector Representations for Deep Learning to embed each AST node into a feature vector. The approach is quite similar to word2vec, except that in the case of an AST the contextual information is the node's children. The source code of that implementation can be found here. Looking at the code (it is a bit hard to understand...), it seems that for each AST they build a new neural network (NN) with the same parameters W and b (for example, the NNs for ASTs of depth 2 and depth 3 differ in their forward pass, but they share the same W and b).
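
If it helps, here is a simplified numpy sketch of that coding criterion (my own illustration, not the repository's code); it collapses the paper's position-dependent weight matrices into a single shared W for brevity:

```python
import numpy as np

def parent_reconstruction(children_vecs, W, b, weights=None):
    """Approximate a parent node's vector from its children's vectors.
    Simplified: one shared W instead of position-dependent weight matrices;
    `weights` would normally depend on the children's subtree sizes."""
    if weights is None:
        weights = np.full(len(children_vecs), 1.0 / len(children_vecs))
    combined = sum(w * (W @ c) for w, c in zip(weights, children_vecs))
    return np.tanh(combined + b)

embed_dim = 30
W = np.random.randn(embed_dim, embed_dim) * 0.1  # shared across all ASTs
b = np.zeros(embed_dim)

children = [np.random.randn(embed_dim) for _ in range(3)]
parent_hat = parent_reconstruction(children, W, b)
# Training would minimise the distance between parent_hat and the parent's own
# embedding (plus a negative-sampling / margin term), updating both the node
# embeddings and the shared W, b.
```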
