Empty .sa.tok files after select_functions & request to release self_training dataset #84

Open
PrithwishJana opened this issue Sep 28, 2022 · 0 comments


I am trying to create the self-training dataset, as per the instructions at https://github.com/facebookresearch/CodeGen/blob/main/docs/TransCoder-ST.md.

I obtained 500 .json.gz files from Google BigQuery, preprocessed them, and got the following symlinks without errors:

[abc@def CodeGen]$ ls xyz/java-FULL/XLM-syml/
test.java_cl.pth  train.java_cl.0.pth  train.java_cl.2.pth  train.java_sa.1.pth  valid.java_cl.pth
test.java_sa.pth  train.java_cl.1.pth  train.java_sa.0.pth  train.java_sa.2.pth  valid.java_sa.pth
[abc@def CodeGen]$

However, in the final step I run into a problem when running create_self_training_dataset.sh. According to the output below, every .sa.tok file in the selected_functions folder is empty.

Repository root: .
python codegen_sources/test_generation/select_java_inputs.py --local True --input_path /home/xyz/CodeGen-data/java-FULL/ --output_path /home/xyz/CodeGen-data/dataset//selected_functions/ --rerun True
adding /project/6001889/xyz/CodeGen to path
adding to path /project/6001884/xyz/CodeGen
########## Selecting input functions ##########
100%|██████████| 500/500 [10:08:19<00:00, 73.00s/it] 
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000000.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000001.sa.tok
...
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000497.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000498.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000499.sa.tok
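To confirm that every output file really is empty (and not just the ones shown in the truncated log), a small check like the following can help. This is only a sketch: `summarize_tok_files` is a hypothetical helper I wrote for this purpose, not part of CodeGen, and it assumes the outputs match the `*.sa.tok` pattern seen in the log.

```python
from pathlib import Path


def summarize_tok_files(folder, pattern="*.sa.tok"):
    """Split files matching `pattern` in `folder` into empty and non-empty.

    Hypothetical helper, not part of the CodeGen repository.
    Returns two sorted lists of file names: (empty, non_empty).
    """
    empty, non_empty = [], []
    for path in sorted(Path(folder).glob(pattern)):
        # A zero-byte file corresponds to "Writing 0 lines to ..." in the log.
        if path.stat().st_size == 0:
            empty.append(path.name)
        else:
            non_empty.append(path.name)
    return empty, non_empty
```

Calling `summarize_tok_files("/home/xyz/CodeGen-data/dataset/selected_functions")` and comparing the lengths of the two returned lists shows whether any shard produced output at all.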

While debugging, I found that is_simple_standalone_func(func), at line 67 (Link), returns False for every Java function. As a result, the mask built at line 114 in select_functions(funcpath) is an all-False list, so no function is selected. Please suggest what to do in this case.
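For reference, this is roughly how I inspected the mask. It is a minimal sketch under stated assumptions: `diagnose_mask` and `toy_predicate` are hypothetical names of my own, and `toy_predicate` is only a stand-in so the snippet runs by itself. It does not reproduce the logic of the repository's is_simple_standalone_func.

```python
def diagnose_mask(functions, predicate):
    """Apply `predicate` to each function string and report what survives.

    Hypothetical debugging helper. Returns (mask, kept), where `mask` is a
    list of booleans and `kept` holds the functions whose mask entry is True.
    """
    mask = [bool(predicate(func)) for func in functions]
    kept = [func for func, keep in zip(functions, mask) if keep]
    return mask, kept


def toy_predicate(func):
    """Stand-in filter (assumption): keep static functions that avoid `this`."""
    tokens = func.split()
    return "static" in tokens and "this" not in tokens


if __name__ == "__main__":
    samples = [
        "public static int add ( int a , int b ) { return a + b ; }",
        "public int get ( ) { return this . x ; }",
    ]
    mask, kept = diagnose_mask(samples, toy_predicate)
    print(f"kept {len(kept)} of {len(samples)} functions; mask = {mask}")
```

With the real predicate passed in place of `toy_predicate`, an all-False mask means `kept` is empty, which matches the "Writing 0 lines" messages in the log.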

Also, it would be great if the authors could release the training dataset of 135,000 parallel functions between Java, Python, and C++ (as mentioned in the paper) via a shareable link.
