
On the Applicability of Code Language Models to Scientific Computing Programs

Benchmark Information

SC-CODE

./SC-CODE.zip contains the NL-code pairs extracted from the collected source code files, which support both code generation and code summarization. It also contains API information extracted from the official documentation, which can serve as a knowledge base.

  • ./SC-CODE/NL_code

    • We extract method-comment pairs based on heuristic rules, and clean the dataset following the FSE'22 paper Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization.

    • Data format: method-NL pairs are saved in JSONL files (see the loading sketch after this list).

      $lang/$split.jsonl
          {
              path: the path to the original file
              language: the programming language
              code: the source code of the function in the original file
              code_tokens: tokenized version of `code`
              docstring: the comment or docstring
              docstring_tokens: tokenized version of `docstring`
              partition: train/val/test
          }
      
  • ./SC-CODE/code_doc

    • Data format: API names and descriptions are saved in JSON files.

      $lang_doc.json
          {
              name: API name
              code_string: API usage example code
              des_string: description of the API
          }
      
  • All the data extraction and pre-processing code is available at: ./SC-CODE/DataPreprocessing/
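
For orientation, here is a minimal loading sketch for the two resources above. The concrete paths ("julia" as the language, a per-language subdirectory under NL_code) are assumptions following the $lang/$split.jsonl and $lang_doc.json patterns, and the code_doc file is assumed to hold a list of objects:

    import json

    # NL-code pairs: one JSON object per line ($lang/$split.jsonl)
    with open("SC-CODE/NL_code/julia/train.jsonl") as f:
        pairs = [json.loads(line) for line in f]
    print(pairs[0]["docstring"])
    print(pairs[0]["code"][:120])

    # API knowledge base ($lang_doc.json), assumed to be a list of
    # {name, code_string, des_string} objects
    with open("SC-CODE/code_doc/julia_doc.json") as f:
        docs = json.load(f)
    api_descriptions = {d["name"]: d["des_string"] for d in docs}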

SC-API-CODE

./SC-API-CODE.zip contains the NL-code pairs from SC-CODE that include rich API calls related to computational mathematics. It also contains the API documentation belonging to libraries for mathematical computation and graphics.

  • ./SC-API-CODE/NL_code

    • We extract the API call statements in the code using tree-sitter and save their detailed information.

    • Data format: method-NL pairs are saved in JSONL files.

      $lang.jsonl
      {
          path: the path to the original file
          language: the programming language
          code: the source code of the function in the original file
          docstring: the comment or docstring
          partition: train/val/test in SC-CODE
          function_call: detailed information about the extracted API calls
          line_with_function_call: number of lines containing API calls
          function_call_in_doc: detailed information about the extracted API calls that can be found in the corresponding documentation
          num_of_func_call_in_doc: number of lines containing API calls that can be found in the corresponding documentation
      }
      
  • ./SC-API-CODE/code_doc

    • Data format: API names and descriptions are saved in JSONL files.

      $lang_doc.jsonl
      {
          name: API name
          code_string: API usage example code
          des_string: description of the API
          category: library name
      }
      
  • ./SC-API-CODE/completion_data

    • We provide the data used for the code completion task (a prompt-construction sketch follows this list).

    • Data format: both line-level and API-level code completion data are saved in JSONL files.

      $lang-completion-line.jsonl
      {
      	 path: the path to the original file
      	 language: the programming language
      	 input: the input for the LLM
      	 completion: the ground truth completion
      	 docstring: the comment or docstring
      }
      
      $lang-completion-api.jsonl
      {
      	 path: the path to the original file
      	 language: the programming language
      	 input_prefix: the prefix of the input for the LLM
      	 api: the ground-truth API call to be completed
      	 input_suffix: the suffix of the input for the LLM
      	 docstring: the comment or docstring
      }
      
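To make the two completion formats concrete, the sketch below turns one API-level record into model inputs. The file path is illustrative, and the infilling sentinel follows InCoder's convention as an example; it is not part of the dataset itself:

    import json

    # read the first API-level completion record (illustrative path)
    with open("SC-API-CODE/completion_data/julia-completion-api.jsonl") as f:
        record = json.loads(f.readline())

    # left-to-right models see only the prefix and must generate the API call
    l2r_prompt = record["input_prefix"]

    # infilling models (e.g., InCoder) can also condition on the suffix
    fim_prompt = record["input_prefix"] + "<|mask:0|>" + record["input_suffix"]

    print("ground truth:", record["api"])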

Data for Semantic Correctness

./data_for_semantic_correctness.zip contains high-quality, computation-related NL-code pairs with test cases for evaluating the semantic correctness of generated code. It also contains the evaluation script.

  • ./data_for_semantic_correctness/testcases_data

    • We carefully constructed two test cases for each sample for the semantic correctness evaluation.

    • Data format: NL-code pairs and test cases are saved in JSONL files.

      $lang.jsonl
      {
          task_id: unique ID
          prompt: the natural language comment and function signature, used as the prompt during evaluation
          entry_point: function name
          reference_solution: ground truth
          test: test cases
      }
      
  • The automated evaluation script is available at ./data_for_semantic_correctness/evaluation.py. Running it requires working Julia, MATLAB, and R environments.
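
Conceptually, the evaluation concatenates the prompt, a model completion, and the test cases into a script and executes it with the target language's interpreter; a zero exit status counts as semantically correct. The sketch below illustrates this idea for Julia and is an approximation, not the actual contents of evaluation.py:

    import json
    import subprocess
    import tempfile

    with open("data_for_semantic_correctness/testcases_data/julia.jsonl") as f:
        sample = json.loads(f.readline())

    # stand-in for a model's output; a real run would use the generated body
    completion = sample["reference_solution"]

    program = sample["prompt"] + completion + "\n" + sample["test"]
    with tempfile.NamedTemporaryFile("w", suffix=".jl", delete=False) as f:
        f.write(program)
        script = f.name

    result = subprocess.run(["julia", script], capture_output=True, timeout=60)
    print("PASS" if result.returncode == 0 else "FAIL")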

Raw Data

./raw_data.zip contains all the code files collected from GitHub repositories, together with the reference material collected from the official documentation of the three languages.

  • ./raw_data/github_data contains all the source code files from the collected GitHub repos.

                   Julia   MATLAB     R
    # of repos       619      506   542
    • Data format: the code corpus is saved in JSON files.

      $lang_data.json
      {
          path: the source file path
          code_string: code string in the file
      }
      
  • ./raw_data/documentation contains the documentation information, including the description, API name, signature, parameters, usage examples (if available), etc.

                              Julia   MATLAB       R
    # of documentation        2,211    4,142   6,851
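
A short sketch for iterating over the raw corpus; the path and the assumption that each file holds a list of {path, code_string} records follow the schema above:

    import json

    with open("raw_data/github_data/julia_data.json") as f:
        files = json.load(f)

    total_lines = sum(rec["code_string"].count("\n") + 1 for rec in files)
    print(f"{len(files)} source files, ~{total_lines} lines of Julia code")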

Model Fine-tuning

Code Summarization

  • CodeBERT

    Following CodeXGLUE, we fine-tune CodeBERT on our code summarization dataset using their provided scripts.

  • CodeT5

    Following CodeT5, we fine-tune CodeT5 on our code summarization dataset using their provided scripts.
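
The sketch below is an illustrative stand-in for those recipes in plain Hugging Face Transformers, not the authors' exact pipeline; the data path follows the $lang/$split.jsonl pattern and the hyperparameters are placeholders:

    import json
    from datasets import Dataset
    from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments,
                              T5ForConditionalGeneration)

    tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

    def load_jsonl(path):
        with open(path) as f:
            return [json.loads(line) for line in f]

    def preprocess(batch):
        # summarization: code in, docstring out
        inputs = tok(batch["code"], truncation=True, max_length=256)
        inputs["labels"] = tok(batch["docstring"], truncation=True,
                               max_length=128)["input_ids"]
        return inputs

    train = Dataset.from_list(load_jsonl("SC-CODE/NL_code/julia/train.jsonl"))
    train = train.map(preprocess, batched=True, remove_columns=train.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="codet5-sc-summ",
                                      per_device_train_batch_size=8,
                                      learning_rate=5e-5,
                                      num_train_epochs=10),
        train_dataset=train,
        data_collator=DataCollatorForSeq2Seq(tok, model=model),
    )
    trainer.train()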

Code Generation

  • CodeGPT

    Following CodeXGLUE, we fine-tune CodeGPT on our code generation dataset using their provided scripts.

  • CodeT5

    Following CodeT5, we fine-tune CodeT5 on our code generation dataset using their provided scripts.
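
For the decoder-only setting, a plain-Transformers stand-in (model name and data path are illustrative): the NL docstring and the code are concatenated into one sequence and trained with a causal language modeling objective:

    import json
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tok = AutoTokenizer.from_pretrained("microsoft/CodeGPT-small-py")
    tok.pad_token = tok.eos_token  # GPT-2 tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained("microsoft/CodeGPT-small-py")

    def load_jsonl(path):
        with open(path) as f:
            return [json.loads(line) for line in f]

    def preprocess(batch):
        # docstring followed by code: the model learns to continue the NL
        # description with an implementation
        text = [d + "\n" + c + tok.eos_token
                for d, c in zip(batch["docstring"], batch["code"])]
        return tok(text, truncation=True, max_length=512)

    train = Dataset.from_list(load_jsonl("SC-CODE/NL_code/r/train.jsonl"))
    train = train.map(preprocess, batched=True, remove_columns=train.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="codegpt-sc-gen",
                               per_device_train_batch_size=4,
                               num_train_epochs=10),
        train_dataset=train,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()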

Parameter-efficient Tuning

Following the documentation of adapter-transformers, we insert adapters into the studied pre-trained CLMs (CodeT5) and train only the adapter parameters.
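
A minimal sketch assuming the adapter-transformers package (pip install adapter-transformers); the exact class and method names vary across versions of the library:

    from transformers.adapters import AutoAdapterModel

    model = AutoAdapterModel.from_pretrained("Salesforce/codet5-base")
    model.add_adapter("sc_code")          # insert bottleneck adapters into each layer
    model.train_adapter("sc_code")        # freeze the backbone; only adapters get gradients
    model.set_active_adapters("sc_code")
    # training then proceeds as in the full fine-tuning sketches above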

Zero/few-shot learning

The template NL-code pairs are available at ./zero&few-shot-test/template_prompt.

  • InCoder: Our implementation is based on the scripts provided by their GitHub repository.

  • CodeGen: Our implementation is based on the scripts provided by their GitHub repository.

  • StarCoder: Our implementation is based on the API provided by Hugging Face.

  • Code Llama: Our implementation is based on the API provided by Hugging Face.

  • OpenAI GPT-3.5 & GPT-4: Our implementation is based on the example scripts offered by OpenAI.

The implementation scripts of the above models are available at: ./zero&few-shot-test/
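
As one concrete example, a minimal few-shot call for the GPT models, assuming the OpenAI Python SDK v1 and an illustrative template file name (the repository's own scripts are the reference implementation):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # hypothetical template file holding the few-shot NL-code examples
    few_shot = open("zero&few-shot-test/template_prompt/julia.txt").read()
    task = ("# Compute the Frobenius norm of a matrix A.\n"
            "function frobenius_norm(A)")

    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Complete the Julia function."},
            {"role": "user", "content": few_shot + "\n\n" + task},
        ],
        temperature=0.2,
    )
    print(resp.choices[0].message.content)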
