Does CodeQwen1.5 support cross-file (repository-level) infilling? #25

Closed
Lanyu123 opened this issue Apr 23, 2024 · 20 comments
@Lanyu123

          https://github.com/QwenLM/CodeQwen1.5?tab=readme-ov-file#3-repository-level-code-completion

Originally posted by @huybery in #24 (comment)

@Lanyu123 Lanyu123 changed the title https://github.com/QwenLM/CodeQwen1.5?tab=readme-ov-file#3-repository-level-code-completion Does CodeQwen1.5 support cross-file (repository-level) infilling? Apr 23, 2024
@Lanyu123
Author

image

@Lanyu123
Author

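Hello, I may not have described my question clearly. The link you gave covers repository-level completion, where the file being completed has only its first half and no second half, so it is not infilling. What I am looking for is infilling on the current file in a repository-level setting: the model should take into account both the cross-file content and the code before and after the gap in the current file, that is, repository-level plus infilling. How should the prompt be constructed for this? The documentation gives no example, so I tried building the prompt as follows: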

input_text = """<fim_prefix><reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)
    
    # Set up a student
    student = Student("Alice", "S1")
    
    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")
        
    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")
    
    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

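But the model does not seem to work with it. Does the model support this repository-level plus infilling mode, and how should I build the prompt? I would appreciate a reply. Thank you very much!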

@mechigonft

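Indeed, I would like to ask the same question. Plain completion only has the preceding context and no following context, whereas fill-in-the-middle mode predicts code from both the preceding and the following context.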

@cyente cyente self-assigned this Apr 23, 2024
@cyente
Collaborator

cyente commented Apr 23, 2024

The model does support the cross-file infilling format; we will add this case to the examples later.

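As for the format, <fim_prefix> marks the preceding context of the file to be infilled, so it belongs right before that file's content, not at the start of the whole prompt. The format is as follows: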

input_text = """<reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
<fim_prefix>from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)
    
    # Set up a student
    student = Student("Alice", "S1")
    
    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")
        
    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")
    
    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

The model is expected to generate the following:

Generated text:     book = library.find_book("1234567890")
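
The layout in the reply above can also be assembled programmatically. The following is a minimal sketch, not code from the CodeQwen1.5 repository: the helper name `build_repo_fim_prompt` and its parameters are hypothetical, and the whitespace around the special tokens is copied from the example prompt in this thread rather than from any specification.

```python
from typing import List, Tuple


def build_repo_fim_prompt(
    repo_name: str,
    context_files: List[Tuple[str, str]],
    target_path: str,
    prefix: str,
    suffix: str,
) -> str:
    """Assemble a repository-level fill-in-the-middle prompt.

    Hypothetical helper: it mirrors the layout of the example prompt in
    this thread, where <fim_prefix> is placed right after the path line
    of the file being completed, not at the start of the whole prompt.
    """
    if not target_path:
        raise ValueError("target_path must not be empty")
    parts = [f"<reponame>{repo_name}\n"]
    # Context files are included in full, each introduced by <file_sep>.
    for path, content in context_files:
        parts.append(f"<file_sep>{path}\n{content}\n")
    # The target file: path line first, then the FIM markers around the gap.
    parts.append(f"<file_sep>{target_path}\n")
    parts.append(f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>")
    return "".join(parts)


prompt = build_repo_fim_prompt(
    repo_name="library-system",
    context_files=[("library.py", "class Library:\n    pass\n")],
    target_path="main.py",
    prefix="from library import Library\n\n# Student borrows a book",
    suffix="\n    if student.borrow_book(book, library):\n        pass\n",
)
```

Keeping the context files outside the `<fim_prefix>` span is the only difference from the prompt in the original question; everything else is plain string concatenation.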

@Lanyu123
Author

Got it, thanks for the clarification.

@cyente cyente closed this as completed Apr 23, 2024
@mechigonft

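May I ask whether you support generating the code segment between two comments? For example:
<fim_prefix>
// comment 1
<fim_suffix>
// comment 2
<fim_middle>
In that case, would the model generate only the code that follows comment 1, rather than generating all the way to the end of the method?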

@cyente
Collaborator

cyente commented Apr 23, 2024

In principle, any input that follows the FIM structure is supported; how well it works in practice needs to be tested.

@mechigonft

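@cyente Hello, I just tested the case I mentioned, generating the code between two comments. The result has a good side and a bad side. First, here is my generated output: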
image
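The good side: the generated query line is very good and high quality, and can basically be used as is: List couponInstanceList = couponInstanceDAO.getCouponByAccountNo(listCouponBySpec.getMerchantId(), distributeSource, CouponStatusEnum.UN_USE.getStatus());
The bad side: I expected the code between the two comments, that is, only the "query" logic. Unexpectedly, the model also returned the code for the second comment, the "validation" code.
From this I observe that the model tends to generate code from fim_prefix to the end of the whole method, instead of recognizing that I only want it to generate up to the position of the fim_suffix I specified. As the generated code shows, the model does a lot of "extra work".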
image

@mechigonft

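In fact the generated code is very long and contains a lot of extra work, such as validation and conversion, which is not what I expect. I only want the model to generate the code up to just before the "validation" comment.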

@mechigonft

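The red box is the code I expect to be generated; the blue box is the extra work the model did: validation, conversion, other queries, and so on. In the example you gave above, the expected model output is:
Generated text: book = library.find_book("1234567890")
That is only one line of code, whereas in my tests the model tends to generate a great deal of code, sometimes exceeding max tokens and getting truncated.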
image

@cyente
Collaborator

cyente commented Apr 23, 2024

Looking at the example in the screenshot above, shouldn't the following context after <fim_suffix> contain something besides the comment?

My guess is that filling in the content after the suffix will fix the problem you describe, where the model keeps generating extra content.

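If that still does not work, I suggest some post-processing; for example, keeping only the first line of the output would meet the need.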
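
The post-processing suggested here could look roughly like the sketch below. The function name `keep_first_lines` and the choice to skip blank lines are assumptions made for illustration; the thread itself only says to keep the first line.

```python
def keep_first_lines(generated: str, max_lines: int = 1) -> str:
    """Keep only the first `max_lines` non-blank lines of a completion.

    Hypothetical helper: blank lines are skipped so that a completion
    starting with a newline still yields its first line of code.
    """
    if max_lines <= 0:
        return ""
    kept = []
    for line in generated.splitlines():
        if not line.strip():
            continue  # ignore blank lines, e.g. a leading newline
        kept.append(line)
        if len(kept) == max_lines:
            break
    return "\n".join(kept)


raw = '\n    book = library.find_book("1234567890")\n    extra = 1\n'
first = keep_first_lines(raw)
```

Indentation is preserved, so the kept line can be spliced directly between the prefix and the suffix.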

@cyente
Collaborator

cyente commented Apr 23, 2024

Also, there is the parameter that controls the maximum generation length:

sampling_params = SamplingParams(temperature=xx, top_p=xx, repetition_penalty=xxx, max_tokens=256)

@mechigonft

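@cyente Hello. The thing is, between fim_suffix and fim_middle there really is only a single comment and no other code, and there should not be any, because I am simulating how a real programmer writes code: first write the comment skeleton, then fill in the code for each comment, until the whole piece of code is complete. So I want the model to generate the code between the two comments, that is, the code for the first comment, stopping at the second comment.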

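Your suggestion of truncating to the first n lines is feasible, but it is an engineering workaround that is rather rigid and inflexible.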
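
A somewhat less rigid variant of the truncation idea is to cut the completion at the point where it starts repeating the suffix, instead of at a fixed line count. This is only a sketch under an assumption nobody in this thread has confirmed: that the model reproduces the first line of the suffix (here the second comment) before continuing. The helper name `cut_at_suffix` is hypothetical.

```python
def cut_at_suffix(generated: str, suffix: str) -> str:
    """Drop everything from the line where the completion repeats the suffix.

    Hypothetical helper: the first non-blank line of the suffix is used
    as a marker, and lines are compared with surrounding whitespace removed.
    """
    marker = next((ln.strip() for ln in suffix.splitlines() if ln.strip()), "")
    if not marker:
        return generated  # nothing to match against, leave output unchanged
    kept = []
    for line in generated.splitlines():
        if line.strip() == marker:
            break  # the model has reached the suffix; stop here
        kept.append(line)
    return "\n".join(kept)


generated = "    rows = dao.query(account)\n    // comment 2\n    validate(rows)\n"
suffix = "\n    // comment 2\n"
middle = cut_at_suffix(generated, suffix)
```

If the marker never appears in the output, all generated lines are kept, so this would still need a length limit as a fallback.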

@cyente
Collaborator

cyente commented Apr 23, 2024

Your second comment is an explicit instruction with no code after it, which may have confused the model. It is worth experimenting further.

@mechigonft

Thanks for the clarification 🙏

@mechigonft

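Does fill-in-the-middle inference support adding an instruction? For example, I would add: "Please refer to the code context and generate only the code between the two comments."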

@mechigonft

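My understanding is that fill-in-the-middle is not a conversational inference mode; rather, a backend script parses the code into the fill-in-the-middle format. Instruction/prompt inference, on the other hand, is conversational. Can the two be combined?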

@mechigonft

If they can be combined, could I use an instruction to tell the model not to generate too much code and to stop at the next comment?

@cyente
Collaborator

cyente commented Apr 23, 2024

For conversational use, we recommend Qwen/CodeQwen1.5-7B-Chat.

@mechigonft

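What I mean is this: prompt = instruction + fill-in-the-middle
prompt:
Please refer to the code context I provide and generate only the code between the two comments; do not generate any extra code.
<fim_prefix>
// comment 1
<fim_suffix>
// comment 2
<fim_middle>

Can the instruction mode and the fill-in-the-middle mode be combined like this?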
