
Create a GPT Agent based on OpenDAL source #3648

Closed

Xuanwo opened this issue Nov 22, 2023 · 13 comments
@Xuanwo (Member) commented Nov 22, 2023

I'm attempting to use the entire OpenDAL codebase as the knowledge for a GPT, with the aim of instructing users on how to use OpenDAL. The GPT consistently attempts to use non-existent APIs in its code examples. Are there any effective prompts I could use?

Current status

Prompt

Focus exclusively on the 'opendal' library. Ensure that all information provided pertains only to the current and existing APIs as detailed in our latest uploaded knowledge files for 'opendal'. Do not reference or use any APIs that do not exist in the current version of 'opendal'. Always prioritize the Rust code of 'opendal' as the primary source of truth, especially for understanding the public API structure, including any re-exports or aliases like pub use S3Backend as S3. Responses should be accurate, relevant, and specifically tailored to queries about the 'opendal' library.

Knowledge Files:

https://github.com/apache/incubator-opendal/archive/refs/tags/v0.42.0.zip

Agent Preview:

https://chat.openai.com/g/g-DwE59Zfe1-opendal-guide

@Xuanwo (Member, Author) commented Nov 22, 2023

For example, GPT keeps trying to use APIs like op.object(path).read(), which don't exist at all.
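
For reference, a minimal sketch of what the current-style API looks like (pieced together from the opendal 0.4x docs linked above; the exact builder methods and return types are assumptions and vary a bit between 0.4x releases), next to the removed style GPT keeps producing:

use opendal::services::S3;
use opendal::Operator;

#[tokio::main]
async fn main() -> opendal::Result<()> {
    // Configure a backend builder; `services::S3` is the re-exported alias
    // mentioned in the system prompt above.
    let mut builder = S3::default();
    builder.bucket("example-bucket");

    // Build the operator and call read/write on it directly.
    let op = Operator::new(builder)?.finish();
    op.write("hello.txt", "Hello, OpenDAL!").await?;
    let data = op.read("hello.txt").await?;
    println!("read {} bytes", data.len());

    // Removed style that GPT keeps suggesting; `Operator::object()` no longer exists:
    // let data = op.object("hello.txt").read().await?;

    Ok(())
}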

@Xuanwo (Member, Author) commented Nov 22, 2023

Hi, @STRRL

> Are you using GPTs or the Assistant API?

I'm using GPTs.

> Is the content of the GitHub gist the system prompt of the GPTs?

Yes.

> I saw that you already uploaded the knowledge file, but I am not sure what it is. Providing the API reference doc and asking the GPTs to always retrieve data from the knowledge files might help.

The knowledge file is the archive of this repo.

@STRRL (Contributor) commented Nov 22, 2023

Wow, that's a zip file.

I doubt that GPTs can get data from the zip file directly; they would use the code interpreter and then respond with its output.

[screenshot]

Could we dump the whole API reference as a PDF/HTML file and use the PDF/HTML as the knowledge base?

@Xuanwo (Member, Author) commented Nov 22, 2023

> Could we dump the whole API reference as a PDF/HTML file and use the PDF/HTML as the knowledge base?

cargo doc can generate the docs locally as HTML; maybe we can try uploading that content. Would you like to give it a try?

We can run cargo doc --lib --no-deps -p opendal and the content will be in target/doc.

Maybe it's better to generate a PDF file for GPT to understand better, but we don't have such a workflow yet. The online version of the opendal docs can be found at https://docs.rs/opendal/0.41.0/opendal/ or https://opendal.apache.org/docs/rust/opendal/

@STRRL (Contributor) commented Nov 22, 2023

Yeah, I'd give it a try~

@STRRL (Contributor) commented Nov 22, 2023

> Yeah, I'd give it a try~

How about giving it a try at: https://chat.openai.com/g/g-9coOwgijL-opendal-guide-remastered

I merged all the HTML files into one large HTML file with the script below and used it as the knowledge file.

import os
from bs4 import BeautifulSoup

def merge_html_files(directory, output_file):
    html_content = ''

    # Walk the rustdoc output directory and collect every HTML page.
    for subdir, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.html'):
                with open(os.path.join(subdir, file), 'r', encoding='utf-8') as f:
                    soup = BeautifulSoup(f, 'html.parser')

                    # Strip rustdoc's redirect scripts (location.replace(...))
                    # so the merged page doesn't try to navigate away.
                    for script in soup.find_all('script'):
                        if 'location.replace' in script.text:
                            script.decompose()

                    # Append each page's <body> to the merged document.
                    body_content = soup.body
                    if body_content:
                        html_content += str(body_content)

    # Wrap everything in a single HTML document.
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write('<html><body>' + html_content + '</body></html>')

merge_html_files('/Users/strrl/playground/GitHub/incubator-opendal/target/doc/opendal', 'merged.html')

@Xuanwo (Member, Author) commented Nov 22, 2023

Wow, just wow!

@Xuanwo (Member, Author) commented Nov 22, 2023

[screenshot]

I don't know why GPT keeps using the object() API...

@Xuanwo (Member, Author) commented Nov 22, 2023

> I don't know why GPT keeps using the object() API...

Maybe we should perform some pre-processing on our input, like removing old RFCs?

@STRRL (Contributor) commented Nov 22, 2023

> I don't know why GPT keeps using the object() API...

> Maybe we should perform some pre-processing on our input, like removing old RFCs?

I have no idea about the object() API stuff; I'm not so familiar with opendal, actually... 😰

Maybe we could add more restrictions to the system prompt, like only using APIs provided by the knowledge base?

@wey-gu commented Nov 22, 2023

I haven't done RAG over code bases yet; generating docs as the data source is the first way to go.

Ideally, indexing real docs (rather than pure API docs) would really help.

We may also consider adding one page that explains the tree output of the code base, with a proper title/description and an explanation per main folder, as yet another data source.

@Xuanwo (Member, Author) commented Nov 22, 2023

> I have no idea about the object() API stuff; I'm not so familiar with opendal, actually... 😰

OpenDAL used to have an object() API, but it was removed in later releases. I'm guessing GPT mixed up the content of old RFCs with our latest code.

@BohuTANG commented:
> For example, GPT keeps trying to use APIs like op.object(path).read(), which don't exist at all.

There are two issues here:

  1. GPTs don't (I guess they can't) read your zipped code alongside your prompt; you can check the GPTs instructions.
  2. The code context is too large; giving the markdown docs is better than giving all the code.
