
Fix issue with duplicate contents in one document list #1256

Merged
3 commits merged on Jan 8, 2025

Conversation

Tesla2000
Copy link
Contributor

If 2 documents in document_list have identical contents, they both get added to the database because they are not yet present in the DB when the self.vector_db.doc_exists(document) filter is applied. Added an additional filter to fix that.

@ashpreetbedi
Copy link
Contributor

thanks @Tesla2000 , testing now

@jacobweiss2305
Copy link
Contributor

@Tesla2000 can you add this to AgentKnowledge?

Also, I added logger statements and it's not working as I would expect.

        logger.debug("Loading knowledge base")
        num_documents = 0
        for document_list in self.document_lists:
            documents_to_load = document_list
            # Upsert documents if upsert is True and vector db supports upsert
            if upsert and self.vector_db.upsert_available():
                self.vector_db.upsert(documents=documents_to_load, filters=filters)
                logger.debug(f"Upserted {len(documents_to_load)} documents")
            # Insert documents
            else:
                # Filter out documents which already exist in the vector db
                if skip_existing:
                    logger.debug(f"Start of {len(document_list)} documents")
                    # Deduplicate by content within this batch (dict keeps one document per content)
                    document_list = {document.content: document for document in document_list}.values()
                    logger.debug(f"Filtering {len(document_list)} documents")
                    documents_to_load = [
                        document for document in document_list if not self.vector_db.doc_exists(document)
                    ]

                logger.debug(f"Inserting {len(documents_to_load)} documents")
                self.vector_db.insert(documents=documents_to_load, filters=filters)
            num_documents += len(documents_to_load)
            logger.debug(f"Added {len(documents_to_load)} documents to knowledge base")
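For reference, the dict-comprehension deduplication used above can be shown in isolation. A minimal sketch, where `Doc` is a stand-in for phidata's `Document` class (only the `content` attribute matters here):

```python
class Doc:
    """Stand-in for phidata's Document; only `content` is relevant."""
    def __init__(self, content: str):
        self.content = content

docs = [Doc("pad thai"), Doc("green curry"), Doc("pad thai")]

# Keyed by content: a later duplicate overwrites the earlier entry,
# and dicts preserve key insertion order (Python 3.7+), so the
# original ordering of distinct contents is kept.
unique = list({d.content: d for d in docs}.values())

print([d.content for d in unique])  # ['pad thai', 'green curry']
```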

Could you provide a way to test this?

I was using:
phidata\cookbook\agents\agent_with_storage.py

And added duplicate content:

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"
knowledge_base = PDFUrlKnowledgeBase(
    urls=[
        "https://phi-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
        "https://phi-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    ],
    vector_db=PgVector(table_name="recipes", db_url=db_url, search_type=SearchType.hybrid),
)

@Tesla2000
Copy link
Contributor Author

Content would be duplicated if you had 2 documents with the same contents in a single file, not 2 identical files.
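The failure mode can be reproduced without a real vector db. A minimal sketch using a hypothetical in-memory `FakeVectorDB` (not a phidata class, just enough to show the filter-then-insert race within one batch):

```python
class Doc:
    def __init__(self, content: str):
        self.content = content

class FakeVectorDB:
    """Hypothetical in-memory stand-in for PgVector."""
    def __init__(self):
        self.rows = []
    def doc_exists(self, doc) -> bool:
        return any(r.content == doc.content for r in self.rows)
    def insert(self, documents):
        self.rows.extend(documents)

# Duplicates within ONE document_list, as in the reported bug.
docs = [Doc("same chunk"), Doc("same chunk")]

# Old behaviour: both pass doc_exists because the DB is still
# empty when the filter runs, so both get inserted.
db = FakeVectorDB()
db.insert([d for d in docs if not db.doc_exists(d)])
print(len(db.rows))  # 2 -- duplicate rows

# With the PR's dedup-by-content step applied before the filter:
db = FakeVectorDB()
deduped = {d.content: d for d in docs}.values()
db.insert([d for d in deduped if not db.doc_exists(d)])
print(len(db.rows))  # 1
```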

@ysolanky ysolanky merged commit 9ac6594 into phidatahq:main Jan 8, 2025