
Fix issue with duplicate contents in one document list #1256

Merged
3 commits merged on Jan 8, 2025

Conversation

Tesla2000
Copy link
Contributor

If 2 documents in document_list have identical contents, they both get added to the database because they are not yet present in the DB when the self.vector_db.doc_exists(document) filter is applied. Added an additional filter to fix that.

@ashpreetbedi
Copy link
Contributor

thanks @Tesla2000 , testing now

@jacobweiss2305
Copy link
Contributor

@Tesla2000 can you add this to AgentKnowledge?

Also, I added logger statements and it's not working as I would expect.

        logger.debug("Loading knowledge base")
        num_documents = 0
        for document_list in self.document_lists:
            documents_to_load = document_list
            # Upsert documents if upsert is True and vector db supports upsert
            if upsert and self.vector_db.upsert_available():
                self.vector_db.upsert(documents=documents_to_load, filters=filters)
                logger.debug(f"Upserted {len(documents_to_load)} documents")
            # Insert documents
            else:
                # Filter out documents which already exist in the vector db
                if skip_existing:
                    logger.debug(f"Start of {len(document_list)} documents")
                    # Deduplicate by content within this batch (dict keeps one document per content)
                    document_list = {document.content: document for document in document_list}.values()
                    logger.debug(f"Filtering {len(document_list)} documents")
                    documents_to_load = [
                        document for document in document_list if not self.vector_db.doc_exists(document)
                    ]

                logger.debug(f"Inserting {len(documents_to_load)} documents")
                self.vector_db.insert(documents=documents_to_load, filters=filters)
            num_documents += len(documents_to_load)
            logger.debug(f"Added {len(documents_to_load)} documents to knowledge base")
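For reference, the dict-comprehension deduplication used above can be shown in isolation. A minimal sketch, where `Doc` is a stand-in for phidata's `Document` class (only the `content` attribute matters here):

```python
class Doc:
    """Stand-in for phidata's Document; only `content` is relevant."""
    def __init__(self, content: str):
        self.content = content

docs = [Doc("pad thai"), Doc("green curry"), Doc("pad thai")]

# Keyed by content: a later duplicate overwrites the earlier entry,
# and dicts preserve key insertion order (Python 3.7+), so the
# original ordering of distinct contents is kept.
unique = list({d.content: d for d in docs}.values())

print([d.content for d in unique])  # ['pad thai', 'green curry']
```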

Could you provide a way to test this?

I was using:
phidata\cookbook\agents\agent_with_storage.py

And added duplicate content:

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"
knowledge_base = PDFUrlKnowledgeBase(
    urls=[
        "https://phi-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
        "https://phi-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    ],
    vector_db=PgVector(table_name="recipes", db_url=db_url, search_type=SearchType.hybrid),
)

@Tesla2000
Copy link
Contributor Author

Content would be duplicated if you had 2 documents with the same contents in a single file, not 2 identical files.
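The failure mode can be reproduced without a real vector db. A minimal sketch using a hypothetical in-memory `FakeVectorDB` (not a phidata class, just enough to show the filter-then-insert race within one batch):

```python
class Doc:
    def __init__(self, content: str):
        self.content = content

class FakeVectorDB:
    """Hypothetical in-memory stand-in for PgVector."""
    def __init__(self):
        self.rows = []
    def doc_exists(self, doc) -> bool:
        return any(r.content == doc.content for r in self.rows)
    def insert(self, documents):
        self.rows.extend(documents)

# Duplicates within ONE document_list, as in the reported bug.
docs = [Doc("same chunk"), Doc("same chunk")]

# Old behaviour: both pass doc_exists because the DB is still
# empty when the filter runs, so both get inserted.
db = FakeVectorDB()
db.insert([d for d in docs if not db.doc_exists(d)])
print(len(db.rows))  # 2 -- duplicate rows

# With the PR's dedup-by-content step applied before the filter:
db = FakeVectorDB()
deduped = {d.content: d for d in docs}.values()
db.insert([d for d in deduped if not db.doc_exists(d)])
print(len(db.rows))  # 1
```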

@ysolanky ysolanky merged commit 9ac6594 into phidatahq:main Jan 8, 2025