[Question]: Regarding include metadata on document parsed using llamaparse #17609

Open
1 task done
arun13ak opened this issue Jan 24, 2025 · 2 comments
Labels
question Further information is requested

Comments

@arun13ak

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

When I run this code:

    vanilaParsing = LlamaParse(result_type="markdown").load_data(
        "./expense_report_document.pdf"
    )

the returned documents have an empty dictionary as their metadata.

@arun13ak
Author

@dosu How can we add metadata when parsing a document with LlamaParse?


dosubot bot commented Jan 24, 2025

To add metadata when parsing a document using LlamaParse, ensure that the metadata is correctly set up in the Document object and that the extractors are properly configured to populate the metadata fields. Here's how you can do it:

  1. Define the Document with Metadata: When creating a Document, include metadata directly in the constructor. This metadata will be available for extraction and processing.

    from llama_index.core import Document
    
    document = Document(
        text="sample text",
        metadata={"filename": "README.md", "category": "codebase"},
    )
  2. Use Extractors to Populate Metadata: Ensure that your extractors are set up to extract and populate metadata fields. The extractors should process the document text and update the metadata.

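    from typing import List

    from llama_index.core.extractors import (
        KeywordExtractor,
        QuestionsAnsweredExtractor,
        SummaryExtractor,
        TitleExtractor,
    )
    from llama_index.core.ingestion import run_transformations
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.schema import TransformComponent

    # Each extractor calls an LLM, so an LLM must be configured before this runs.
    extractors: List[TransformComponent] = [
        TitleExtractor(nodes=5),
        QuestionsAnsweredExtractor(questions=3),
        SummaryExtractor(summaries=["prev", "self"]),
        KeywordExtractor(keywords=10),
    ]
    
    node_parser: TransformComponent = SentenceSplitter()
    
    # Split the document first, then let each extractor add its metadata keys.
    nodes = run_transformations([document], [node_parser, *extractors])
    
    assert "document_title" in nodes[0].metadata
    assert "questions_this_excerpt_can_answer" in nodes[0].metadata
    assert "section_summary" in nodes[0].metadata
    assert "excerpt_keywords" in nodes[0].metadata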
  3. Check Metadata Configuration: If the metadata is still empty, verify that:

    • The document text is suitable for extraction.
    • The extractors are correctly configured and capable of extracting the desired metadata.
    • run_transformations is actually called with the extractors in its transformation list, since only the transformations that run can populate metadata.
  4. Advanced Metadata Customization: Customize how metadata is included in the document for LLM and embedding models. For example, exclude certain metadata keys from being visible to the LLM or embedding model.

    document.excluded_llm_metadata_keys = ["file_name"]
    document.excluded_embed_metadata_keys = ["file_name"]

    Use the get_content() function to test what the LLM or embedding model will read:

    from llama_index.core.schema import MetadataMode
    
    print(document.get_content(metadata_mode=MetadataMode.LLM))
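    The effect of the exclusion lists in step 4 can be pictured with a small plain-Python model. This is only a simplified sketch of the idea, not llama_index's implementation: `render_for_llm` is a hypothetical helper, and the real output format of `get_content()` may differ.

```python
# Simplified stand-in for the idea behind excluded_llm_metadata_keys.
# NOT llama_index's implementation; render_for_llm is a hypothetical helper.
def render_for_llm(text, metadata, excluded_keys):
    # Keep only the metadata entries that are not on the exclusion list.
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    # Render the remaining entries as "key: value" lines above the text.
    header = "\n".join(f"{k}: {v}" for k, v in visible.items())
    return f"{header}\n\n{text}" if header else text


rendered = render_for_llm(
    "sample text",
    {"file_name": "README.md", "category": "codebase"},
    excluded_keys=["file_name"],
)
print(rendered)
```

    The excluded key stays in the metadata dictionary; it is only left out of the text handed to the model.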

By following these steps and ensuring proper configuration, the metadata should be populated as expected when parsing documents with LlamaParse [1][2].
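For the original question (empty metadata on the documents returned by `load_data`), one possible approach is to attach metadata to the parsed documents yourself after parsing. The following is a minimal sketch under stated assumptions: `attach_metadata` is a hypothetical helper, not part of LlamaParse, and `SimpleNamespace` objects stand in for the returned documents so the snippet runs without an API key. It only relies on each document exposing a mutable `metadata` dict.

```python
from pathlib import Path
from types import SimpleNamespace


def attach_metadata(documents, file_path, extra=None):
    """Hypothetical helper: add file-level metadata to each parsed document."""
    path = Path(file_path)
    for doc_index, doc in enumerate(documents):
        # Record where the document came from and its position in the result list.
        doc.metadata.update({"file_name": path.name, "doc_index": doc_index})
        # Merge any caller-supplied keys last so they can override the defaults.
        doc.metadata.update(extra or {})
    return documents


# Stand-ins for the objects returned by LlamaParse(...).load_data(...).
parsed = [
    SimpleNamespace(text="first part", metadata={}),
    SimpleNamespace(text="second part", metadata={}),
]

docs = attach_metadata(
    parsed, "./expense_report_document.pdf", extra={"category": "expenses"}
)
print(docs[0].metadata)
```

With real LlamaParse output, the same helper would be called on the list returned by `load_data`. `load_data` also appears to accept an `extra_info` dict that is copied into each document's metadata, but that is worth checking against the installed version.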
