[Question]: Regarding include metadata on document parsed using llamaparse #17609

arun13ak · 2025-01-24T07:06:04Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

When I run this code:

```
from llama_parse import LlamaParse

vanilaParsing = LlamaParse(result_type="markdown").load_data(
    "./expense_report_document.pdf"
)
```

each returned document has an empty dictionary as its metadata.

arun13ak · 2025-01-24T07:06:38Z

@dosu How can we add metadata when parsing a document with LlamaParse?

dosubot · 2025-01-24T07:08:44Z

To add metadata when parsing a document using LlamaParse, ensure that the metadata is correctly set up in the Document object and that the extractors are properly configured to populate the metadata fields. Here's how you can do it:

Define the Document with Metadata: When creating a Document, include metadata directly in the constructor. This metadata will be available for extraction and processing.

```
from llama_index.core import Document

document = Document(
    text="sample text",
    metadata={"filename": "README.md", "category": "codebase"},
)
```

Use Extractors to Populate Metadata: Ensure that your extractors are set up to extract and populate metadata fields. The extractors should process the document text and update the metadata.

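```
from typing import List

from llama_index.core.extractors import (
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import run_transformations
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TransformComponent

# Each extractor calls an LLM, so one must be configured (for example via Settings.llm).
extractors: List[TransformComponent] = [
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
    SummaryExtractor(summaries=["prev", "self"]),
    KeywordExtractor(keywords=10),
]

# Split the document into nodes first, then run every extractor over them.
node_parser: TransformComponent = SentenceSplitter()

nodes = run_transformations([document], [node_parser, *extractors])

assert "document_title" in nodes[0].metadata
assert "questions_this_excerpt_can_answer" in nodes[0].metadata
assert "section_summary" in nodes[0].metadata
assert "excerpt_keywords" in nodes[0].metadata
```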

Check Metadata Configuration: If the metadata is still empty, verify that:
- The document text is suitable for extraction.
- The extractors are correctly configured and capable of extracting the desired metadata.
- `run_transformations` is actually called with these extractors, so that they run and update the metadata.
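
To narrow down which of these checks is failing, it can help to list the expected keys that are missing from each node. The following is a minimal sketch, assuming only that each node exposes a `metadata` dict. `missing_metadata_keys` is a hypothetical helper, not a library function, and `SimpleNamespace` objects stand in for real nodes so the snippet runs on its own:

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def missing_metadata_keys(nodes: Iterable[Any], expected: List[str]) -> Dict[int, List[str]]:
    """Map node index -> expected keys that are absent from that node's metadata."""
    report: Dict[int, List[str]] = {}
    for index, node in enumerate(nodes):
        # Treat a missing or None metadata attribute like an empty dict.
        metadata = getattr(node, "metadata", None) or {}
        missing = [key for key in expected if key not in metadata]
        if missing:
            report[index] = missing
    return report


# Stand-in nodes (hypothetical data): the first is fully populated, the second is empty.
fake_nodes = [
    SimpleNamespace(metadata={"document_title": "Expenses", "excerpt_keywords": "travel"}),
    SimpleNamespace(metadata={}),
]
expected_keys = ["document_title", "excerpt_keywords"]
report = missing_metadata_keys(fake_nodes, expected_keys)
print(report)
```

An empty report means every node carries all expected keys; otherwise the indices point at the nodes an extractor skipped or failed on.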

Advanced Metadata Customization: You can control which metadata keys are visible to the LLM and to the embedding model. For example, to hide `file_name` from both:
```
document.excluded_llm_metadata_keys = ["file_name"]
document.excluded_embed_metadata_keys = ["file_name"]
```
Use the get_content() function to test what the LLM or embedding model will read:
```
from llama_index.core.schema import MetadataMode

print(document.get_content(metadata_mode=MetadataMode.LLM))
```

With these steps in place and the extractors configured correctly, the metadata should be populated as expected when parsing documents with LlamaParse ^[1]^[2].
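
For the original question specifically, one option is to attach metadata after parsing, since `load_data` returns a list of `Document` objects and each one carries a `metadata` dict. The following is a minimal sketch under that assumption. `attach_metadata` and the `category` key are hypothetical, and `SimpleNamespace` objects stand in for parsed documents so the snippet runs without an API key:

```python
import os
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def attach_metadata(documents: Iterable[Any], file_path: str, extra: Dict[str, Any]) -> List[Any]:
    """Add the file name plus caller-supplied keys to each document's metadata, in place."""
    file_info = {"file_name": os.path.basename(file_path)}
    updated = []
    for doc in documents:
        # Caller-supplied keys win over the derived file name on conflict.
        doc.metadata.update({**file_info, **extra})
        updated.append(doc)
    return updated


# Stand-ins for parsed documents, which were reported to come back with empty metadata.
parsed = [
    SimpleNamespace(text="# Page 1", metadata={}),
    SimpleNamespace(text="# Page 2", metadata={}),
]
docs = attach_metadata(parsed, "./expense_report_document.pdf", {"category": "expense_report"})
print(docs[0].metadata)
```

With real LlamaParse output, the list returned by `load_data` would take the place of `parsed`. `load_data` may also accept an `extra_info` dict that is copied into each document's metadata; that is worth verifying against the installed `llama_parse` version before relying on it.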


arun13ak added the question (Further information is requested) label on Jan 24, 2025
