Update docs
DL committed Oct 8, 2024
1 parent f8b1350 commit d2453ba
Showing 4 changed files with 18 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy
* Other common formats are supported by `Unstructured` pre-processor:
* List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Supports multiple collections of documents, and filtering of results by collection.

* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -19,6 +19,8 @@ Features
* Other common formats are supported by `Unstructured` pre-processor:
* List of formats https://unstructured-io.github.io/unstructured/core/partition.html

* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.

* Supports multiple collections of documents, and filtering of results by collection.

* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
13 changes: 13 additions & 0 deletions sample_templates/generic/config_template.yaml
@@ -31,6 +31,19 @@ embeddings:
merge_sections: False # Merge # headings if possible, can be turned on and off depending on document structure
remove_images: True # Remove image links

# Optional setting
pdf_table_parser: gmft # azuredoc

# Optional setting
pdf_image_parser:
image_parser: gemini-1.5-pro # gemini-1.5-flash
system_instructions: |
You are a research assistant. You analyze the image to extract detailed information. Response must be a Markdown string in the following format:
- First line is a heading with image caption, starting with '# '
- Second line is empty
- From the third line on - detailed data points and related metadata, extracted from the image, in Markdown format. Don't use Markdown tables.
passage_prefix: "passage: " # Often, specific prefix needs to be included in the source text, for embedding models to work properly
label: "documment-collection-1" # Add a label to the current collection
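
The `system_instructions` prompt above asks the image parser for a three-part Markdown layout: a `# ` heading, an empty second line, then the extracted data without Markdown tables. A minimal sketch of how a response could be checked against that layout follows. The function name `follows_image_prompt_format` is hypothetical and not part of this package, and the table check is a simple heuristic (it only looks for lines starting with `|`).

```python
def follows_image_prompt_format(response: str) -> bool:
    """Check a model response against the layout requested in system_instructions.

    Hypothetical helper for illustration; it is not part of the package.
    """
    lines = response.split("\n")
    if len(lines) < 3:
        return False

    # First line: heading with the image caption, starting with '# '.
    has_heading = lines[0].startswith("# ") and lines[0][2:].strip() != ""

    # Second line: must be empty.
    second_line_empty = lines[1].strip() == ""

    # Third line onward: at least one non-empty line of extracted data.
    body = [line for line in lines[2:] if line.strip()]
    has_body = len(body) > 0

    # Heuristic only: treat any body line starting with '|' as a Markdown table row.
    has_table = any(line.lstrip().startswith("|") for line in body)

    return has_heading and second_line_empty and has_body and not has_table
```

A check like this could run before a parsed image description is embedded, so that malformed responses are retried or skipped rather than indexed.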

2 changes: 1 addition & 1 deletion sample_templates/test-templates/pdf_library.yaml
@@ -17,7 +17,7 @@ embeddings:
- epub
- md
- pdf
- pdf_table_parser: azuredoc # gmft
+ pdf_table_parser: azuredoc # gmft # azuredoc # gmft
# pdf_image_parser:
# image_parser: gemini-1.5-pro # gemini-1.5-flash
# system_instructions: |
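
Both templates switch `pdf_table_parser` between `gmft` and `azuredoc`, and `image_parser` between `gemini-1.5-pro` and `gemini-1.5-flash`. As a rough sketch, a loaded settings dictionary could be checked against those values before indexing starts. The helper `check_pdf_settings` and the two sets below are assumptions built only from the values named in this diff; the package may accept other values and may validate its config differently.

```python
# Values taken only from the comments in this diff; the real package may allow more.
KNOWN_TABLE_PARSERS = {"gmft", "azuredoc"}
KNOWN_IMAGE_PARSERS = {"gemini-1.5-pro", "gemini-1.5-flash"}


def check_pdf_settings(settings: dict) -> list[str]:
    """Return a list of problems found in the optional PDF parser settings.

    Hypothetical helper for illustration; both settings are optional,
    so a missing key is not reported as a problem.
    """
    problems = []

    table_parser = settings.get("pdf_table_parser")
    if table_parser is not None and table_parser not in KNOWN_TABLE_PARSERS:
        problems.append(f"unknown pdf_table_parser: {table_parser!r}")

    image_settings = settings.get("pdf_image_parser")
    if image_settings is not None:
        image_parser = image_settings.get("image_parser")
        if image_parser not in KNOWN_IMAGE_PARSERS:
            problems.append(f"unknown image_parser: {image_parser!r}")

    return problems
```

For example, `check_pdf_settings({"pdf_table_parser": "azuredoc"})` returns an empty list, while an unrecognized parser name yields one message per offending key.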