feat: CSV loader support #29

Open
1 task done
Tachikoma000 opened this issue Sep 19, 2024 · 0 comments

Tachikoma000 commented Sep 19, 2024

  • I have looked for existing issues (including closed) about this

Feature Request: Implement CSV Loader for Document Processing

rig-core/src/loaders/csv.rs -> CsvFileLoader

Motivation

Rig users often need to work with structured data stored in CSV files, so Rig needs a way to load and process CSV documents for use in RAG systems and other document processing tasks. A CSV loader would let users bring tabular data into their NLP pipelines, making Rig useful for cases such as data analysis, information retrieval, and content summarization.

Proposal

Implement a CsvFileLoader struct (in rig-core/src/loaders/csv.rs) that implements the DocumentLoader trait. The loader should:

  1. Accept a file path to a CSV file.
  2. Parse the CSV file using the csv crate.
  3. Convert the CSV data into a format suitable for embedding and further processing within Rig.
  4. Handle potential errors such as file not found, parsing errors, or invalid CSV structures.
  5. Provide options for customization, such as specifying delimiters or handling headers.

The implementation should focus on converting CSV data into a single document for embedding, with each row formatted as "header: value" pairs, separated by newlines.
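A minimal sketch of that formatting step, assuming the headers and records have already been parsed (for example by the csv crate). `rows_to_document` is a hypothetical helper name, and the layout is one possible reading of the description: the pairs of a row are joined with ", " and rows are separated by newlines.

```rust
/// Formats already-parsed CSV data as a single document.
/// Each row becomes one line of "header: value" pairs joined by ", ".
/// ASSUMPTION: the pair separator is a guess; the proposal only fixes
/// the "header: value" shape and the newline separation.
pub fn rows_to_document(headers: &[&str], rows: &[Vec<&str>]) -> String {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter()) // extra fields without a header are dropped
                .map(|(header, value)| format!("{header}: {value}"))
                .collect::<Vec<_>>()
                .join(", ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

For headers `name, age` and the rows `Ada, 36` and `Grace, 45`, this yields the two lines `name: Ada, age: 36` and `name: Grace, age: 45`.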

Alternatives

  1. Row-based Embedding: Instead of creating a single document, we could create separate embeddings for each row. This would allow for more granular retrieval but might increase processing time and storage requirements.

    Drawbacks: Increased complexity in implementation and potential performance impact for large CSV files.

  2. Using a pandas-like library: We could use a more full-featured data processing library such as polars to handle CSV files, which might offer more advanced features for data manipulation.

    Drawbacks: Introduces a heavier dependency, which might not be necessary for simple CSV processing.

  3. Custom parsing without csv crate: We could implement CSV parsing without relying on the csv crate, giving us more control over the parsing process.

    Drawbacks: Reinventing the wheel, potentially introducing bugs, and increasing maintenance burden.
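For comparison, alternative 1 (row-based embedding) would differ mainly in the output shape: one document per row instead of one document per file. A rough sketch, again assuming pre-parsed input and using the hypothetical helper name `rows_to_documents`:

```rust
/// Produces one document per CSV row, with each "header: value" pair
/// on its own line. Every returned string would be embedded separately.
/// ASSUMPTION: input is already parsed; the name and layout are illustrative.
pub fn rows_to_documents(headers: &[&str], rows: &[Vec<&str>]) -> Vec<String> {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter())
                .map(|(header, value)| format!("{header}: {value}"))
                .collect::<Vec<_>>()
                .join("\n")
        })
        .collect()
}
```

The number of embeddings then grows linearly with the row count, which is the storage and processing cost noted in the drawbacks.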

The proposed solution was chosen because it offers a good balance between simplicity, performance, and flexibility. It leverages the well-tested csv crate for parsing while allowing for future enhancements if more advanced features are needed.

@Tachikoma000 Tachikoma000 self-assigned this Sep 19, 2024
@0xMochan 0xMochan changed the title feat: Add CSV Loader to Document Loaders in Rig feat: CsvFileLoader Dec 19, 2024
@0xMochan 0xMochan changed the title feat: CsvFileLoader feat: CSV loader support Dec 19, 2024