Update fixed.py (#1704)
The `chunk` method did not take the overlap length into account in its
`while` loop condition, so the loop could run endlessly and exhaust
memory (OOM). This change adds the following:
1. If the current document is no longer than the overlap, it is not
chunked and is returned as-is.
2. The loop now checks that `start + overlap` is less than the content
length, to avoid endless chunking.

Fixes #1703 

## Type of change

Please check the options that are relevant:

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] Model update
- [ ] Infrastructure change

## Checklist

- [x] My code follows Phidata's style guidelines and best practices
- [x] I have performed a self-review of my code
- [ ] I have added docstrings and comments for complex logic
- [x] My changes generate no new warnings or errors
- [ ] I have added cookbook examples for my new addition (if needed)
- [ ] I have updated requirements.txt/pyproject.toml (if needed)
- [x] I have verified my changes in a clean environment
Gruhit13 authored Jan 7, 2025
1 parent 2e64aa5 commit c583a95
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion phi/document/chunking/fixed.py
@@ -24,8 +24,13 @@ def chunk(self, document: Document) -> List[Document]:
         chunk_number = 1
         chunk_meta_data = document.meta_data
 
+        # If the document length is less than overlap, it cannot be chunked.
+        if len(content) <= self.overlap:
+            return [document]
+
+        # run the chunking only if the length of the content is greater than the overlap.
         start = 0
-        while start < content_length:
+        while start + self.overlap < content_length:
             end = min(start + self.chunk_size, content_length)
 
             # Ensure we're not splitting a word in half
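
A minimal stand-alone sketch of the guarded loop, to illustrate why the new condition terminates. `fixed_chunks` and its parameters are hypothetical names, not part of `phi`. The advance rule `start = end - overlap` is an assumption, since the rest of `chunk` is not shown in this diff, and the word-boundary adjustment is left out for brevity.

```python
from typing import List


def fixed_chunks(content: str, chunk_size: int, overlap: int) -> List[str]:
    """Hypothetical sketch of fixed-size chunking with overlap (not the phi API)."""
    if overlap >= chunk_size:
        # Without this, the window could fail to move forward.
        raise ValueError("overlap must be smaller than chunk_size")

    content_length = len(content)

    # Guard 1: content no longer than the overlap is returned whole.
    if content_length <= overlap:
        return [content]

    chunks: List[str] = []
    start = 0
    # Guard 2: stop once only the already-emitted overlap tail remains.
    # With the old condition (start < content_length), the assumed advance
    # rule below would leave start stuck at content_length - overlap.
    while start + overlap < content_length:
        end = min(start + chunk_size, content_length)
        chunks.append(content[start:end])
        # Assumed advance rule: step back by `overlap` from the chunk end.
        start = end - overlap
    return chunks


if __name__ == "__main__":
    print(fixed_chunks("abcdefghij", chunk_size=4, overlap=2))
    print(fixed_chunks("ab", chunk_size=4, overlap=2))
```

Under that assumed advance rule, once `end` reaches `content_length` the next `start` equals `content_length - overlap`, which still satisfies the old condition but fails the new one, so the loop exits.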