Is it OK to chunk other languages like Vietnamese? #7

Open
vanhdz2611 opened this issue Nov 27, 2024 · 2 comments

Comments

@vanhdz2611

No description provided.

@Robot2050
Collaborator

Robot2050 commented Nov 27, 2024

Thank you for your interest in this work! According to the Qwen2 technical report, this model series supports Vietnamese. However, our experiments were conducted primarily in Chinese and English. To chunk Vietnamese text, you will need to modify the split_text_by_punctuation(text, language) function in the chunking method you intend to use, replacing it with one that supports Vietnamese sentence segmentation. This function splits a long text into individual sentences and returns them as a list. No other changes are required. For now, this is the approach we recommend. We are also aware of some limitations in the current methods and are working on further optimizations, so feel free to keep following our progress.

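  • Using nltk for sentence segmentation (nltk's bundled Punkt models do not appear to include Vietnamese, so this falls back to the English model)
import nltk

nltk.download('punkt')      # Punkt models used by older nltk releases
nltk.download('punkt_tab')  # Punkt models used by newer nltk releases

def split_text_by_punctuation(text, language='english'):
    # Return a list of sentences. If no Punkt model exists for `language`
    # (e.g. 'vietnamese'), fall back to the English model.
    try:
        return nltk.tokenize.sent_tokenize(text, language=language)
    except LookupError:
        return nltk.tokenize.sent_tokenize(text, language='english')

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau học nhé!"
sentences = split_text_by_punctuation(text, language='vietnamese')
print(sentences)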
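  • Using re (regular expressions) for simpler sentence segmentation
import re

def split_text_by_punctuation(text, language=None):
    # Split after '.', '!', '?' or ';' when followed by whitespace.
    # `language` is unused; it only keeps the original call signature.
    sentences = re.split(r'(?<=[.!?;])\s+', text.strip())
    return [s for s in sentences if s]

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau; học nhé!"
sentences = split_text_by_punctuation(text)
print(sentences)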
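
One caveat with the regex approach is that it also splits after abbreviations that end in a period, such as "TP." in "TP. Hồ Chí Minh". The following is a minimal sketch of one way to work around that, not part of this repository: it splits as before and then re-joins any piece that ends in a known abbreviation. The ABBREVIATIONS set is a hypothetical, non-exhaustive list chosen for illustration, and the `language` parameter is kept only to match the signature mentioned above.

```python
import re

# Hypothetical, non-exhaustive list of Vietnamese abbreviations (assumption);
# extend it for your own corpus.
ABBREVIATIONS = {"TP.", "ThS.", "TS.", "PGS.", "GS.", "BS."}

def split_text_by_punctuation(text, language="vi"):
    """Split text into sentences, keeping known abbreviations attached."""
    # `language` is unused here; it only mirrors the original signature.
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = []
    carry = ""  # text held back because it ended in an abbreviation
    for piece in pieces:
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1] in ABBREVIATIONS:
            # Not a real sentence end: keep accumulating.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

sample = "Tôi sống ở TP. Hồ Chí Minh. Bạn có khỏe không? Giá là 3.5 triệu đồng!"
sentences = split_text_by_punctuation(sample)
print(sentences)
```

Decimal numbers such as "3.5" are left intact because the split only happens when whitespace follows the punctuation mark. A dedicated Vietnamese NLP library would likely handle more edge cases than this list-based heuristic.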

@vanhdz2611
Author

Thanks for your reply, I will try it later.
