Thank you for your interest in this work! According to the Qwen2 technical report, this series of models supports Vietnamese; however, our experiments were conducted primarily in Chinese and English. You therefore only need to modify the `split_text_by_punctuation(text, language)` function in the chunking method you intend to use, replacing it with one that supports Vietnamese sentence segmentation; no other changes are required. The purpose of this function is to split a long text into individual sentences and return them as a list. For now, that is the recommended way to address the issue. We are also aware of some limitations in the current methods and are working on further optimizations, so feel free to follow our progress.
Using nltk for sentence segmentation:

```python
import nltk

nltk.download('punkt')

def split_text_by_punctuation(text):
    # NLTK's Punkt models do not include Vietnamese, so passing
    # language='vietnamese' would raise a LookupError. We use the default
    # (English) model instead; Vietnamese is written in Latin script with
    # the same sentence-ending punctuation, so it works reasonably well.
    sentences = nltk.tokenize.sent_tokenize(text)
    return sentences

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau học nhé!"
sentences = split_text_by_punctuation(text)
print(sentences)
```
Using `re` (regular expressions) for simpler sentence segmentation:

```python
import re

def split_text_by_punctuation(text):
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?;])\s+', text)
    return sentences

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau; học nhé!"
sentences = split_text_by_punctuation(text)
print(sentences)
```
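Note that the chunking code calls `split_text_by_punctuation(text, language)` with two arguments, so the replacement should keep that signature. Below is a minimal sketch of a drop-in wrapper that dispatches on the `language` argument; the `'vi'` language code and the regex fallback for other languages are assumptions for illustration, not part of the original codebase:

```python
import re

def split_text_by_punctuation(text, language):
    # Hypothetical drop-in replacement preserving the original
    # two-argument signature; 'vi' as the language code is an assumption.
    if language == 'vi':
        # Vietnamese uses Latin-script punctuation, so splitting after
        # sentence-ending marks is a workable baseline.
        sentences = re.split(r'(?<=[.!?;])\s+', text)
    else:
        # Fallback covering both Latin and CJK full-width punctuation;
        # swap in the project's original logic for Chinese/English here.
        sentences = re.split(r'(?<=[.!?;。！？])\s*', text)
    return [s.strip() for s in sentences if s.strip()]

print(split_text_by_punctuation("Chào bạn. Bạn có khỏe không?", "vi"))
```

Each branch returns a list of sentence strings, which is all the downstream chunking logic expects.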