Is it OK to chunk other languages like Vietnamese? #7

vanhdz2611 · 2024-11-27T04:27:37Z

No description provided.

Robot2050 · 2024-11-27T05:15:51Z

Thank you for your interest in this work! According to the Qwen2 technical report, this series of large models supports Vietnamese. However, our experiments were conducted primarily in Chinese and English. To chunk Vietnamese text, replace the split_text_by_punctuation(text, language) function in the chunking method you intend to use with one that supports Vietnamese sentence segmentation. No other modifications are required. The function divides a long text into individual sentences and returns them as a list. For now, this is the way to address the issue. We are also aware of some limitations in the current methods and are working on further optimization, so feel free to follow our progress.

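Using NLTK for sentence segmentation

import nltk

# Older NLTK releases load the 'punkt' resource; newer ones load 'punkt_tab'.
nltk.download('punkt')
nltk.download('punkt_tab')

def split_text_by_punctuation(text, language=None):
    # `language` is accepted only to match the original signature and is ignored.
    # NLTK ships no Vietnamese Punkt model, so language='vietnamese' raises a
    # LookupError. The English model is used as an approximation, since
    # Vietnamese uses the same sentence-ending punctuation.
    sentences = nltk.tokenize.sent_tokenize(text, language='english')
    return sentences

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau học nhé!"
sentences = split_text_by_punctuation(text)
print(sentences)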

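Using re (regular expressions) for simpler sentence segmentation

import re

def split_text_by_punctuation(text, language=None):
    # `language` is accepted only to match the original signature and is ignored.
    # Split on whitespace that follows '.', '!', '?' or ';', keeping the
    # punctuation attached to the preceding sentence.
    sentences = re.split(r'(?<=[.!?;])\s+', text.strip())
    # Drop empty strings, e.g. when the input is empty.
    return [s for s in sentences if s]

text = "Chào bạn. Bạn có khỏe không? Tôi đang học lập trình Python. Hãy cùng nhau; học nhé!"
sentences = split_text_by_punctuation(text)
print(sentences)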
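
If the same chunking code has to handle Chinese, English, and Vietnamese, one option is to select the sentence-boundary pattern from the language argument. The following is a minimal sketch under stated assumptions: the language codes "zh", "en", and "vi" and the punctuation sets are illustrative choices, not values taken from this project, and the function is a hypothetical stand-in for split_text_by_punctuation(text, language) rather than the original implementation.

```python
import re

# Sentence-boundary patterns per language code. Both the codes and the
# punctuation sets are illustrative assumptions, not taken from the project.
_SENTENCE_END = {
    "zh": r"(?<=[。！？；])",      # Chinese: full-width marks, no whitespace required
    "en": r"(?<=[.!?])\s+",       # English: ASCII marks followed by whitespace
    "vi": r"(?<=[.!?…])\s+",      # Vietnamese: same marks as English, plus ellipsis
}

def split_text_by_punctuation(text, language):
    """Split text into a list of sentences for the given language code."""
    pattern = _SENTENCE_END.get(language)
    if pattern is None:
        raise ValueError(f"unsupported language: {language!r}")
    # Zero-width splits need Python 3.7+; strip whitespace and drop empty pieces.
    return [s.strip() for s in re.split(pattern, text) if s.strip()]

print(split_text_by_punctuation("Chào bạn. Bạn có khỏe không? Hãy cùng nhau học nhé!", "vi"))
```

A pure punctuation rule like this will also split after abbreviations and inside decimal numbers, so a dedicated Vietnamese sentence segmenter may be preferable for noisy text.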

vanhdz2611 · 2024-11-28T02:28:20Z

Thanks for your reply. I will try it later.
