Multi-language tokenization: English and Tibetan #492
Unanswered · moksamedia asked this question in Q&A
I want to search a dataset that includes English and Tibetan. I would like to tokenize both languages appropriately, which means using different tokenizers. Is it best to create two indexes, one for Tibetan and one for English? Or to build a hybrid tokenizer that tokenizes English as English and Tibetan as Tibetan but uses the same index?
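To make the hybrid idea concrete, here is a rough sketch of a script-routing tokenizer. It is not tied to any particular search library: it detects runs of Tibetan characters (U+0F00–U+0FFF), splits them on the tsheg syllable delimiter (U+0F0B), and tokenizes the remaining Latin text by word boundaries. Real Tibetan word segmentation typically needs a dedicated segmenter rather than a plain syllable split, so treat this only as an illustration of the routing approach.

```python
import re

# Tibetan script occupies U+0F00..U+0FFF; the tsheg (U+0F0B) separates syllables.
TIBETAN_RUN = re.compile(r"[\u0F00-\u0FFF]+")
ENGLISH_WORD = re.compile(r"[A-Za-z0-9']+")
TSHEG = "\u0F0B"

def tokenize_mixed(text: str) -> list[str]:
    """Route each script to its own tokenizer and return a single token stream."""
    tokens = []
    pos = 0
    for match in TIBETAN_RUN.finditer(text):
        # Text before this Tibetan run is treated as English/Latin.
        tokens.extend(ENGLISH_WORD.findall(text[pos:match.start()]))
        # Split the Tibetan run on the tsheg syllable delimiter.
        tokens.extend(s for s in match.group().split(TSHEG) if s)
        pos = match.end()
    tokens.extend(ENGLISH_WORD.findall(text[pos:]))
    return tokens

print(tokenize_mixed("The greeting བཀྲ་ཤིས་བདེ་ལེགས means hello"))
# -> ['The', 'greeting', 'བཀྲ', 'ཤིས', 'བདེ', 'ལེགས', 'means', 'hello']
```

Whether this should feed one shared index or two separate ones mostly depends on whether queries will mix scripts: a single index with script-routed analysis keeps mixed queries simple, while separate indexes let each language use its own analysis settings end to end.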