Multi-language tokenization: English and Tibetan #492
Unanswered · moksamedia asked this question in Q&A
I want to search a dataset that includes English and Tibetan. I would like to tokenize both languages appropriately, which means using different tokenizers. Is it best to create two indexes, one for Tibetan and one for English? Or to build a hybrid tokenizer that tokenizes English as English and Tibetan as Tibetan but uses the same index?
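To make the hybrid idea concrete, here is a rough sketch of a script-routing tokenizer. It is not tied to any particular search library: it detects runs of Tibetan characters (U+0F00–U+0FFF), splits them on the tsheg syllable delimiter (U+0F0B), and tokenizes the remaining Latin text by word boundaries. Real Tibetan word segmentation typically needs a dedicated segmenter rather than a plain syllable split, so treat this only as an illustration of the routing approach.

```python
import re

# Tibetan script occupies U+0F00..U+0FFF; the tsheg (U+0F0B) separates syllables.
TIBETAN_RUN = re.compile(r"[\u0F00-\u0FFF]+")
ENGLISH_WORD = re.compile(r"[A-Za-z0-9']+")
TSHEG = "\u0F0B"

def tokenize_mixed(text: str) -> list[str]:
    """Route each script to its own tokenizer and return a single token stream."""
    tokens = []
    pos = 0
    for match in TIBETAN_RUN.finditer(text):
        # Text before this Tibetan run is treated as English/Latin.
        tokens.extend(ENGLISH_WORD.findall(text[pos:match.start()]))
        # Split the Tibetan run on the tsheg syllable delimiter.
        tokens.extend(s for s in match.group().split(TSHEG) if s)
        pos = match.end()
    tokens.extend(ENGLISH_WORD.findall(text[pos:]))
    return tokens

print(tokenize_mixed("The greeting བཀྲ་ཤིས་བདེ་ལེགས means hello"))
# -> ['The', 'greeting', 'བཀྲ', 'ཤིས', 'བདེ', 'ལེགས', 'means', 'hello']
```

Whether this should feed one shared index or two separate ones mostly depends on whether queries will mix scripts: a single index with script-routed analysis keeps mixed queries simple, while separate indexes let each language use its own analysis settings end to end.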