Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish

Date

2025

Publisher

IEEE

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentage (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that the language-specific token percentage exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.
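The abstract's %TR metric (language-specific token percentage) can be illustrated with a toy sketch. The precise definition, the lexicon, and the `##` subword-marker convention below are assumptions for illustration only, not the study's actual implementation or resources:

```python
# Hypothetical sketch of a %TR-style metric: the share of a tokenizer's
# output tokens that appear in a Turkish lexicon. Both the lexicon and the
# token list are toy stand-ins, not the study's data.

def language_specific_pct(tokens, lexicon):
    """Percentage of tokens that are entries of the target-language lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return 100.0 * hits / len(tokens)

# Toy Turkish lexicon and a subword-tokenized sample (assumed format).
turkish_lexicon = {"dil", "model", "ve", "ler", "leri"}
tokens = ["dil", "##ler", "ve", "model", "##leri"]
cleaned = [t.removeprefix("##") for t in tokens]  # strip subword markers
print(round(language_specific_pct(cleaned, turkish_lexicon), 1))  # 100.0
```

A real evaluation would substitute an actual Turkish morphological lexicon and the tokenizers under study; the point here is only that the metric reduces to a lexicon-membership ratio over the token stream.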

Description

33rd Signal Processing and Communications Applications Conference (SIU), June 25-28, 2025, Istanbul, Türkiye

Keywords

Tokenization, Large Language Models (LLM), Natural Language Processing (NLP), Turkish NLP

Source

2025 33rd Signal Processing and Communications Applications Conference (SIU)

WoS Q Value

N/A

Scopus Q Value

N/A
