Authors: Bayram, M. Ali; Fincan, Ali Arda; Gumus, Ahmet Semih; Karakas, Sercan; Diri, Banu; Yildirim, Savas
Date issued: 2025
Date accessioned/available: 2026-04-04
Title: Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish
Type: Conference Object
Conference: 33rd Conference on Signal Processing and Communications Applications (SIU), 25-28 June 2025, Istanbul, Turkiye
ISBN: 979-8-3315-6656-2; 979-8-3315-6655-5
ISSN: 2165-0608
DOI: https://doi.org/10.1109/SIU66497.2025.11112220
Handle: https://hdl.handle.net/11411/10590
Language: Turkish (tr)
Rights: info:eu-repo/semantics/openAccess
Keywords: Tokenization; Large Language Models (LLM); Natural Language Processing (NLP); Turkish NLP
Scopus ID: 2-s2.0-105015414864
WOS ID: WOS:001575462500250

Abstract: Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentage (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.