Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish
| dc.contributor.author | Bayram, M. Ali | |
| dc.contributor.author | Fincan, Ali Arda | |
| dc.contributor.author | Gumus, Ahmet Semih | |
| dc.contributor.author | Karakas, Sercan | |
| dc.contributor.author | Diri, Banu | |
| dc.contributor.author | Yildirim, Savas | |
| dc.date.accessioned | 2026-04-04T18:55:51Z | |
| dc.date.available | 2026-04-04T18:55:51Z | |
| dc.date.issued | 2025 | |
| dc.department | İstanbul Bilgi Üniversitesi | |
| dc.description | 33rd Conference on Signal Processing and Communications Applications (SIU) -- JUN 25-28, 2025 -- Istanbul, TURKIYE | |
| dc.description.abstract | Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages. | |
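The abstract describes assessing tokenizers via token count, language-specific token percentage (%TR), and token purity (%Pure). A minimal sketch of such an evaluation loop is given below; the exact metric definitions from the paper are not reproduced here, so this assumes %TR is the share of produced tokens found in a Turkish word list and %Pure is the share of tokens composed solely of Turkish letters — both are illustrative assumptions, as is the toy whitespace tokenizer.

```python
# Hypothetical sketch of the tokenizer metrics named in the abstract.
# ASSUMPTIONS (not from the source): %TR = share of tokens present in a
# Turkish lexicon; %Pure = share of tokens made only of Turkish letters.

TURKISH_LETTERS = set("abcçdefgğhıijklmnoöprsştuüvyz")

def evaluate_tokenizer(tokenize, corpus, turkish_lexicon):
    """Compute token count, %TR, and %Pure for a tokenize(text) -> list[str] callable."""
    tokens = [t for text in corpus for t in tokenize(text)]
    total = len(tokens)
    tr_hits = sum(1 for t in tokens if t.lower() in turkish_lexicon)
    pure_hits = sum(1 for t in tokens if t and set(t.lower()) <= TURKISH_LETTERS)
    return {
        "token_count": total,
        "pct_tr": 100.0 * tr_hits / total if total else 0.0,
        "pct_pure": 100.0 * pure_hits / total if total else 0.0,
    }

if __name__ == "__main__":
    # Toy corpus, lexicon, and whitespace tokenizer, for illustration only.
    corpus = ["dil modeli eğitimi", "tokenization test"]
    lexicon = {"dil", "modeli", "eğitimi", "test"}
    print(evaluate_tokenizer(str.split, corpus, lexicon))
```

In practice the same loop would be run over the TR-MMLU questions with each model's subword tokenizer in place of `str.split`, and the resulting %TR values correlated against downstream MMLU scores as the paper proposes.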
| dc.description.sponsorship | Institute of Electrical and Electronics Engineers Inc | |
| dc.identifier.doi | 10.1109/SIU66497.2025.11112220 | |
| dc.identifier.isbn | 979-8-3315-6656-2 | |
| dc.identifier.isbn | 979-8-3315-6655-5 | |
| dc.identifier.issn | 2165-0608 | |
| dc.identifier.scopus | 2-s2.0-105015414864 | |
| dc.identifier.scopusquality | N/A | |
| dc.identifier.uri | https://doi.org/10.1109/SIU66497.2025.11112220 | |
| dc.identifier.uri | https://hdl.handle.net/11411/10590 | |
| dc.identifier.wos | WOS:001575462500250 | |
| dc.identifier.wosquality | N/A | |
| dc.indekslendigikaynak | Web of Science | |
| dc.indekslendigikaynak | Scopus | |
| dc.language.iso | tr | |
| dc.publisher | IEEE | |
| dc.relation.ispartof | 2025 33rd Signal Processing and Communications Applications Conference (SIU) | |
| dc.relation.publicationcategory | Conference Item - International - Institutional Faculty Member | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.snmz | KA_WoS_20260402 | |
| dc.snmz | KA_Scopus_20260402 | |
| dc.subject | Tokenization | |
| dc.subject | Large Language Models (LLM) | |
| dc.subject | Natural Language Processing (NLP) | |
| dc.subject | Turkish NLP | |
| dc.title | Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish | |
| dc.type | Conference Object |