Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish

dc.contributor.author: Bayram, M. Ali
dc.contributor.author: Fincan, Ali Arda
dc.contributor.author: Gumus, Ahmet Semih
dc.contributor.author: Karakas, Sercan
dc.contributor.author: Diri, Banu
dc.contributor.author: Yildirim, Savas
dc.date.accessioned: 2026-04-04T18:55:51Z
dc.date.available: 2026-04-04T18:55:51Z
dc.date.issued: 2025
dc.department: İstanbul Bilgi Üniversitesi
dc.description: 33rd Annual Signal Processing and Communications Applications Conference (SIU), June 25-28, 2025, Istanbul, Türkiye
dc.description.abstract: Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.
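The abstract's two proposed metrics can be illustrated with a short sketch. This is not the authors' code; it assumes a plausible reading of the metrics, namely that %TR is the share of produced tokens recognizable as Turkish-language strings and %Pure the share that are whole valid words rather than subword fragments. The toy lexicon and token list are invented for illustration.

```python
# Hypothetical sketch of the %TR and %Pure tokenizer metrics described
# in the abstract; the exact definitions in the paper may differ.

TURKISH_CHARS = set("abcçdefgğhıijklmnoöprsştuüvyz")

# Toy stand-in lexicon; a real evaluation would use a full Turkish dictionary.
LEXICON = {"okul", "kitap", "ev", "ler", "lar", "okullar"}

def pct_tr(tokens):
    """Share of tokens composed only of Turkish letters (assumed %TR)."""
    tr = [t for t in tokens if t and set(t.lower()) <= TURKISH_CHARS]
    return 100.0 * len(tr) / len(tokens)

def pct_pure(tokens):
    """Share of tokens that are whole lexicon words (assumed %Pure)."""
    pure = [t for t in tokens if t.lower() in LEXICON]
    return 100.0 * len(pure) / len(tokens)

# Example tokenizer output: three Turkish pieces and one noise token.
tokens = ["okul", "lar", "##x1", "kitap"]
print(round(pct_tr(tokens), 1))    # 75.0
print(round(pct_pure(tokens), 1))  # 75.0
```

Under this reading, a tokenizer that splits words into linguistically meaningful Turkish units scores high on both metrics, while one that emits byte-level or mixed-script fragments scores low.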
dc.description.sponsorship: Institute of Electrical and Electronics Engineers Inc.
dc.identifier.doi: 10.1109/SIU66497.2025.11112220
dc.identifier.isbn: 979-8-3315-6656-2
dc.identifier.isbn: 979-8-3315-6655-5
dc.identifier.issn: 2165-0608
dc.identifier.scopus: 2-s2.0-105015414864
dc.identifier.scopusquality: N/A
dc.identifier.uri: https://doi.org/10.1109/SIU66497.2025.11112220
dc.identifier.uri: https://hdl.handle.net/11411/10590
dc.identifier.wos: WOS:001575462500250
dc.identifier.wosquality: N/A
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: tr
dc.publisher: IEEE
dc.relation.ispartof: 2025 33rd Signal Processing and Communications Applications Conference (SIU)
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_WoS_20260402
dc.snmz: KA_Scopus_20260402
dc.subject: Tokenization
dc.subject: Large Language Models (LLM)
dc.subject: Natural Language Processing (NLP)
dc.subject: Turkish NLP
dc.title: Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish
dc.type: Conference Object
