Yazar "Diri, Banu" seçeneğine göre listele
Now showing 1 - 11 of 11
Item: A Hybrid Method for Extracting Turkish Part-Whole Relation Pairs from Corpus (IEEE, 2016)
Sahin, Gurkan; Diri, Banu; Yildiz, Tugba
Extraction of various semantic relation pairs from different sources (dictionary definitions, corpora, etc.) with high accuracy is one of the most popular topics in natural language processing (NLP). In this study, a hybrid method is proposed to extract Turkish part-whole pairs from a corpus, using corpus statistics, WordNet similarities, and Word2Vec word-vector similarities together. First, initial part-whole seeds are prepared, and part-whole patterns are extracted from the corpus using these seeds. A reliability score is calculated for each pattern, and the reliable patterns are selected to produce new pairs from the corpus; various reliability scores are then used to rank the new pairs. To measure the success of the method, 19 target whole words were selected, and average precisions of 83% (first 10 pairs), 74% (first 20 pairs), and 68% (first 30 pairs) were obtained.

Item: A Study on Turkish Meronym Extraction Using a Variety of Lexico-Syntactic Patterns (Springer International Publishing AG, 2016)
Yildiz, Tugba; Yildirim, Savas; Diri, Banu
In this paper, we apply lexico-syntactic patterns to disclose the meronymy relation in a huge raw Turkish text. Once the system takes a huge raw corpus and extracts the matches for a given pattern, it proposes a list of whole-part pairs ranked by their co-occurrence frequencies. For this purpose, we exploit and compare several pattern clusters, which fall into three types: general patterns, dictionary-based patterns, and bootstrapped patterns. We evaluate how these patterns improve system performance, especially within a corpus-based approach using the distributional features of words. Finally, we discuss all the experiments with a comparative analysis and show the advantages and disadvantages of the approaches, with promising results.

Item: Acquisition of Turkish meronym based on classification of patterns (Springer, 2016)
Yildiz, Tugba; Diri, Banu; Yildirim, Savas
The identification of semantic relations in raw text is an important problem in Natural Language Processing. This paper provides semi-automatic pattern-based extraction of part-whole relations. We utilize and adapt lexico-syntactic patterns to disclose the meronymy relation in a Turkish corpus, preparing the patterns with two different approaches: the first uses pre-defined patterns taken from the literature, while the second produces patterns automatically by means of a bootstrapping method. Pre-defined patterns are applied to the corpus directly; the bootstrapped patterns must first be discovered from manually prepared unambiguous seeds. Word pairs are then extracted by their occurrence in those patterns. In addition, we apply statistical selection to global data obtained from the results of all patterns: a whole-by-part matrix to which several association metrics, such as information gain and T-score, are applied. We examine how all these approaches improve system accuracy, especially within a corpus-based approach using the distributional features of words. Finally, we conduct a variety of experiments with a comparative analysis and show the advantages and disadvantages of the approaches, with promising results. (A sketch of the shared pattern-extraction pipeline follows this entry.)
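The three part-whole entries above share the same pipeline: match lexico-syntactic patterns against a raw corpus, score each pattern's reliability against seed pairs, and harvest new pairs from the reliable patterns. Below is a minimal, hedged sketch of that pipeline; the English toy corpus, the single "X of Y" pattern, and the seed-overlap reliability score are illustrative assumptions, not the papers' actual Turkish resources or scoring formulas.

```python
# Toy pattern-based part-whole extraction with a simple reliability score.
import re
from collections import Counter

# Stand-in corpus; a real system would stream a large raw Turkish corpus.
corpus = [
    "the engine of the car failed",
    "the wheel of the car was replaced",
    "the roof of the house leaked",
]

# Lexico-syntactic pattern with (part, whole) capture groups (assumed form).
pattern = re.compile(r"the (\w+) of the (\w+)")

# Manually prepared seed pairs, as in the papers' bootstrapping step.
seeds = {("engine", "car"), ("roof", "house")}

def extract_pairs(pat, sentences):
    """Count every (part, whole) pair the pattern matches in the corpus."""
    pairs = Counter()
    for s in sentences:
        for m in pat.finditer(s):
            pairs[(m.group(1), m.group(2))] += 1
    return pairs

def reliability(pat, sentences, seed_pairs):
    """Fraction of extractions that are known seeds -- one simple stand-in
    for the per-pattern reliability scores the papers describe."""
    pairs = extract_pairs(pat, sentences)
    total = sum(pairs.values())
    hits = sum(c for p, c in pairs.items() if p in seed_pairs)
    return hits / total if total else 0.0

if reliability(pattern, corpus, seeds) > 0.5:  # keep only reliable patterns
    print(extract_pairs(pattern, corpus))      # harvest candidate pairs
```

In the papers, the harvested candidates are then filtered further, e.g. by Word2Vec/WordNet similarity or by association metrics over the whole-by-part matrix, before being accepted.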
Item: An Integrated Approach to Automatic Synonym Detection in Turkish Corpus (Springer International Publishing AG, 2014)
Yildiz, Tugba; Yildirim, Savas; Diri, Banu
In this study, we design a model to determine synonymy. Our main assumption is that, by definition, synonym pairs show similar semantic and dependency relations: they share the same meronym/holonym and hypernym/hyponym relations. Unlike synonymy, hypernymy and meronymy relations can be acquired by applying lexico-syntactic patterns to a big corpus, and such acquisitions can ease the detection of synonymy. Likewise, we utilize particular dependency relations, such as being the object or subject of a verb. Machine learning algorithms are applied to all of these acquired features. The first aim is to find out which dependency and semantic features are the most informative and contribute most to the model. The performance of each feature is evaluated individually with cross-validation. The model that combines all features shows promising results and successfully detects the synonymy relation. The main contribution of the study is the integration of both semantic and dependency relations within a distributional approach; the second is being the first major attempt at corpus-driven synonym identification for Turkish.

Item: Healthcare-Focused Turkish Medical LLM: Training on Real Patient-Doctor Question-Answer Data for Enhanced Medical Insight (Assoc Computing Machinery, 2025)
Bayram, M. Ali; Diri, Banu; Yildirim, Savas
The development of a Turkish-specific Large Language Model (LLM) for healthcare presents a unique opportunity to enhance AI's accessibility and relevance for Turkish-speaking medical practitioners and patients. This study introduces a specialized Turkish medical LLM fine-tuned on over 167,732 real patient-doctor question-answer pairs sourced from a trusted medical platform, capturing the authentic linguistics of Turkish medical language. Utilizing models such as LLAMA 3, the fine-tuning process was supported by Low-Rank Adaptation (LoRA) and involved innovative methods to mitigate catastrophic forgetting, including spherical linear interpolation (Slerp) merging (see the Slerp sketch after the next entry). Evaluation of the model's performance through similarity scores, GPT-3.5 assessments, and expert reviews indicates a significant improvement in the model's ability to generate medically accurate responses. This Turkish medical LLM demonstrates the potential to support medical decision-making and patient interaction in Turkish healthcare settings, offering an essential resource for enhancing AI inclusivity across languages.

Item: Pattern and Semantic Similarity Based Automatic Extraction of Hyponym-Hypernym Relation from Turkish Corpus (IEEE, 2015)
Sahin, Gurkan; Diri, Banu; Yildiz, Tugba
The extraction of semantic relations from various resources (Wikipedia, the Web, corpora, etc.) is an important issue in natural language processing. This paper aims at the automatic extraction of hyponym-hypernym pairs from a Turkish corpus, using pattern-based and semantic-similarity-based methods together. Patterns are extracted from initial hyponym-hypernym pairs, and hyponyms are then extracted for various hypernyms using these patterns. Incorrect candidate hyponyms are removed using document-frequency and semantic-similarity-based elimination methods. In experiments with 14 hypernyms, an average accuracy of 77% was obtained.
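The medical-LLM entry above mentions spherical linear interpolation (Slerp) merging as a way to mitigate catastrophic forgetting when combining a base and a fine-tuned checkpoint. The sketch below shows the standard Slerp formula on two toy weight vectors in plain NumPy; applying it per tensor across two real checkpoints, and the choice of interpolation factor, are assumptions, not details taken from the paper.

```python
# Spherical linear interpolation (Slerp) between two flattened weight tensors.
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Interpolate along the arc between the two weight directions.

    t=0 returns w_a, t=1 returns w_b; intermediate t stays on the arc,
    unlike plain linear interpolation, which cuts through the chord.
    """
    a = w_a / (np.linalg.norm(w_a) + eps)
    b = w_b / (np.linalg.norm(w_b) + eps)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(dot)               # angle between the two directions
    if omega < eps:                      # nearly parallel: fall back to lerp
        return (1 - t) * w_a + t * w_b
    sin_omega = np.sin(omega)
    return (np.sin((1 - t) * omega) / sin_omega) * w_a + \
           (np.sin(t * omega) / sin_omega) * w_b

base = np.array([1.0, 0.0, 0.0])   # e.g. one tensor of the base model
tuned = np.array([0.0, 1.0, 0.0])  # the corresponding fine-tuned tensor
print(slerp(base, tuned, 0.5))     # halfway along the arc
```

For a full LLM merge, this function would be applied to every matching tensor pair in the two checkpoints, which is what off-the-shelf merging tools automate.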
Item: Tokenization Standards and Evaluation in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish (IEEE, 2025)
Bayram, M. Ali; Fincan, Ali Arda; Gumus, Ahmet Semih; Karakas, Sercan; Diri, Banu; Yildirim, Savas
Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically rich, low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assess tokenizers based on vocabulary size, token count, processing time, language-specific token percentage (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that the language-specific token percentage exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages. (A sketch of the two metrics follows the next entry.)

Item: TR-MMLU Benchmark for Large Language Models: Performance Evaluation, Challenges, and Opportunities for Improvement (IEEE, 2025)
Bayram, M. Ali; Fincan, Ali Arda; Gumus, Ahmet Semih; Diri, Banu; Yildirim, Savas; Aytas, Oner
Language models have made significant advances in understanding and generating human language, achieving remarkable success in various applications. However, evaluating these models remains a challenge, particularly for resource-limited languages such as Turkish. To address this issue, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is based on a meticulously curated dataset comprising 6,200 multiple-choice questions across 62 sections of the Turkish education system. The benchmark provides a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text. In this study, we evaluate state-of-the-art LLMs on TR-MMLU and highlight areas for improvement in model design. TR-MMLU sets a new standard for advancing Turkish NLP research and inspiring future innovations.
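The tokenization entry above proposes a language-specific token percentage (%TR) and token purity (%Pure). The sketch below shows one plausible reading of those metrics as lexicon-membership ratios over a tokenizer's output; the toy word and morpheme lists and the whitespace "tokenizer" are placeholders, and the paper's exact definitions and resources may differ.

```python
# Hedged sketch of %TR and %Pure as lexicon-membership ratios.
turkish_vocabulary = {"kitap", "okul", "ev", "lar", "da"}  # toy TR word list
valid_morphemes = {"kitap", "okul", "ev", "lar", "da"}     # toy morpheme list

def tokenize(text: str) -> list[str]:
    # Stand-in for a real subword tokenizer (e.g. a SentencePiece model).
    return text.split()

def percent_tr(tokens: list[str]) -> float:
    """Share of tokens that are recognisable Turkish units."""
    return 100 * sum(t in turkish_vocabulary for t in tokens) / len(tokens)

def percent_pure(tokens: list[str]) -> float:
    """Share of tokens that align with a single valid morpheme."""
    return 100 * sum(t in valid_morphemes for t in tokens) / len(tokens)

tokens = tokenize("kitap lar da xyz")  # "xyz" is not a Turkish unit
print(f"%TR = {percent_tr(tokens):.1f}, %Pure = {percent_pure(tokens):.1f}")
```

A real evaluation would run each candidate tokenizer over the TR-MMLU text and compare these percentages against downstream benchmark scores, as the entry describes.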
Item: Hate Speech Detection with Machine Learning on Turkish Tweets [Türkçe Tweetler üzerinde Makine Öğrenmesi ile Nefret Söylemi Tespiti] (Osman SAĞDIÇ, 2021)
Mayda, İslam; Diri, Banu; Yıldız, Tuğba
The growth in the number and use of social media networks has brought with it the problem of increased sharing of hate-speech content. Both public authorities and the social media networks themselves are producing various policies to combat the rise of hate speech. Because the volume of user-generated data is very large, automatic systems are needed for hate speech detection. Although automatic hate speech detection has been studied in many languages in recent years, English above all, no comprehensive study has yet been presented for Turkish; this work was carried out to meet that need. 1,000 Turkish tweets containing keywords related to different target groups were collected and independently labeled by two annotators into three classes (hate speech, offensive language, neither). The resulting Turkish hate speech dataset has been shared publicly for use in future studies. Various tests were carried out on this dataset using different feature sets and different machine learning algorithms. On the three-class dataset, the highest performance, an F-measure of 79.9%, was obtained in the test using the SMO (Sequential Minimal Optimization) algorithm (see the classifier sketch after the final entry). Although the dataset needs to be enlarged to obtain better results in Turkish hate speech detection, this study is expected to serve as a pioneer for future work.

Item: Turkish synonym identification from multiple resources: monolingual corpus, mono/bilingual online dictionaries, and WordNet (TÜBİTAK Scientific & Technical Research Council Turkey, 2017)
Yıldız, Tuğba; Diri, Banu; Yıldırım, Savaş
In this study, a model is proposed to determine synonymy by incorporating several resources. The model extracts features from monolingual online dictionaries, a bilingual online dictionary, WordNet, and a monolingual Turkish corpus. Once it has built a candidate list, it determines the synonymy for a given word by means of those features. All these resources and approaches are evaluated. Taking all features into account and applying machine learning algorithms, the model achieves a good F-measure of 81.4%. The study contributes to the literature by integrating several resources and presenting the first corpus-driven synonym detection system for Turkish.
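The hate-speech entry above reports its best F-measure with the SMO algorithm, i.e. a support vector machine trained by sequential minimal optimization. Below is a minimal sketch of a comparable three-class set-up using scikit-learn, whose SVC is backed by libsvm's SMO-style solver; the tweets, labels, and TF-IDF character n-gram features are invented placeholders, not the paper's dataset or feature sets.

```python
# Three-class tweet classification with TF-IDF features and an SVM,
# evaluated by cross-validated macro F1 (the paper reports F-measure).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder tweets and labels; the real dataset has 1,000 labeled tweets.
tweets = ["ornek tweet bir", "ornek tweet iki", "ornek tweet uc",
          "baska tweet bir", "baska tweet iki", "baska tweet uc"]
labels = ["hiçbiri", "saldırgan", "nefret",   # neither / offensive / hate
          "hiçbiri", "saldırgan", "nefret"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    SVC(kernel="linear"),  # libsvm's solver is an SMO-type algorithm
)
scores = cross_val_score(model, tweets, labels, cv=2, scoring="f1_macro")
print(scores.mean())
```

With the real dataset, the same pipeline would simply swap in the 1,000 annotated tweets and whichever feature sets are being compared.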











