Sequence Labeling Tasks for Odia Language

Dalai, Tusarkanta (2024) Sequence Labeling Tasks for Odia Language. PhD thesis.

PDF (Restricted up to 29/07/2027)
Restricted to Repository staff only
6Mb

Abstract

This thesis addresses natural language processing (NLP) for low-resource languages, focusing on Odia, an Indo-Aryan language spoken predominantly in the Indian state of Odisha. Odia faces particular challenges in the development of NLP applications because linguistic resources and annotated data are scarce. The thesis addresses these limitations by constructing annotated datasets and then developing systems for fundamental Odia language tasks, specifically sequence labeling. The overarching objective is to improve NLP applications for Odia, particularly in Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and chunking.

The first objective is an in-depth investigation into the construction of annotated datasets for sequence labeling tasks. Given the scarcity of linguistic resources and annotated data for low-resource languages like Odia, the dataset was created with a careful and resourceful methodology. The resulting corpus spans various domains, text types, and linguistic nuances. It was built through extensive data collection, linguistic analysis, and annotation by domain experts. This stage is the foundation for the subsequent work, providing high-quality annotated data for training and evaluation.

The second phase develops systems for sequence labeling tasks, beginning with POS tagging. A baseline Odia POS tagger is established using Conditional Random Fields (CRF). The thesis then explores more advanced models, namely Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and transformer models, to improve the accuracy of the POS tagger. Each model is tuned to the linguistic characteristics of Odia to better capture context and semantics.

The third phase extends these methods to a phrase chunking system. Building on the annotated dataset and the insights gained from the POS tagging system, the chunking model is designed to capture syntactic structures specific to Odia. As in the POS tagging phase, CRF, CNN, LSTM, and transformer models are used to iteratively improve the results and adaptability of the chunking system.

The final phase develops a Named Entity Recognition (NER) system. Drawing on the preceding phases, the NER system identifies and categorizes named entities in Odia text, again using CRF, CNN, LSTM, and transformer models.

Throughout the thesis, each developed system is evaluated against established benchmarks and existing NLP tools for Odia. Comparative studies and detailed evaluations provide insights into the effectiveness and practical utility of the proposed solutions. The robustness and adaptability of the developed frameworks across different domains and genres support their applicability in real-world scenarios.

The annotated datasets and sequence labeling systems developed for Odia contribute to linguistic research and also hold potential for linguistic preservation, digital inclusion, and technological innovation. More broadly, the thesis aims to help low-resource languages, irrespective of their resource constraints, thrive in the digital age.

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	Low-resource language; Sequence Labeling; Part-of-speech tagging; Phrase Chunking; Named Entity Recognition; Conditional Random Field; Convolutional Neural Network; Bidirectional LSTM; Transformer; Odia language
Subjects:	Engineering and Technology > Computer and Information Science > Data Mining; Engineering and Technology > Computer and Information Science > Networks
Divisions:	Engineering and Technology > Department of Computer Science Engineering
ID Code:	10631
Deposited By:	IR Staff BPCL
Deposited On:	06 Aug 2025 12:28
Last Modified:	06 Aug 2025 12:28
Supervisor(s):	Mishra, Tapas Kumar and Sa, Pankaj Kumar
