TajPersLexon: Tajik-Persian Lexical Resource for Low-Resource NLP

The paper "TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP" was submitted to arXiv.org on May 7, 2026, by Mullosharaf K. Arabov. ¹

The research introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs. The resource is designed for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. ¹

The study compares three methodological approaches: a lightweight hybrid pipeline, neural sequence-to-sequence models, and retrieval methods. The evaluation was conducted using CPU-only benchmarks. ¹
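To make the retrieval idea concrete, here is a minimal sketch of lexicon lookup by character n-gram overlap, which runs comfortably on a CPU. It is not the paper's method: the Dice similarity, the function names, and the three-entry toy lexicon are all assumptions chosen for illustration, and the entries are common Tajik-Persian cognates rather than rows taken from TajPersLexon.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Count character n-grams of a word, padded with '#' boundary markers."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice(a, b):
    """Dice coefficient between two n-gram multisets (0.0 if both are empty)."""
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 0.0


def retrieve(query, lexicon, n=2):
    """Return the lexicon key whose n-gram profile is closest to the query."""
    q = char_ngrams(query, n)
    return max(lexicon, key=lambda src: dice(q, char_ngrams(src, n)))


# Hypothetical toy lexicon (Tajik Cyrillic -> Persian script), for illustration only.
toy_lexicon = {"китоб": "کتاب", "об": "آب", "нон": "نان"}

# A misspelled or OCR-damaged Tajik query still maps to the nearest entry.
best = retrieve("китаб", toy_lexicon)
print(best, toy_lexicon[best])
```

A lookup like this scans every entry per query, which is acceptable for a toy lexicon; a real system over tens of thousands of pairs would more likely use an index.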

The research found that neural and retrieval baselines achieved 98-99% top-1 accuracy. The hybrid model achieved 96.4% accuracy in an OCR post-correction task. All experiments used fixed random seeds for reproducibility. ¹
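For readers unfamiliar with the metric, top-1 accuracy is the share of queries whose highest-ranked candidate matches the reference exactly. The sketch below shows that computation together with a seeded train/test split; the function names, the seed value, the split ratio, and the toy data are assumptions for illustration and do not come from the paper.

```python
import random


def split_pairs(pairs, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed so the same split is produced on every run."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def top1_accuracy(ranked_predictions, references):
    """Fraction of items whose first-ranked candidate equals the reference."""
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(
        1
        for ranked, ref in zip(ranked_predictions, references)
        if ranked and ranked[0] == ref
    )
    return hits / len(references)


# Hypothetical ranked outputs for three queries; the third ranks the
# correct form second, so it does not count as a top-1 hit.
predictions = [["کتاب", "کتب"], ["آب"], ["نام", "نان"]]
references = ["کتاب", "آب", "نان"]
accuracy = top1_accuracy(predictions, references)
print(f"top-1 accuracy: {accuracy:.3f}")
```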

The paper was published in the Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP 2026). According to the paper, the dataset, code, and models will be publicly released. ¹
