• More article details
    • Presentation date: 1391/01/01
    • Publication date on TPBin: 1391/01/01
In linguistics, a text corpus is a large collection of text documents. Text corpora are used to extract the hidden laws of languages. As one application of statistical research and hidden-law extraction, language models are built for use in information retrieval. In this paper we introduce one of the largest text corpora in Islamic sciences, the Noor corpus, and then present the language model of this corpus. The Noor corpus is the result of a decade of effort by theological researchers and computer engineers at the Computer Research Center of Islamic Sciences (CRCIS). The corpus includes thousands of Islamic books classified into different categories. Most of the texts are in Arabic and Persian: there are 1.2 billion Arabic words and 616 million Persian words. The bigram language models of this corpus contain 80 million distinct bigrams in Arabic and 44 million distinct bigrams in Persian.
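To make the notion of a bigram language model and of "distinct bigrams" concrete, the following is a minimal sketch in Python. It is not the procedure used for the Noor corpus: the whitespace tokenization, the toy English sentence, and the function names `build_bigram_model` and `bigram_probability` are all assumptions made for illustration, and real Arabic or Persian text would need proper normalization and tokenization first.

```python
from collections import Counter


def build_bigram_model(tokens):
    """Count adjacent token pairs and how often each token starts a pair."""
    # Each bigram is a pair of consecutive tokens.
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Every token except the last one serves as the first word of a bigram.
    contexts = Counter(tokens[:-1])
    return bigrams, contexts


def bigram_probability(bigrams, contexts, first, second):
    """Maximum-likelihood estimate of P(second | first), without smoothing."""
    if contexts[first] == 0:
        return 0.0  # unseen context: no estimate available in this sketch
    return bigrams[(first, second)] / contexts[first]


# Toy data (hypothetical); whitespace splitting is an assumption.
tokens = "the corpus includes books and the corpus includes articles".split()
bigrams, contexts = build_bigram_model(tokens)

distinct_bigrams = len(bigrams)        # number of distinct bigram types
total_bigrams = sum(bigrams.values())  # number of bigram occurrences
p_books = bigram_probability(bigrams, contexts, "includes", "books")
```

Here `distinct_bigrams` corresponds to the kind of figure the abstract reports (80 million for Arabic, 44 million for Persian), while `bigram_probability` gives the unsmoothed conditional probability that a retrieval system might refine with smoothing.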
