A repository collection as a data corpus

Authors

  • András Holl
    Affiliation
    MTA Könyvtár és Információs Központ
  • Gábor Prószéky
    Affiliation
    Nyelvtudományi Kutatóközpont
  • Tamás Váradi
    Affiliation
    Nyelvtudományi Kutatóközpont
  • László Laki
    Affiliation
    Nyelvtudományi Kutatóközpont
https://doi.org/10.3311/tmt.13239

Abstract

The article presents the use of the REAL repository as a text corpus of modern Hungarian-language content, a project jointly implemented by the Library and Information Centre of the Hungarian Academy of Sciences (MTA KIK) and the Hungarian Research Centre for Linguistics (NYTK) within the framework of the Hungarian Academy of Sciences' National Programme "Science for the Hungarian language". The sub-programme "Digital support of the Hungarian language in the service of Hungarian science" will be completed in 2026 and is based on training and applying the neural network-based language technology tool developed by the NYTK.

REAL is one of the most important Hungarian scientific repositories, organised into eight collections holding more than 210,000 items, the vast majority of which are freely downloadable. Nearly half a million downloads are made from the repository every month. The new project will attempt to use part of its textual content as a text corpus. To do this, we need to get to know our own collection better than ever before: to assess its data content and to examine the quality of the documents and their descriptive data. We must then improve and complete the descriptive data before applying language technology tools. We hope that the project will also lead to further improvement of the repository data.

Part of the project involves mining the text corpus: we will identify basic bibliographic information in the texts and use it both to enrich the descriptions of the original documents and to supplement other databases, such as the Hungarian Science Bibliography (MTMT). We will also address the thematic classification of the texts. All of these tasks will be carried out using the open-source, free-to-use EPrints software used in REAL.

Keywords:

repository, collection, data corpus, REAL, library

Published Online

2023-06-22

How to Cite

Holl, A., Prószéky, G., Váradi, T., Laki, L. “A repository collection as a data corpus”, Scientific and Technical Information, 70(2), pp. 164–167, 2023. https://doi.org/10.3311/tmt.13239

Section

Articles