Text Cleaning with Transformer Language Models for Hungarian

Gábor Madarász; András Holl; Noémi Ligeti-Nagy; Zijian Győző Yang; Tamás Váradi

doi:10.14232/actacyb.316365

Authors

Gábor Madarász, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary, https://orcid.org/0009-0004-8572-3087
András Holl, Library and Information Centre, Hungarian Academy of Sciences, Budapest, Hungary, https://orcid.org/0000-0002-6873-3425
Noémi Ligeti-Nagy, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary, https://orcid.org/0000-0003-0851-7621
Zijian Győző Yang, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary, https://orcid.org/0000-0001-9955-860X
Tamás Váradi, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary, https://orcid.org/0000-0001-5765-3908

Keywords:

text cleaning, Transformer Language Models, Hungarian NLP, OCR correction, diacritic restoration, huT5 model

Abstract

In language technology, clean data is fundamental for training high-quality models, yet large corpora often contain substantial noise due to OCR errors, missing diacritics, and various user-generated inconsistencies. This paper presents a comprehensive text cleaning pipeline tailored for Hungarian, leveraging transformer-based language models optimized for three key tasks: OCR error correction, diacritic restoration, and filtering grammatically incorrect sentences. We introduce huT5, a Hungarian adaptation of the mT5 model, which reduces model parameters and resource demands while maintaining strong performance on Hungarian-specific text cleaning tasks. The huT5 models were fine-tuned on carefully constructed Hungarian corpora for each task and benchmarked against state-of-the-art methods, demonstrating competitive results, particularly in OCR error correction and diacritic restoration. Our pipeline offers an efficient, freely accessible solution to enhance data quality for Hungarian NLP applications, setting a new standard in resource-efficient, language-specific text cleaning.
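The three tasks named in the abstract can be pictured as stages applied to each sentence of a corpus. The following is only a minimal sketch, not the authors' implementation: the function `clean_corpus`, its parameter names, and the order of the stages are assumptions made for illustration, and the toy lambdas stand in for the fine-tuned huT5 models, which are not loaded here.

```python
from typing import Callable, Iterable, List

# Hypothetical stage types: a corrector maps a sentence to a repaired
# sentence; a filter decides whether a sentence is kept.
Corrector = Callable[[str], str]
SentenceFilter = Callable[[str], bool]


def clean_corpus(
    sentences: Iterable[str],
    fix_ocr: Corrector,
    restore_diacritics: Corrector,
    is_grammatical: SentenceFilter,
) -> List[str]:
    """Run each sentence through both correctors, then keep it only if the
    filter accepts it. The stage order is an assumption of this sketch."""
    cleaned: List[str] = []
    for sentence in sentences:
        repaired = restore_diacritics(fix_ocr(sentence))
        if is_grammatical(repaired):
            cleaned.append(repaired)
    return cleaned


# Toy stand-ins for the models (assumptions, for demonstration only).
demo = clean_corpus(
    ["a1ma", "korte", ""],
    fix_ocr=lambda s: s.replace("1", "l"),                      # fake OCR repair
    restore_diacritics=lambda s: {"korte": "körte"}.get(s, s),  # fake lookup
    is_grammatical=lambda s: bool(s.strip()),                   # drop empty lines
)
print(demo)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model library; in practice each callable would wrap a model call.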
