Text Cleaning with Transformer Language Models for Hungarian

Authors

G. Madarász, A. Holl, N. Ligeti-Nagy, Z. G. Yang, T. Váradi

DOI:

https://doi.org/10.14232/actacyb.316365

Keywords:

text cleaning, Transformer Language Models, Hungarian NLP, OCR correction, diacritic restoration, huT5 model

Abstract

In language technology, clean data is fundamental for training high-quality models, yet large corpora often contain substantial noise due to OCR errors, missing diacritics, and various user-generated inconsistencies. This paper presents a comprehensive text cleaning pipeline tailored for Hungarian, leveraging transformer-based language models optimized for three key tasks: OCR error correction, diacritic restoration, and filtering grammatically incorrect sentences. We introduce huT5, a Hungarian adaptation of the mT5 model, which reduces model parameters and resource demands while maintaining strong performance on Hungarian-specific text cleaning tasks. The huT5 models were fine-tuned on carefully constructed Hungarian corpora for each task and benchmarked against state-of-the-art methods, demonstrating competitive results, particularly in OCR error correction and diacritic restoration. Our pipeline offers an efficient, freely accessible solution to enhance data quality for Hungarian NLP applications, setting a new standard in resource-efficient, language-specific text cleaning.

Published

2026-06-02

How to Cite

Madarász, G., Holl, A., Ligeti-Nagy, N., Yang, Z. G., & Váradi, T. (2026). Text Cleaning with Transformer Language Models for Hungarian. Acta Cybernetica. https://doi.org/10.14232/actacyb.316365

Issue

Section

Special Issue of the 21st Conference on Hungarian Computational Linguistics