Constructing a Machine-Learning-Ready Dataset for Low-Resource Fonts: A Case Study on the Uyghur Mongolian Script

DOI: 10.23977/infkm.2025.060209

Author(s)

Hongchen Chen 1, Chuhan Qiao 1, Kaitai Yang 2, Bilig Bato 1

Affiliation(s)

1 College of Materials Science and Art Design, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia Autonomous Region, China
2 Makarov Marine Engineering Academy, Jiangsu Ocean University, Lianyungang, Jiangsu, China

Corresponding Author

Bilig Bato

ABSTRACT

Progress in modern AI and machine learning depends on accessible, reproducible, and standardized datasets; however, data infrastructure for low-resource scripts remains limited. This study presents an end-to-end, fully reproducible pipeline that transforms Uyghur Mongolian font files into standardized, quality-controlled glyph datasets, accompanied by open protocols and tooling. Glyphs were extracted from a total of 13 Uyghur Mongolian fonts (Unicode U+1820–U+1842), and five representative fonts were randomly selected for empirical evaluation. The proposed workflow achieved a 99.63% rendering success rate, with an average end-to-end processing time of approximately four seconds per font file. The resulting publication-grade glyph images are suitable for OCR benchmarking, digital typography, and document analysis. In addition, the pipeline supports SVG export of native font outlines, preserving internal holes and contour hierarchies, for vector-level editing and stroke-aware style-transfer workflows. All procedures, configuration files, and scripts are fully documented to ensure one-click reproducibility; both code and data will be released under open licenses to facilitate reuse and extension. The dataset has also been validated through deep-feature analysis using a pretrained VGG19 model, confirming its structural consistency and suitability for machine-learning applications. By establishing a transparent, reusable, and machine-learning-ready data foundation for an under-represented writing system, this work contributes to the digital preservation of cultural heritage and provides essential groundwork for future AI-based research on low-resource scripts.
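
For orientation, the code-point range and the success-rate metric mentioned above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `target_codepoints` and `rendering_success_rate` and the `render_ok` callback are hypothetical, and a real workflow would rasterize each glyph from a font file rather than use the stand-in check shown here.

```python
from typing import Callable, Iterable

# Range stated in the abstract: U+1820 through U+1842, inclusive.
FIRST, LAST = 0x1820, 0x1842


def target_codepoints() -> list[int]:
    """Return every code point in the stated range (hypothetical helper)."""
    return list(range(FIRST, LAST + 1))


def rendering_success_rate(
    codepoints: Iterable[int],
    render_ok: Callable[[int], bool],
) -> float:
    """Percentage of code points for which render_ok reports success.

    render_ok is an assumed callback; in practice it would attempt to
    render the glyph and check that the output image is non-empty.
    """
    cps = list(codepoints)
    if not cps:
        return 0.0
    succeeded = sum(1 for cp in cps if render_ok(cp))
    return 100.0 * succeeded / len(cps)


if __name__ == "__main__":
    cps = target_codepoints()
    # Stand-in check that pretends exactly one glyph failed to render.
    rate = rendering_success_rate(cps, lambda cp: cp != LAST)
    print(f"{len(cps)} code points, {rate:.2f}% rendered")
```

The range covers 35 code points per font; the reported 99.63% figure presumably aggregates over additional glyph variants, which this sketch does not attempt to reproduce.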

KEYWORDS

Uyghur-Mongolian Script; Low-Resource Fonts; Glyph Extraction; Image Normalization; Dataset Construction; Deep Learning; OCR Preprocessing

CITE THIS PAPER

Hongchen Chen, Chuhan Qiao, Kaitai Yang, Bilig Bato, Constructing a Machine-Learning-Ready Dataset for Low-Resource Fonts: A Case Study on the Uyghur Mongolian Script. Information and Knowledge Management (2025) Vol. 6: 59-73. DOI: http://dx.doi.org/10.23977/infkm.2025.060209.

All published work is licensed under a Creative Commons Attribution 4.0 International License.
