C&SA | Contents

Volume 60 >>> № 1 JANUARY — FEBRUARY 2024

-->

DOI 10.34229/KCA2522-9664.24.1.1

UDC 51.681.3

A.V. Anisimov¹, I.O. Zavadskyi, T.S². Chudakov³

¹ Taras Shevchenko National University of Kyiv, Kyiv, Ukraine

a.v.anisimov@knu.ua

² Taras Shevchenko National University of Kyiv, Kyiv, Ukraine

ihorzavadskyi@knu.ua

³ Taras Shevchenko National University of Kyiv, Kyiv, Ukraine

timofey.chudakov@gmail.com

NATURAL-LANGUAGE TEXT COMPRESSION USING REVERSE MULTI-DELIMITER CODES

Abstract. We study a class of binary reverse multi-delimiter (RMD) data compression codes in application to natural language text compression. The RMD-codewords start with delimiters, i.e., prefixes of the form that cannot occur in other places of the codeword. The position of the delimiter in an RMD codeword differs from its position in “direct” multi-delimiter (MD) codes, where delimiters are codeword suffixes. RMD and MD codes possess many useful properties, such as unique decodability, completeness, universality, synchronizability, asymptotic densities, and finite automaton acceptability. For RMD-codes, we construct a monotonic mapping from the set of natural numbers to the set of codewords. For original MD-codes, hitherto, this was an open question. The discovered mapping and the byte quantification of a decoding automaton allow us to develop very fast byte-aligned algorithms for decoding and direct Boyer–Moore style search in compressed files. Compared with the known byte (SCDC) and Fibonacci codes, RMD codes demonstrate the best compression ratio on natural language texts (more than four times closer to the entropy bound than that of SCDC). Computer experiments demonstrate that RMD codes can be decoded almost as fast as SCDC and times faster than Fibonacci codes. In natural language text compression, we also practiced the RMD-encoding as a preprocessing tool, which improves the performance of the known modern powerful archivers.

Keywords: compression, archiver, code, multi-delimiter.

full text

REFERENCES

Duda J., Tahboub K., Gadgil N., Delp E. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. Proc. 2015 Picture Coding Symposium (PCS 2015) (31 May – 3 June 2015, Cairns, Australia). Cairns, 2015. P. 65–69. https://doi.org/10.1109/PCS.2015.7170048 .

Huffman D.A. A method for the construction of minimum-redundancy codes. Proc. IRE. 1952. Vol. 40, N. 9. P. 1098–1101. https://doi.org/10.1109/JRPROC.1952.273898.

Elias P. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory. 1975. Vol. 21, N 2. P. 194–203. https://doi.org/10.1109/TIT.1975.1055349.

Lakshmanan K. On universal codeword sets. IEEE Transactions on Information Theory. 1981. Vol. 27, N 5. P. 659–662. https://doi.org/10.1109/TIT.1981.1056387.

Guibas L., Odlyzko A. Maximal prefix-synchronized codes. SIAM Journal on Applied Mathematics. 1978. Vol. 35, N 2. P. 401–418. URL: http://www.jstor.org/stable/2100678.

Capocelli R., de Santis A. Regular universal codeword sets (Corresp.). IEEE Transactions on Information Theory. 1986. Vol. 32, N 1. P. 129–133. https://doi.org/10.1109/TIT.1986.1057130.

Klein S.T., Ben-Nissan M.K. On the usefulness of Fibonacci compression codes. The Computer Journal. 2010. Vol. 53, N 6. P. 701–716. https://doi.org/10.1093/comjnl/bxp046.

Anisimov A., Zavadskyi I. Variable-length prefix codes with multiple delimiters. IEEE Transactions on Information Theory. 2017. Vol. 63, N 5. P. 2885–2895. https://doi.org/10.1109/TIT.2017.2674670.

Apostolico A., Fraenkel A.S. Robust transmission of unbounded strings using Fibonacci representations. IEEE Transactions on Information Theory. 1987. Vol. 33, N 2. P. 238–245. https://doi.org/10.1109/TIT.1987.1057284 .

Capocelli R. Comments and additions to “Robust transmission of unbounded strings using Fibonacci representations”. IEEE Transactions on Information Theory. 1989. Vol. 35, N 1. P. 191–193. https://doi.org/10.1109/18.42193.

de Moura E.S., Navarro G., Ziviani N., Baeza-Yates R. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems. 2000. Vol. 18, N 2. P. 113–119. URL: https://dl.acm.org/doi/pdf/10.1145/348751.348754 .

Brisaboa N., Iglesias E., Navarro G., Parama J. An efficient compression code for text databases. Proc. 25th European Conference on IR Research (ECIR 2003) (14–16 April 2003, Pisa, Italy). Pisa, 2003. LNCS. 2003. Vol. 2633. P. 468–481. https://doi.org/10.1007/3-540-36618-0_33.

Brisaboa N., A., Navarro G., Esteller M. -dense coding: an optimized compression code for natural language text databases. Proc. 10th International Symposium on String Processing and Information Retrieval (8–10 October 2003, Manaus, Brazil). Manaus, 2003. LNCS. 2003. Vol. 2857. P. 122–136. https://doi.org/10.1007/978-3-540-39984-1_10 .

Culpepper J.S., Moffat A. Enhanced byte codes with restricted prefix properties. Proc. 12th International Conference on String Processing and Information Retrieval (SPIRE 2005) (2–4 November 2005, Buenos Aires, Argentina). Buenos Aires, 2005. LNCS. 2005. Vol. 3772. P. 1–12. https://doi.org/10.1007/11575832_1 .

Zavadskyi I., Anisimov A. A family of data compression codes with multiple delimiters. Proc. Prague Stringology Conference 2016 (29–31 August 2016, Prague, Czech Republic). Prague, 2016. P. 71–84.

Zavadskyi I., Anisimov A. Reverse multi-delimiter compression codes. Proc. 2020 Data Compression Conference (DCC) (24-27 March 2020, Snowbird, UT, USA). Snowbird, 2020. P. 173–182. https://doi.org/10.1109/DCC47342.2020.00025 .

Zavadskyi I., Zavadska V. Reverse multi-delimiter codes in English and Ukrainian natural language text compression. Proc. VIII International Scientific Conference “Information Technology and Implementation" (IT&I-2021) (1–3 December 2021, Kyiv, Ukraine). Kyiv, 2021. CEUR Workshop Proceedings. 2022. Vol. 3132. P. 211–219. URL: https://ceur-ws.org/Vol-3132/Short_1.pdf .

Zavadskyi I. Binary-coded ternary number representation in natural language text compression. Proc. 2022 Data Compression Conference (DCC) (22-25 March 2022, Snowbird, UT, USA). Snowbird, 2022. P. 419–428. https://doi.org/10.1109/DCC52660.2022.00050 .

Skibinski P., Grabowski S., Deorowicz S. Revisiting dictionary-based compression. Software: Practice & Experience. 2005. Vol. 35, N 15. P. 1455–1476. https://doi.org/10.1002/spe.678.

Sun W., Zhang N., Mukherjee A. A dictionary-based multi-corpora text compression system. Proc. 2003 IEEE Data Compression Conference (DCC 2003) (25-27 March 2003, Snowbird, UT, USA). Snowbird, 2003. p. 448. https://doi.org/10.1109/DCC.2003.1194067.

Awan F., Mukherjee A. LIPT: A lossless text transform to improve compression. Proc. International Conference on Information Technology: Coding and Computing (02–04 April 2001, Las Vegas, NV, USA). Las Vegas, 2001. P. 452–460. https://doi.org/10.1109/ITCC.2001.918838.

Adiego J., Martinez-Prieto M.A., de la Fuente P. High-performance word-codeword mapping algorithm on PPM. Proc. 2009 Data Compression Conference (DCC 2009) (16-18 March 2009, Snowbird, UT, USA). Snowbird, 2009. P. 23–32. https://doi.org/10.1109/DCC.2009.40.

Cleary J., Witten I. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications. 1984. Vol. 32, Iss. 4. P. 396–402. https://doi.org/10.1109/TCOM.1984.1096090 .

PIZZA&CHILI CORPUS — English texts. URL: http://pizzachili.dcc.uchile.cl/texts/nlang/.

Zavadskyi I. Fast exact pattern matching by the means of a character bit representation. SN Computer Science. 2022. Vol. 3, Iss. 3. Article number 181. P. 1–20. https://doi.org/10.1007/s42979-022-01052-w .