DOI
10.34229/KCA2522-9664.25.2.11
UDC 004.04, 004.65
IMPLEMENTATION OF A SUFFIX TREE-BASED INDEX
FOR SEARCHING SUBSTRINGS IN A LARGE DBMS
Abstract. The article considers the advantages and disadvantages of implementing a suffix tree-based index to optimize substring search operations in a DBMS when working with large data. The theoretical characteristics of the complexity of operations for suffix trees are presented. Experimental estimates of the time complexity of substring search operations for suffix trees and database management systems such as Elasticsearch, PostgreSQL, MySQL, ClickHouse are carried out. Based on the results obtained, the hypothesis about the potential efficiency of implementing an index based on suffix trees to optimize substring search operations in a DBMS is confirmed.
Keywords: suffix tree, index, string searching, performance benchmark, Elasticsearch, PostgreSQL, MySQL, ClickHouse.
full text
REFERENCES
- 1. Bazilevych K.O., Chumachenko D.I., Hulianytskyi L.F., Meniailov I.S., Yakovlev S.V. Intelligent decision-support system for epidemiological diagnostics. I. A concept of architecture design Cybernetics and Systems Analysis. 2022. Vol. 58, N 3. P. 343–353. https://doi.org/10.1007/ .
- 2. Hlybovets A.M. Agent-based software systems for information search and analysis [in Ukrainian]. Kyiv: Publishing House "Kyiv-Mohyla Academy", 2019. 278 p.
- 3. Hlybovets A., Didenko V. Constructing generalized suffix trees on distributed parallel platforms. Cybernetics and Systems Analysis. 2023. Vol. 59, N . P. 49–60. https://doi.org/ 10.1007/ .
- 4. Comer D. Ubiquitous B-Tree. ACM Comput. Surv. 1979. Vol. 11. P. 121–137. https://doi.org/10.1145/ .
- 5. Kim M., Whang K., Lee J., Lee M. n-Gram/2L: A space and time efficient two-level n-gram inverted index structure. Proc. of the 31st international conference on Very large data bases (VLDB’05). VLDB Endowment. 2005. P. 325–336.
- 6. Weiner P. Linear pattern matching algorithms. Proc. 14th Annual IEEE Symposium on Switching and Automata Theory (SWAT 1973) (15–17 Oct. 1973, Iowa City, Iowa, USA). Iowa City, 1973. P. 1–11. https://doi.org/10.1109/ .
- 7. Knuth D.E., Morris J.H., Pratt V.R. Fast pattern matching in strings. SIAM J. Comput. 1977. Vol. 6. P. 323–350. https://doi.org/10.1137/ .
- 8. Jacobson G. Space-efficient static trees and graphs. Proc. of the 30th Annual Symposium on Foundations of Computer Science (SFC’89). IEEE Computer Society, USA, 1989. P. 549–554. https://doi.org/10.1109/ .
- 9. Clark D.R., Munro J.I. Efficient suffix trees on secondary storage. Proc. of the seventh annual ACM-SIAM symposium on Discrete algorithms (SODA ‘96). Society for Industrial and Applied Mathematics, USA. 1996. P. 383–391.
- 10. Colussi L., de Col A. A time and space efficient data structure for string searching on large texts. Information Processing Letters. 1996. Vol. 58, Iss. 5. P. 217–222. https://doi.org/ 10.1016/ .
- 11. Munro J.I., Raman V., Rao S.S. Space efficient suffix trees. Journal of Algorithms. 2001. Vol. 39. P. 205–222. https://doi.org/10.1006/ .
- 12. Manber U., Myers G. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing. 1993. Vol. 22, N 5. P. 935–948. https://doi.org/10.1137/ .
- 13. Grossi R., Vitter J.S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing. Comput. 2005. Vol. 35, N 2. P. 378–407. https://doi.org/10.1137/ .
- 14. Ferragina P., Manzini G. Opportunistic data structures with applications. Proc. 41st Annual Symposium on Foundations of Computer Science. Redondo Beach, CA, USA. 2000. P. 390–398. https://doi.org/10.1109/ .
- 15. Kempa D., Kociumaka T. Breaking the -barrier in the construction of compressed suffix arrays and suffix trees. Proc. of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics. 2023. P. 5122–5202. https://doi.org/10.1137/ .
- 16. Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995. Vol. 14, N 3. P. 249–260. https://doi.org/10.1007/ .
- 17. Ghoting A., Makarychev K. Suffix trees. Encyclopedia of Parallel Computing. Padua D. (Ed.), Boston, MA: Springer, 2011. P. 1945–1955. https://doi.org/10.1007/ .
- 18. Mansour E., Allam A., Skiadopoulos S., Kalnis P. ERA: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow. 2011. Vol. 5, N 1. P. 49–60. https://doi.org/10.14778/ .
- 19. Zhu G., Guo C., Lu L., Huang Z., Yuan C., Gu R., Huang Y. DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms. Parallel Computing. 2019. Vol. 87. P. 87–102. https://doi.org/10.1016/ .
- 20. Xu W., Chen H., Huan Y., Hu X., Nong G. Full-text search engine with suffix index for massive heterogeneous data. Information Systems. 2022. Vol. 104. https://doi.org/10.1016/ .
- 21. Experiment source codes. https://github.com/Uaman/ .
- 22. Bahgat A. Implementing a suffix tree. https://github.com/abahgat/ .
- 23. Extended implementation of suffix tree used for experiment. https://github.com/Uaman/ .
- 24. Gusfield D. Algorithms on Strings, Trees, and Sequences-Computer Science and Computational Biology. Cambridge: Cambridge University Press. 1997. 556 с. https://doi.org/ 10.1017/ .
- 25. Elasticsearch. https://www.elastic.co/ .
- 26. PostgreSQL https:/www.postgresql.org .
- 27. MySQL. https:/www.mysql.com/ .
- 28. ClickHouse. https://clickhouse.com/ .