C&SA | Contents

Volume 55 >>> № 5 SEPTEMBER — OCTOBER 2019

UDC 004.22+004.93'11

D.A. Rachkovskij¹

¹ International Research and Training Center for Information Technologies and Systems, NAS of Ukraine and MES of Ukraine, Kyiv, Ukraine

dar@infrm.kiev.ua

INDEX STRUCTURES FOR FAST SIMILARITY SEARCH FOR SYMBOLIC STRINGS

Abstract. We survey index structures for fast similarity search of objects represented by symbolic strings. Index structures for both exact and approximate search by the edit distance are considered. Mainly, we present index structures based on inverted indexing, similarity-preserving hashing, tree structures. The ideas of specific algorithms, including the recently proposed ones, are outlined.

Keywords: similarity search, edit distance, nearest neighbor, near neighbor, index structures, inverted indexing, n-grams, locality-sensitive hashing, treelike structures.

FULL TEXT

REFERENCES

Rachkovskij D.A. Real-valued vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2016. Vol. 52, N 6. P. 967–988.

Rachkovskij D.A. Binary vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2017. Vol. 53, N 1. P. 138–156.

Rachkovskij D.A. Distance-based index structures for fast similarity search. Cybernetics and Systems Analysis. 2017. Vol. 53, N 4. P. 636–658.

Rachkovskij D.A. Index structures for fast similarity search for binary vectors. Cybernetics and Systems Analysis. 2017. Vol. 53, N 5. P. 799–820.

Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. I. Cybernetics and Systems Analysis. 2018. Vol. 54, N 1. P. 152–164.

Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. II. Cybernetics and Systems Analysis. 2018. Vol. 54, N 2. P. 320–335.

Boytsov L. Indexing methods for approximate dictionary searching: Comparative analysis. J. Exp. Algorithmics. 2011. Vol. 16. P. 1.1:1–1.1:91.

Jiang Y., Li G., Feng J., Li W. String similarity joins: An experimental evaluation. Proc. VLDB Endowment. 2014. Vol. 7, N 8. P. 625–636.

Yu M., Li G., Deng D., Feng J. String similarity search and join: a survey. Frontiers of Computer Science. 2016. Vol. 10, N 3. P. 399–417.

Backurs A., Indyk P. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). Proc. STOC’15. 2015. P. 51–58.

Andoni A., Indyk P. Nearest neighbors in high-dimensional spaces. In: Handbook of Discrete and Computational Geometry. 3rd edition. Chap. 43. Boca Raton, USA: CRC Press, 2017. P. 1133–1153.

Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM. 2008. Vol. 51, N 1. P. 117–122.

Mann W., Augsten N., Bouros P. An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 2016. Vol. 9, N 9. P. 636–647.

Jia L., Zhang L.,Yu G., You J., Ding J., Li M. A survey on set similarity search and join. International Journal of Performability Engineering. 2018. Vol. 14, N 2. P. 245–258.

Manber U., Wu S. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 1994. Vol. 50, N 4. P. 191–197.

Chegrane I., Belazzougui D. Simple, compact and robust approximate string dictionary. J. Discrete Algorithms. 2014. Vol. 28. P. 49–60.

Belazzougui D. Faster and space-optimal edit distance “1” dictionary. Proc. CPM’09. 2009. P. 154–167.

Belazzougui D., Venturini R. Compressed string dictionary search with edit distance one. Algorithmica. 2016. Vol. 74, N 3. P. 1099–1122.

Chan T., Lewenstein M. Fast string dictionary lookup with one error. Proc. CPM’15. 2015. P. 114–123.

Fredman M.L., Komlos J., Szemeredi E. Storing a sparse table with O(1) worst case access time. Journal of the ACM. 1984. Vol. 31, N 3. P. 538–544.

Karp R.M., Rabin M.O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development. 1987. Vol. 31, N 2. P. 249–260.

Mor M., Fraenkel A.S. A Hash code method for detecting and correcting spelling errors. Communications of the ACM. 1982.Vol. 25, N 12. P. 935–938.

Muth R., Manber U. Approximate multiple string search. Proc. CPM’96. 1996. P. 75–86.

Broder A., Mitzenmacher M. Network applications of bloom filters: A survey. Internet Mathematics. 2004. Vol. 1, N 4. P. 485–509.

Karch D., Luxen D., Sanders P. Improved fast similarity search in dictionaries. Proc. SPIRE’10. 2010. P. 173–178.

Cole R., Gottlieb L.-A., Lewenstein M. Dictionary matching and indexing with errors and don’t cares. Proc. STOC’04. 2004. P. 91–100.

Chan H., Lam T.W., Sung W., Tam S., Wong S. Compressed indexes for approximate string matching. Algorithmica. 2010. Vol. 58, N 2. P. 263–281.

Sokolov А.M. Vector representations for efficient comparison and search for similar strings. Cybernetics and System Analysis. 2007. Vol. 43, N 4. P. 484–498.

Sokolov A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybernetics and Systems Analysis. 2008. Vol. 44, N 4. P. 493–506.

Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions. Proc. SCG’04. 2004. P. 253–262.

Andoni A., Datar M., Immorlica N., Indyk P., Mirrokni V. Locality-Sensitive Hashing using stable distributions. In: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Shakhnarovich G., Darrell T., Indyk P. (Eds.). Cambridge, MA: MIT Press, 2006. P. 61–72.

Bawa M., Condie T., Ganesan P. Lsh forest: self-tuning indexes for similarity search. Proc. WWW’05. 2005. P. 651–660.

Andoni A., Razenshteyn I., Shekel Nosatzki N. Lsh forest: Practical algorithms made theoretical. Proc. SODA’17. 2017. P. 67–78.

Zhang H., Zhang Q. EmbedJoin: efficient edit similarity joins via embeddings. Proc. KDD’17. 2017. P. 585–594.

Chakraborty D., Goldenberg E., Koucky M. Streaming algorithms for embedding and computing edit distance in the low distance regime. Proc. STOC’16. 2016. P. 712–725.

Li G., Deng D., Wang J., Feng J. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endowment. 2011. Vol. 5, N 3. P. 253–264.

Xiao C., Wang W., Lin X. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endowment. 2008. Vol 1, N 1. P. 933–944.

Wang J., Li G., Feng J. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. Proc. SIGMOD’12. 2012. P. 85–96.

Qin J., Wang W., Lu Y., Xiao C., Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. Proc. SIGMOD’11. 2011. P. 1033–1044.

Jokinen P., Ukkonen E. Two algorithms for approximate string matching in static texts. Proc. MFCS’91. 1991. P. 240–248.

Gravano L., Ipeirotis P.G., Jagadish H.V., Koudas N., Muthukrishnan S., Srivastava D. Approximate string joins in a database (almost) for free. Proc. VLDB’01. 2001. P. 491–500.

Li C., Wang B., Yang X.. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. Proc. VLDB’07. 2007. P. 303–314.

Yang X., Wang B., Li C. Cost-based variablelength-gram selection for string collections to support approximate queries efficiently. Proc. SIGMOD’08. 2008. P. 353–364.

Kahveci T., Singh A. An efficient index structure for string databases. Proc. VLDB’01. 2001. P. 351–360.

Jiang Y., Deng D., Wang J., Li G., Feng J. Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints. Proc. EDBT’13. 2013. P. 341–348.

Wei H., Yu J.X., Lu C. String similarity search: a hash-based approach. IEEE Transactions on Knowledge and Data Engineering. 2018. Vol. 30, N 1. P. 170–184.

Vernicaand R., Li C. Efficient top-k algorithms for fuzzy search in string collections. Proc. KEYS’09. 2009. P. 9–14.

Deng D., Li G., Feng J. A pivotal prefix based filtering algorithm for string similarity search. Proc. SIGMOD’14. 2014. P. 673–684.

Chaudhuri S., Ganti V., Kaushik R. A primitive operator for similarity joins in data cleaning. Proc. ICDE’06. 2006. P. 5–16.

Ukkonen E. Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science. Apostolico A., Crochemore M., Galil Z., Manber U. (Eds.). 1993. Vol 684. P. 228–242.

Bocek T., Hunt E., Hausheer D., Stiller B. Fast similarity search in peer-to-peer networks. Proc. NOMS’08. 2008. P. 240–247.

Wang W., Xiao C., Lin X., Zhang C. Efficient approximate entity extraction with edit distance constraints. Proc. SIGMOD’09. 2009. P. 759–770.

Chaudhuri S., Kaushik R. Extending autocompletion to tolerate errors. Proc. SIGMOD’09. 2009. P. 707–718.

Li G., Ji S., Li C., Feng J. Efficient fuzzy full-text type-ahead search. The VLDB Journal. 2011. Vol. 20, N 4. P. 617–640.

Feng J., Wang J., Li G. Trie-Join: a trie-based method for efficient string similarity joins. The VLDB Journal. 2012. Vol. 21, N. 4. P. 437–461.

Gouda К., Rashad M. Efficient string edit similarity join algorithm. Computing and Informatics. 2017. Vol. 36. P. 683–704.

Wu S., Manber U. Fast text searching allowing errors. Communications of the ACM. 1992. Vol. 35, N 10. P. 83–91.

Qin J., Xiao C. Pigeonring: A principle for faster thresholded similarity search. Proc. VLDB Endow. 2018. Vol. 12, N 1. P. 28–42.

Baeza-Yates R., Navarro G. Faster approximate string matching. Algorithmica. 1999. Vol. 23, N 2. P. 127–158.

Navarro G., Sutinen E., Tarhio J. Indexing text with approximate q-grams. Journal of Discrete Algorithms. 2005. Vol. 3, N 2–4. P. 157–175.

Ostrovsky R., Rabani Y. Low distortion embedding for edit distance. Journal of the ACM. 2007. Vol. 54, N 5. P. 23–36.

Kushilevitz E., Ostrovsky R., Rabani Y. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing. 2000. Vol. 30, N 2. P. 457–474.

Indyk P. Approximate nearest neighbor under edit distance via product metrics. Proc. SODA’04. 2004. P. 646–650.

Indyk P. Approximate nearest neighbor algorithms for frechet metric via product metrics. Proc. SoCG’02. 2002. P. 102–106.

Andoni A., Indyk P., Krauthgamer R. Overcoming the L1 non-embeddability barrier: Algorithms for product metrics. Proc. SODA’09. 2009. P. 865–874.

Yang Z., Yu J., Kitsuregawa M. Fast algorithms for top-k approximate string matching. Proc. AAAI’10. 2010. P. 1467–1473.

Zhang Z., Hadjieleftheriou M., Ooi B.C., Srivastava D. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. Proc. SIGMOD’10. 2010. P. 915–926.

Morton G.M. A computer oriented geodetic data base; and a new technique in file sequencing. Technical Report. 1966. Ottawa, Canada: IBM Ltd. 20 p.

Lu W., Du X., Hadjieleftheriou M., Ooi B.C. Efficiently supporting edit distance based string similarity search using B+-trees. IEEE Transactions on Knowledge and Data Engineering. 2014. Vol. 26, N 12. P. 2983–2996.

Jagadish H.V., Ooi B.C., Tan K.-L., Yu C., Zhang R. iDistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 2005. Vol. 30, N 2. Р. 364–397.

Deng D., Li G., Feng J., Li W.-S. Top-k string similarity search with edit-distance constraints. Proc. ICDE’13. 2013. P. 925–936.

Wang X., Ding X., Tung A.K.H., Zhang Z. Efficient and effective kNN sequence search with approximate n-grams. Proc. VLDB Endowment. 2013. Vol. 7, N 1. P. 1–12.

Yu M., Wang J., Li G., Zhang Y., Deng D., Feng J. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal. 2017. Vol. 26. P. 249–274.

Rachkovskij D.A. Formation of similarity-reflecting binary vectors with random binary projections. Cybernetics and Systems Analysis. 2012. Vol. 51, N 2. P. 313–323.

Rachkovsky D.A., Gritsenko V.I. Distributed representation of vector data based on random projections [in Ukrainian]. Kyiv: Interservice, 2018. 216 p.

Rachkovskij D.A., Revunova E.G. A randomized method for solving discrete ill-posed problems. Cybernetics and Systems Analysis. 2012. Vol. 48, N 4. P. 621–635.

Revunova E.G. Model selection criteria for a linear model to solve discrete ill-posed problems on the basis of singular decomposition and random projection. Cybernetics and Systems Analysis. 2016. Vol. 52, N 4. P. 647–664.

Revunova E.G. Averaging over matrices in solving discrete ill-posed problems on the basis of random projection. Proc. CSIT’17. 2017. P. 473–478.

McCauley S. Approximate similarity search under edit distance using locality-sensitive hashing. arXiv:1907.01600. 2019.

Rubinstein A. Hardness of approximate nearest neighbor search. Proc. STOC’18. 2018. P. 1260–1268.