УДК 004.22+004.93'11
ИНДЕКСНЫЕ СТРУКТУРЫ ДЛЯ БЫСТРОГО ПОИСКА СХОДНЫХ СИМВОЛЬНЫХ СТРОК
Аннотация. Дан обзор индексных структур для быстрого поиска по сходству объектов, представленных символьными строками. Рассмотрены индексные структуры как для точного, так и для приближенного поиска по расстоянию редактирования. Представлены индексные структуры на основе обратного индексирования, сохраняющего сходство хэширования, древовидных структур. Изложены идеи известных и предложенных в последнее время алгоритмов.
Ключевые слова: поиск по сходству, расстояние редактирования, ближайший сосед, ближний сосед, индексные структуры,
обратное индексирование, n-граммы, локально-чувствительное хэширование, древовидные структуры.
ПОЛНЫЙ ТЕКСТ
Рачковский Дмитрий Андреевич,
доктор техн. наук, ведущий научный сотрудник Международного научно-учебного центра информационных технологий и систем НАН Украины и МОН Украины, Киев,
dar@infrm.kiev.ua
СПИСОК ЛИТЕРАТУРЫ
- Rachkovskij D.A. Real-valued vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2016. Vol. 52, N 6. P. 967–988.
- Rachkovskij D.A. Binary vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2017. Vol. 53, N 1. P. 138–156.
- Rachkovskij D.A. Distance-based index structures for fast similarity search. Cybernetics and Systems Analysis. 2017. Vol. 53, N 4. P. 636–658.
- Rachkovskij D.A. Index structures for fast similarity search for binary vectors. Cybernetics and Systems Analysis. 2017. Vol. 53, N 5. P. 799–820.
- Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. I. Cybernetics and Systems Analysis. 2018. Vol. 54, N 1. P. 152–164.
- Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. II. Cybernetics and Systems Analysis. 2018. Vol. 54, N 2. P. 320–335.
- Boytsov L. Indexing methods for approximate dictionary searching: Comparative analysis. J. Exp. Algorithmics. 2011. Vol. 16. P. 1.1:1–1.1:91.
- Jiang Y., Li G., Feng J., Li W. String similarity joins: An experimental evaluation. Proc. VLDB Endowment. 2014. Vol. 7, N 8. P. 625–636.
- Yu M., Li G., Deng D., Feng J. String similarity search and join: a survey. Frontiers of Computer Science. 2016. Vol. 10, N 3. P. 399–417.
- Backurs A., Indyk P. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). Proc. STOC’15. 2015. P. 51–58.
- Andoni A., Indyk P. Nearest neighbors in high-dimensional spaces. In: Handbook of Discrete and Computational Geometry. 3rd edition. Chap. 43. Boca Raton, USA: CRC Press, 2017. P. 1133–1153.
- Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM. 2008. Vol. 51, N 1. P. 117–122.
- Mann W., Augsten N., Bouros P. An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 2016. Vol. 9, N 9. P. 636–647.
- Jia L., Zhang L.,Yu G., You J., Ding J., Li M. A survey on set similarity search and join. International Journal of Performability Engineering. 2018. Vol. 14, N 2. P. 245–258.
- Manber U., Wu S. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 1994. Vol. 50, N 4. P. 191–197.
- Chegrane I., Belazzougui D. Simple, compact and robust approximate string dictionary. J. Discrete Algorithms. 2014. Vol. 28. P. 49–60.
- Belazzougui D. Faster and space-optimal edit distance “1” dictionary. Proc. CPM’09. 2009. P. 154–167.
- Belazzougui D., Venturini R. Compressed string dictionary search with edit distance one. Algorithmica. 2016. Vol. 74, N 3. P. 1099–1122.
- Chan T., Lewenstein M. Fast string dictionary lookup with one error. Proc. CPM’15. 2015. P. 114–123.
- Fredman M.L., Komlos J., Szemeredi E. Storing a sparse table with O(1) worst case access time. Journal of the ACM. 1984. Vol. 31, N 3. P. 538–544.
- Karp R.M., Rabin M.O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development. 1987. Vol. 31, N 2. P. 249–260.
- Mor M., Fraenkel A.S. A Hash code method for detecting and correcting spelling errors. Communications of the ACM. 1982.Vol. 25, N 12. P. 935–938.
- Muth R., Manber U. Approximate multiple string search. Proc. CPM’96. 1996. P. 75–86.
- Broder A., Mitzenmacher M. Network applications of bloom filters: A survey. Internet Mathematics. 2004. Vol. 1, N 4. P. 485–509.
- Karch D., Luxen D., Sanders P. Improved fast similarity search in dictionaries. Proc. SPIRE’10. 2010. P. 173–178.
- Cole R., Gottlieb L.-A., Lewenstein M. Dictionary matching and indexing with errors and don’t cares. Proc. STOC’04. 2004. P. 91–100.
- Chan H., Lam T.W., Sung W., Tam S., Wong S. Compressed indexes for approximate string matching. Algorithmica. 2010. Vol. 58, N 2. P. 263–281.
- Sokolov А.M. Vector representations for efficient comparison and search for similar strings. Cybernetics and System Analysis. 2007. Vol. 43, N 4. P. 484–498.
- Sokolov A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybernetics and Systems Analysis. 2008. Vol. 44, N 4. P. 493–506.
- Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions. Proc. SCG’04. 2004. P. 253–262.
- Andoni A., Datar M., Immorlica N., Indyk P., Mirrokni V. Locality-Sensitive Hashing using stable distributions. In: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Shakhnarovich G., Darrell T., Indyk P. (Eds.). Cambridge, MA: MIT Press, 2006. P. 61–72.
- Bawa M., Condie T., Ganesan P. Lsh forest: self-tuning indexes for similarity search. Proc. WWW’05. 2005. P. 651–660.
- Andoni A., Razenshteyn I., Shekel Nosatzki N. Lsh forest: Practical algorithms made theoretical. Proc. SODA’17. 2017. P. 67–78.
- Zhang H., Zhang Q. EmbedJoin: efficient edit similarity joins via embeddings. Proc. KDD’17. 2017. P. 585–594.
- Chakraborty D., Goldenberg E., Koucky M. Streaming algorithms for embedding and computing edit distance in the low distance regime. Proc. STOC’16. 2016. P. 712–725.
- Li G., Deng D., Wang J., Feng J. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endowment. 2011. Vol. 5, N 3. P. 253–264.
- Xiao C., Wang W., Lin X. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endowment. 2008. Vol 1, N 1. P. 933–944.
- Wang J., Li G., Feng J. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. Proc. SIGMOD’12. 2012. P. 85–96.
- Qin J., Wang W., Lu Y., Xiao C., Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. Proc. SIGMOD’11. 2011. P. 1033–1044.
- Jokinen P., Ukkonen E. Two algorithms for approximate string matching in static texts. Proc. MFCS’91. 1991. P. 240–248.
- Gravano L., Ipeirotis P.G., Jagadish H.V., Koudas N., Muthukrishnan S., Srivastava D. Approximate string joins in a database (almost) for free. Proc. VLDB’01. 2001. P. 491–500.
- Li C., Wang B., Yang X.. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. Proc. VLDB’07. 2007. P. 303–314.
- Yang X., Wang B., Li C. Cost-based variablelength-gram selection for string collections to support approximate queries efficiently. Proc. SIGMOD’08. 2008. P. 353–364.
- Kahveci T., Singh A. An efficient index structure for string databases. Proc. VLDB’01. 2001. P. 351–360.
- Jiang Y., Deng D., Wang J., Li G., Feng J. Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints. Proc. EDBT’13. 2013. P. 341–348.
- Wei H., Yu J.X., Lu C. String similarity search: a hash-based approach. IEEE Transactions on Knowledge and Data Engineering. 2018. Vol. 30, N 1. P. 170–184.
- Vernicaand R., Li C. Efficient top-k algorithms for fuzzy search in string collections. Proc. KEYS’09. 2009. P. 9–14.
- Deng D., Li G., Feng J. A pivotal prefix based filtering algorithm for string similarity search. Proc. SIGMOD’14. 2014. P. 673–684.
- Chaudhuri S., Ganti V., Kaushik R. A primitive operator for similarity joins in data cleaning. Proc. ICDE’06. 2006. P. 5–16.
- Ukkonen E. Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science. Apostolico A., Crochemore M., Galil Z., Manber U. (Eds.). 1993. Vol 684. P. 228–242.
- Bocek T., Hunt E., Hausheer D., Stiller B. Fast similarity search in peer-to-peer networks. Proc. NOMS’08. 2008. P. 240–247.
- Wang W., Xiao C., Lin X., Zhang C. Efficient approximate entity extraction with edit distance constraints. Proc. SIGMOD’09. 2009. P. 759–770.
- Chaudhuri S., Kaushik R. Extending autocompletion to tolerate errors. Proc. SIGMOD’09. 2009. P. 707–718.
- Li G., Ji S., Li C., Feng J. Efficient fuzzy full-text type-ahead search. The VLDB Journal. 2011. Vol. 20, N 4. P. 617–640.
- Feng J., Wang J., Li G. Trie-Join: a trie-based method for efficient string similarity joins. The VLDB Journal. 2012. Vol. 21, N. 4. P. 437–461.
- Gouda К., Rashad M. Efficient string edit similarity join algorithm. Computing and Informatics. 2017. Vol. 36. P. 683–704.
- Wu S., Manber U. Fast text searching allowing errors. Communications of the ACM. 1992. Vol. 35, N 10. P. 83–91.
- Qin J., Xiao C. Pigeonring: A principle for faster thresholded similarity search. Proc. VLDB Endow. 2018. Vol. 12, N 1. P. 28–42.
- Baeza-Yates R., Navarro G. Faster approximate string matching. Algorithmica. 1999. Vol. 23, N 2. P. 127–158.
- Navarro G., Sutinen E., Tarhio J. Indexing text with approximate q-grams. Journal of Discrete Algorithms. 2005. Vol. 3, N 2–4. P. 157–175.
- Ostrovsky R., Rabani Y. Low distortion embedding for edit distance. Journal of the ACM. 2007. Vol. 54, N 5. P. 23–36.
- Kushilevitz E., Ostrovsky R., Rabani Y. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing. 2000. Vol. 30, N 2. P. 457–474.
- Indyk P. Approximate nearest neighbor under edit distance via product metrics. Proc. SODA’04. 2004. P. 646–650.
- Indyk P. Approximate nearest neighbor algorithms for frechet metric via product metrics. Proc. SoCG’02. 2002. P. 102–106.
- Andoni A., Indyk P., Krauthgamer R. Overcoming the L1 non-embeddability barrier: Algorithms for product metrics. Proc. SODA’09. 2009. P. 865–874.
- Yang Z., Yu J., Kitsuregawa M. Fast algorithms for top-k approximate string matching. Proc. AAAI’10. 2010. P. 1467–1473.
- Zhang Z., Hadjieleftheriou M., Ooi B.C., Srivastava D. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. Proc. SIGMOD’10. 2010. P. 915–926.
- Morton G.M. A computer oriented geodetic data base; and a new technique in file sequencing. Technical Report. 1966. Ottawa, Canada: IBM Ltd. 20 p.
- Lu W., Du X., Hadjieleftheriou M., Ooi B.C. Efficiently supporting edit distance based string similarity search using B+-trees. IEEE Transactions on Knowledge and Data Engineering. 2014. Vol. 26, N 12. P. 2983–2996.
- Jagadish H.V., Ooi B.C., Tan K.-L., Yu C., Zhang R. iDistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 2005. Vol. 30, N 2. Р. 364–397.
- Deng D., Li G., Feng J., Li W.-S. Top-k string similarity search with edit-distance constraints. Proc. ICDE’13. 2013. P. 925–936.
- Wang X., Ding X., Tung A.K.H., Zhang Z. Efficient and effective kNN sequence search with approximate n-grams. Proc. VLDB Endowment. 2013. Vol. 7, N 1. P. 1–12.
- Yu M., Wang J., Li G., Zhang Y., Deng D., Feng J. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal. 2017. Vol. 26. P. 249–274.
- Rachkovskij D.A. Formation of similarity-reflecting binary vectors with random binary projections. Cybernetics and Systems Analysis. 2012. Vol. 51, N 2. P. 313–323.
- Рачковський Д.А., Гриценко В.І. Розподілене подання векторних даних на основі випадкових проекцій. Київ: Інтерсервіс, 2018. 216 c.
- Rachkovskij D.A., Revunova E.G. A randomized method for solving discrete ill-posed problems. Cybernetics and Systems Analysis. 2012. Vol. 48, N 4. P. 621–635.
- Revunova E.G. Model selection criteria for a linear model to solve discrete ill-posed problems on the basis of singular decomposition and random projection. Cybernetics and Systems Analysis. 2016. Vol. 52, N 4. P. 647–664.
- Revunova E.G. Averaging over matrices in solving discrete ill-posed problems on the basis of random projection. Proc. CSIT’17. 2017. P. 473–478.
- McCauley S. Approximate similarity search under edit distance using locality-sensitive hashing. arXiv:1907.01600. 2019.
- Rubinstein A. Hardness of approximate nearest neighbor search. Proc. STOC’18. 2018. P. 1260–1268.