КиСА | Содержание

Том 55 >>> № 5 СЕНТЯБРЬ — ОКТЯБРЬ 2019

УДК 004.22+004.93'11

Д.А. Рачковский

ИНДЕКСНЫЕ СТРУКТУРЫ ДЛЯ БЫСТРОГО ПОИСКА СХОДНЫХ СИМВОЛЬНЫХ СТРОК

Аннотация. Дан обзор индексных структур для быстрого поиска по сходству объектов, представленных символьными строками. Рассмотрены индексные структуры как для точного, так и для приближенного поиска по расстоянию редактирования. Представлены индексные структуры на основе обратного индексирования, сохраняющего сходство хэширования, древовидных структур. Изложены идеи известных и предложенных в последнее время алгоритмов.

Ключевые слова: поиск по сходству, расстояние редактирования, ближайший сосед, ближний сосед, индексные структуры, обратное индексирование, n-граммы, локально-чувствительное хэширование, древовидные структуры.

ПОЛНЫЙ ТЕКСТ

Рачковский Дмитрий Андреевич,
доктор техн. наук, ведущий научный сотрудник Международного научно-учебного центра информационных технологий и систем НАН Украины и МОН Украины, Киев, dar@infrm.kiev.ua

СПИСОК ЛИТЕРАТУРЫ

Rachkovskij D.A. Real-valued vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2016. Vol. 52, N 6. P. 967–988.

Rachkovskij D.A. Binary vectors for fast distance and similarity estimation. Cybernetics and Systems Analysis. 2017. Vol. 53, N 1. P. 138–156.

Rachkovskij D.A. Distance-based index structures for fast similarity search. Cybernetics and Systems Analysis. 2017. Vol. 53, N 4. P. 636–658.

Rachkovskij D.A. Index structures for fast similarity search for binary vectors. Cybernetics and Systems Analysis. 2017. Vol. 53, N 5. P. 799–820.

Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. I. Cybernetics and Systems Analysis. 2018. Vol. 54, N 1. P. 152–164.

Rachkovskij D.A. Index structures for fast similarity search for real-valued vectors. II. Cybernetics and Systems Analysis. 2018. Vol. 54, N 2. P. 320–335.

Boytsov L. Indexing methods for approximate dictionary searching: Comparative analysis. J. Exp. Algorithmics. 2011. Vol. 16. P. 1.1:1–1.1:91.

Jiang Y., Li G., Feng J., Li W. String similarity joins: An experimental evaluation. Proc. VLDB Endowment. 2014. Vol. 7, N 8. P. 625–636.

Yu M., Li G., Deng D., Feng J. String similarity search and join: a survey. Frontiers of Computer Science. 2016. Vol. 10, N 3. P. 399–417.

Backurs A., Indyk P. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). Proc. STOC’15. 2015. P. 51–58.

Andoni A., Indyk P. Nearest neighbors in high-dimensional spaces. In: Handbook of Discrete and Computational Geometry. 3rd edition. Chap. 43. Boca Raton, USA: CRC Press, 2017. P. 1133–1153.

Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM. 2008. Vol. 51, N 1. P. 117–122.

Mann W., Augsten N., Bouros P. An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 2016. Vol. 9, N 9. P. 636–647.

Jia L., Zhang L.,Yu G., You J., Ding J., Li M. A survey on set similarity search and join. International Journal of Performability Engineering. 2018. Vol. 14, N 2. P. 245–258.

Manber U., Wu S. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 1994. Vol. 50, N 4. P. 191–197.

Chegrane I., Belazzougui D. Simple, compact and robust approximate string dictionary. J. Discrete Algorithms. 2014. Vol. 28. P. 49–60.

Belazzougui D. Faster and space-optimal edit distance “1” dictionary. Proc. CPM’09. 2009. P. 154–167.

Belazzougui D., Venturini R. Compressed string dictionary search with edit distance one. Algorithmica. 2016. Vol. 74, N 3. P. 1099–1122.

Chan T., Lewenstein M. Fast string dictionary lookup with one error. Proc. CPM’15. 2015. P. 114–123.

Fredman M.L., Komlos J., Szemeredi E. Storing a sparse table with O(1) worst case access time. Journal of the ACM. 1984. Vol. 31, N 3. P. 538–544.

Karp R.M., Rabin M.O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development. 1987. Vol. 31, N 2. P. 249–260.

Mor M., Fraenkel A.S. A Hash code method for detecting and correcting spelling errors. Communications of the ACM. 1982.Vol. 25, N 12. P. 935–938.

Muth R., Manber U. Approximate multiple string search. Proc. CPM’96. 1996. P. 75–86.

Broder A., Mitzenmacher M. Network applications of bloom filters: A survey. Internet Mathematics. 2004. Vol. 1, N 4. P. 485–509.

Karch D., Luxen D., Sanders P. Improved fast similarity search in dictionaries. Proc. SPIRE’10. 2010. P. 173–178.

Cole R., Gottlieb L.-A., Lewenstein M. Dictionary matching and indexing with errors and don’t cares. Proc. STOC’04. 2004. P. 91–100.

Chan H., Lam T.W., Sung W., Tam S., Wong S. Compressed indexes for approximate string matching. Algorithmica. 2010. Vol. 58, N 2. P. 263–281.

Sokolov А.M. Vector representations for efficient comparison and search for similar strings. Cybernetics and System Analysis. 2007. Vol. 43, N 4. P. 484–498.

Sokolov A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybernetics and Systems Analysis. 2008. Vol. 44, N 4. P. 493–506.

Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions. Proc. SCG’04. 2004. P. 253–262.

Andoni A., Datar M., Immorlica N., Indyk P., Mirrokni V. Locality-Sensitive Hashing using stable distributions. In: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Shakhnarovich G., Darrell T., Indyk P. (Eds.). Cambridge, MA: MIT Press, 2006. P. 61–72.

Bawa M., Condie T., Ganesan P. Lsh forest: self-tuning indexes for similarity search. Proc. WWW’05. 2005. P. 651–660.

Andoni A., Razenshteyn I., Shekel Nosatzki N. Lsh forest: Practical algorithms made theoretical. Proc. SODA’17. 2017. P. 67–78.

Zhang H., Zhang Q. EmbedJoin: efficient edit similarity joins via embeddings. Proc. KDD’17. 2017. P. 585–594.

Chakraborty D., Goldenberg E., Koucky M. Streaming algorithms for embedding and computing edit distance in the low distance regime. Proc. STOC’16. 2016. P. 712–725.

Li G., Deng D., Wang J., Feng J. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endowment. 2011. Vol. 5, N 3. P. 253–264.

Xiao C., Wang W., Lin X. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endowment. 2008. Vol 1, N 1. P. 933–944.

Wang J., Li G., Feng J. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. Proc. SIGMOD’12. 2012. P. 85–96.

Qin J., Wang W., Lu Y., Xiao C., Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. Proc. SIGMOD’11. 2011. P. 1033–1044.

Jokinen P., Ukkonen E. Two algorithms for approximate string matching in static texts. Proc. MFCS’91. 1991. P. 240–248.

Gravano L., Ipeirotis P.G., Jagadish H.V., Koudas N., Muthukrishnan S., Srivastava D. Approximate string joins in a database (almost) for free. Proc. VLDB’01. 2001. P. 491–500.

Li C., Wang B., Yang X.. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. Proc. VLDB’07. 2007. P. 303–314.

Yang X., Wang B., Li C. Cost-based variablelength-gram selection for string collections to support approximate queries efficiently. Proc. SIGMOD’08. 2008. P. 353–364.

Kahveci T., Singh A. An efficient index structure for string databases. Proc. VLDB’01. 2001. P. 351–360.

Jiang Y., Deng D., Wang J., Li G., Feng J. Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints. Proc. EDBT’13. 2013. P. 341–348.

Wei H., Yu J.X., Lu C. String similarity search: a hash-based approach. IEEE Transactions on Knowledge and Data Engineering. 2018. Vol. 30, N 1. P. 170–184.

Vernicaand R., Li C. Efficient top-k algorithms for fuzzy search in string collections. Proc. KEYS’09. 2009. P. 9–14.

Deng D., Li G., Feng J. A pivotal prefix based filtering algorithm for string similarity search. Proc. SIGMOD’14. 2014. P. 673–684.

Chaudhuri S., Ganti V., Kaushik R. A primitive operator for similarity joins in data cleaning. Proc. ICDE’06. 2006. P. 5–16.

Ukkonen E. Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science. Apostolico A., Crochemore M., Galil Z., Manber U. (Eds.). 1993. Vol 684. P. 228–242.

Bocek T., Hunt E., Hausheer D., Stiller B. Fast similarity search in peer-to-peer networks. Proc. NOMS’08. 2008. P. 240–247.

Wang W., Xiao C., Lin X., Zhang C. Efficient approximate entity extraction with edit distance constraints. Proc. SIGMOD’09. 2009. P. 759–770.

Chaudhuri S., Kaushik R. Extending autocompletion to tolerate errors. Proc. SIGMOD’09. 2009. P. 707–718.

Li G., Ji S., Li C., Feng J. Efficient fuzzy full-text type-ahead search. The VLDB Journal. 2011. Vol. 20, N 4. P. 617–640.

Feng J., Wang J., Li G. Trie-Join: a trie-based method for efficient string similarity joins. The VLDB Journal. 2012. Vol. 21, N. 4. P. 437–461.

Gouda К., Rashad M. Efficient string edit similarity join algorithm. Computing and Informatics. 2017. Vol. 36. P. 683–704.

Wu S., Manber U. Fast text searching allowing errors. Communications of the ACM. 1992. Vol. 35, N 10. P. 83–91.

Qin J., Xiao C. Pigeonring: A principle for faster thresholded similarity search. Proc. VLDB Endow. 2018. Vol. 12, N 1. P. 28–42.

Baeza-Yates R., Navarro G. Faster approximate string matching. Algorithmica. 1999. Vol. 23, N 2. P. 127–158.

Navarro G., Sutinen E., Tarhio J. Indexing text with approximate q-grams. Journal of Discrete Algorithms. 2005. Vol. 3, N 2–4. P. 157–175.

Ostrovsky R., Rabani Y. Low distortion embedding for edit distance. Journal of the ACM. 2007. Vol. 54, N 5. P. 23–36.

Kushilevitz E., Ostrovsky R., Rabani Y. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing. 2000. Vol. 30, N 2. P. 457–474.

Indyk P. Approximate nearest neighbor under edit distance via product metrics. Proc. SODA’04. 2004. P. 646–650.

Indyk P. Approximate nearest neighbor algorithms for frechet metric via product metrics. Proc. SoCG’02. 2002. P. 102–106.

Andoni A., Indyk P., Krauthgamer R. Overcoming the L1 non-embeddability barrier: Algorithms for product metrics. Proc. SODA’09. 2009. P. 865–874.

Yang Z., Yu J., Kitsuregawa M. Fast algorithms for top-k approximate string matching. Proc. AAAI’10. 2010. P. 1467–1473.

Zhang Z., Hadjieleftheriou M., Ooi B.C., Srivastava D. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. Proc. SIGMOD’10. 2010. P. 915–926.

Morton G.M. A computer oriented geodetic data base; and a new technique in file sequencing. Technical Report. 1966. Ottawa, Canada: IBM Ltd. 20 p.

Lu W., Du X., Hadjieleftheriou M., Ooi B.C. Efficiently supporting edit distance based string similarity search using B+-trees. IEEE Transactions on Knowledge and Data Engineering. 2014. Vol. 26, N 12. P. 2983–2996.

Jagadish H.V., Ooi B.C., Tan K.-L., Yu C., Zhang R. iDistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 2005. Vol. 30, N 2. Р. 364–397.

Deng D., Li G., Feng J., Li W.-S. Top-k string similarity search with edit-distance constraints. Proc. ICDE’13. 2013. P. 925–936.

Wang X., Ding X., Tung A.K.H., Zhang Z. Efficient and effective kNN sequence search with approximate n-grams. Proc. VLDB Endowment. 2013. Vol. 7, N 1. P. 1–12.

Yu M., Wang J., Li G., Zhang Y., Deng D., Feng J. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal. 2017. Vol. 26. P. 249–274.

Rachkovskij D.A. Formation of similarity-reflecting binary vectors with random binary projections. Cybernetics and Systems Analysis. 2012. Vol. 51, N 2. P. 313–323.

Рачковський Д.А., Гриценко В.І. Розподілене подання векторних даних на основі випадкових проекцій. Київ: Інтерсервіс, 2018. 216 c.

Rachkovskij D.A., Revunova E.G. A randomized method for solving discrete ill-posed problems. Cybernetics and Systems Analysis. 2012. Vol. 48, N 4. P. 621–635.

Revunova E.G. Model selection criteria for a linear model to solve discrete ill-posed problems on the basis of singular decomposition and random projection. Cybernetics and Systems Analysis. 2016. Vol. 52, N 4. P. 647–664.

Revunova E.G. Averaging over matrices in solving discrete ill-posed problems on the basis of random projection. Proc. CSIT’17. 2017. P. 473–478.

McCauley S. Approximate similarity search under edit distance using locality-sensitive hashing. arXiv:1907.01600. 2019.

Rubinstein A. Hardness of approximate nearest neighbor search. Proc. STOC’18. 2018. P. 1260–1268.