UDC 004.891
V.N. VRUBLEVSKYI,
Taras Shevchenko National University of Kyiv, Kyiv, Ukraine,
vitalii.vrublevskyi@gmail.com
O.O. MARCHENKO,
Taras Shevchenko National University of Kyiv, Kyiv, Ukraine,
omarchenko@univ.kiev.ua
DEVELOPMENT AND ANALYSIS OF A MODEL
FOR REPRESENTING SENTENCE SEMANTICS
Abstract. An overview is given of a simple and effective model for representing sentence semantics in the context of the paraphrase identification task.
The dependency tree is chosen as the main structure for representing the relations between words in a sentence.
Pre-trained word representation models are used to represent the semantics of individual words.
Based on these two key components, several features are developed that help identify paraphrases accurately.
The experiments showed that the model is effective: its results are relatively close
to those of state-of-the-art models.
Keywords: natural language processing, paraphrase identification, semantic similarity, dependency tree, word embeddings.
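
The combination of a dependency tree with pre-trained word vectors described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional vectors in `VECTORS`, the hand-written `(head, dependent)` edge lists, and the two features `bag_of_vectors` and `dependency_edges`. The features actually used by the authors may differ.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values). A real system would
# load pre-trained embeddings instead of this hand-written table.
VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.15, 0.05],
    "sat": [0.1, 0.9, 0.0],
    "rested": [0.15, 0.85, 0.1],
    "mat": [0.0, 0.2, 0.9],
    "rug": [0.05, 0.25, 0.85],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(words):
    """Component-wise mean of the word vectors (all words assumed in VECTORS)."""
    vecs = [VECTORS[w] for w in words]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def edge_similarity(edge_a, edge_b):
    """Similarity of two dependency edges: head-to-head times dependent-to-dependent."""
    (head_a, dep_a), (head_b, dep_b) = edge_a, edge_b
    return (cosine(VECTORS[head_a], VECTORS[head_b])
            * cosine(VECTORS[dep_a], VECTORS[dep_b]))


def directed_tree_similarity(edges_a, edges_b):
    """Average, over the edges of A, of the best-matching edge in B."""
    best = [max(edge_similarity(a, b) for b in edges_b) for a in edges_a]
    return sum(best) / len(best)


def features(edges_a, edges_b):
    """Two illustrative features for a sentence pair given as edge lists."""
    words_a = sorted({w for edge in edges_a for w in edge})
    words_b = sorted({w for edge in edges_b for w in edge})
    return {
        # Ignores structure: compares the averaged word vectors only.
        "bag_of_vectors": cosine(mean_vector(words_a), mean_vector(words_b)),
        # Uses structure: symmetric average of best edge-to-edge matches.
        "dependency_edges": (directed_tree_similarity(edges_a, edges_b)
                             + directed_tree_similarity(edges_b, edges_a)) / 2,
    }


# Hand-written (head, dependent) edges; a parser would normally produce them.
original = [("sat", "cat"), ("sat", "mat")]
paraphrase = [("rested", "kitten"), ("rested", "rug")]
scrambled = [("mat", "sat"), ("mat", "cat")]  # same words, different structure

print(features(original, paraphrase))
print(features(original, scrambled))
```

In this toy setup the scrambled pair contains exactly the same words as the original, so the structure-blind feature cannot separate them, while the edge-based feature drops sharply. In a full pipeline such feature values would typically be passed to a supervised classifier.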