C&SA, Volume 58, № 1, January–February 2022


UDC 004.891

V. Vrublevskyi¹, O. Marchenko²

¹ Taras Shevchenko National University of Kyiv,
Kyiv, Ukraine

vitalii.vrublevskyi@gmail.com

² Taras Shevchenko National University of Kyiv,
Kyiv, Ukraine

omarchenko@univ.kiev.ua

DEVELOPMENT AND ANALYSIS OF THE MODEL FOR SENTENCE
SEMANTIC REPRESENTATION

Abstract. The authors describe a simple and efficient model of sentence semantic representation for the paraphrase identification problem. A dependency tree serves as the main structure for representing the relationships between words in a sentence, and pre-trained general-purpose word embeddings represent word semantics. Based on these two key components, several features that help to identify paraphrases are designed. Experiments confirm the efficiency of the model: its results are close to those of state-of-the-art models.

Keywords: natural language processing, paraphrase identification, semantic similarity, dependency tree, word embeddings.
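The abstract names the two components (dependency trees and word embeddings) but does not list the features themselves. The sketch below is therefore only a hypothetical illustration of how such features could be combined, not the authors' method. The toy vectors in `EMB`, the hand-written `(head, dependent)` edges, and the functions `edge_similarity` and `paraphrase_features` are all assumptions made for this example; in practice the edges would come from a dependency parser and the vectors from a pre-trained embedding model.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pre-trained word embeddings.
EMB = {
    "cat": (0.9, 0.1, 0.0),
    "kitten": (0.8, 0.2, 0.0),
    "sat": (0.0, 0.9, 0.1),
    "rested": (0.1, 0.8, 0.1),
    "mat": (0.0, 0.1, 0.9),
    "rug": (0.1, 0.0, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(words):
    """Component-wise average of the embeddings of the known words."""
    vecs = [EMB[w] for w in words if w in EMB]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def edge_similarity(edge_a, edge_b):
    """Assumed edge score: mean of head-head and dependent-dependent cosine.

    Both edges are (head, dependent) pairs whose words must be in EMB.
    """
    (head_a, dep_a), (head_b, dep_b) = edge_a, edge_b
    return 0.5 * (cosine(EMB[head_a], EMB[head_b]) + cosine(EMB[dep_a], EMB[dep_b]))


def paraphrase_features(edges_a, edges_b):
    """Two illustrative features for a sentence pair given as dependency edges."""
    words_a = {w for edge in edges_a for w in edge}
    words_b = {w for edge in edges_b for w in edge}
    # Feature 1: cosine between the averaged word vectors (ignores structure).
    bag_cosine = cosine(mean_vector(words_a), mean_vector(words_b))
    # Feature 2: each edge of A is matched to its most similar edge of B.
    best = [max(edge_similarity(ea, eb) for eb in edges_b) for ea in edges_a]
    edge_match = sum(best) / len(best)
    return {"bag_cosine": bag_cosine, "edge_match": edge_match}


# "cat sat (on the) mat" vs. "kitten rested (on the) rug", function words omitted.
sentence_a = [("sat", "cat"), ("sat", "mat")]
sentence_b = [("rested", "kitten"), ("rested", "rug")]
print(paraphrase_features(sentence_a, sentence_b))
```

Feature values of this kind could then be passed to any standard classifier that labels a sentence pair as paraphrase or non-paraphrase. The greedy best-match over edges is a deliberate simplification chosen to keep the sketch short.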


