КіСА, Volume 60, No. 5, September-October 2024

UDC 004.912

R.B. AZIMOV
Institute of Control Systems of the Ministry of Science and Education of the Republic of Azerbaijan, Baku, Azerbaijan, rustemazimov1999@gmail.com

COMPARATIVE ANALYSIS OF THE USE OF DIFFERENT TEXT FEATURES,
MODELS, AND METHODS FOR TEXT AUTHOR RECOGNITION

Abstract. In a computer system for recognizing text authorship, various methods and models are applied to identify authors, using texts by Azerbaijani writers as examples. The effectiveness of different text features and of the proposed feature selection procedures is compared. Computer experiments were carried out on works by several well-known Azerbaijani writers, written in the Azerbaijani language. The results obtained are analyzed.

Keywords: author recognition, author identification, authorship recognition of literary works, text feature engineering.


