C&SA, Volume 60, № 5, September–October 2024

UDC 004.912

R.B. Azimov¹

¹ Institute of Control Systems of the Ministry of Sciences and Education of Azerbaijan, Baku, Azerbaijan

rustemazimov1999@gmail.com

COMPARATIVE ANALYSIS OF USING DIFFERENT TEXT FEATURES,
MODELS, AND METHODS IN TEXT AUTHOR RECOGNITION

Abstract. Various methods and models are applied in a computer system for text author recognition, using works by Azerbaijani writers as an example. The efficiency of different text features and of the proposed feature selection procedures is compared. Computer experiments are conducted on works by several famous Azerbaijani writers written in the Azerbaijani language, and the results are analyzed.

Keywords: author recognition, author identification, authorship recognition of literary works, text feature engineering.


