DOI: 10.34229/KCA2522-9664.26.3.4
UDC 004.8
D. Yuvzhenko
National Technical University “Igor Sikorsky Kyiv Polytechnic Institute,”
Kyiv, Ukraine,
d.yuvzhenko@kpi.ua
S. Stirenko
National Technical University “Igor Sikorsky Kyiv Polytechnic Institute,”
Kyiv, Ukraine,
s.stirenko@kpi.ua
A COMPARATIVE STUDY OF CHUNKING STRATEGIES
FOR RETRIEVAL-AUGMENTED GENERATION
Abstract. An empirical comparative study of four document segmentation strategies is presented: fixed windows of 256, 512, and 1024 tokens, and semantic segmentation based on a large language model. Experiments were conducted on long, semantically coherent texts from the SQuALITY dataset. The evaluation was performed on 225 question–answer pairs using Precision@5 and Recall@5 (top-5 retrieval metrics), answer quality metrics (Exact Match and token-level F1), and average retrieval latency. The results reveal a clear trade-off between retrieval precision and recall driven by granularity: smaller fragments provide higher precision, whereas larger fragments substantially increase recall and improve answer quality in terms of F1. Within this experimental setting, semantic segmentation demonstrates competitive results but does not show a consistent advantage over fixed windows of 512–1024 tokens. A reduction in retrieval latency is observed when using larger segments, which can be explained by lower vector-index density. A reproducible evaluation procedure and practical recommendations for selecting a segmentation strategy for efficient RAG systems are provided.
Keywords: Retrieval-Augmented Generation, RAG, chunking, semantic search, long-document question answering, chunking strategies.
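The fixed-window segmentation and the evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the tokenizer, whether windows overlap, how chunk relevance is judged, or how answers are normalized, so the sketch assumes non-overlapping windows over a pre-tokenized list, relevance given as a set of chunk identifiers, and SQuAD-style answer normalization. All function names are hypothetical.

```python
from collections import Counter
import re
import string


def fixed_window_chunks(tokens, size):
    """Split a token list into consecutive, non-overlapping windows of `size` tokens.

    Assumption: no overlap between windows; the last window may be shorter.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Precision@k and Recall@k over chunk identifiers.

    Assumption: `retrieved_ids` is ranked and free of duplicates, and
    `relevant_ids` is the set of chunks judged relevant to the question.
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for cid in top_k if cid in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def normalize(text):
    """SQuAD-style normalization (assumed): lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred)
    r = overlap / len(ref)
    return 2 * p * r / (p + r)
```

Under these assumptions the trade-off described in the abstract is easy to see. With k fixed at 5, a larger window covers more of the source text per retrieved chunk, so the retrieved set is more likely to contain the relevant passages (higher recall). A smaller window returns less surrounding text that is unrelated to the question, which favors precision.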