Text Retrieval Based on Syntactic Information
Zhang Yongwei1,2, Liu Ting1, Liu Chang3, Wu Bingxin3, Yu Jingsong3
1 School of Chinese Language and Literature, University of Chinese Academy of Social Sciences, Beijing 102488, China
2 Corpus and Computational Linguistics Center, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China
3 School of Software and Microelectronics, Peking University, Beijing 100871, China

Abstract [Objective] This study explores an efficient method for retrieving syntactic information from large text corpora. [Methods] We first built linearized indices for syntactic information based on its features. These indices then supply the matching information used during retrieval, which improves retrieval efficiency. [Results] We evaluated the method on the People’s Daily Corpus of 28.51 million sentences. The average processing time over 26 queries was 802.6 milliseconds, which meets the requirements of retrieval systems for large corpora. [Limitations] The method still needs to be evaluated with a larger number of queries. [Conclusions] The method can quickly retrieve lexical, dependency-syntactic, and constituency-syntactic information from large text corpora.
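
The abstract does not specify how the linearized indices are laid out, so the following is only a minimal sketch of the general idea under stated assumptions: each dependency arc is flattened into a string key, and an inverted index maps keys to sentence ids. The key format (`w:` and `d:` prefixes), the toy sentence representation, and the function names `linearize`, `build_index`, and `search` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Assumed toy format: a sentence is a list of (word, head_position, relation)
# triples, where head_position is 1-based and 0 marks the root.

def linearize(sentence):
    """Flatten one dependency tree into a flat list of string keys."""
    keys = []
    for word, head, rel in sentence:
        keys.append(f"w:{word}")  # lexical key
        head_word = "ROOT" if head == 0 else sentence[head - 1][0]
        keys.append(f"d:{rel}:{head_word}>{word}")  # dependency-arc key
    return keys

def build_index(corpus):
    """Build an inverted index from linearized keys to sentence ids."""
    index = defaultdict(set)
    for sid, sentence in enumerate(corpus):
        for key in linearize(sentence):
            index[key].add(sid)
    return index

def search(index, keys):
    """Return sorted ids of sentences that contain every query key."""
    postings = [index.get(k, set()) for k in keys]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

corpus = [
    [("dogs", 2, "nsubj"), ("chase", 0, "root"), ("cats", 2, "obj")],
    [("cats", 2, "nsubj"), ("chase", 0, "root"), ("mice", 2, "obj")],
]
index = build_index(corpus)
print(search(index, ["w:cats", "w:chase"]))   # lexical query
print(search(index, ["d:obj:chase>cats"]))    # syntactic query
```

In this sketch a syntactic query becomes a set intersection over posting lists, so the parse trees themselves are never traversed at query time. Whether the paper's indices work this way is an assumption.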

Received: 02 February 2022
Published: 13 January 2023
Fund: "The New Generation of Artificial Intelligence", Major Project of National Science and Technology Innovation 2030 (2020AAA0109703); Commissioned Research Project for Year 2020, Affiliated to the 13th Five-Year Plan for Science and Research of National Language Commission (WT135-69); Major Project of the National Social Science Fund of China (21&ZD294)
Corresponding Author: Yu Jingsong, E-mail: yjs@ss.pku.edu.cn