Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 25-37    DOI: 10.11925/infotech.2096-3467.2022.0093
Text Retrieval Based on Syntactic Information
Zhang Yongwei 1,2, Liu Ting 1, Liu Chang 3, Wu Bingxin 3, Yu Jingsong 3
1 School of Chinese Language and Literature, University of Chinese Academy of Social Sciences, Beijing 102488, China
2 Corpus and Computational Linguistics Center, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China
3 School of Software and Microelectronics, Peking University, Beijing 100871, China
Abstract  

[Objective] This study explores an efficient method for retrieving syntactic information from large text corpora. [Methods] We first built linearized indices for syntactic information based on its features, then used the matching information stored in these indices to improve retrieval efficiency. [Results] We evaluated the method on the People’s Daily Corpus of 28.51 million sentences. The average processing time over 26 queries was 802.6 milliseconds, which meets the requirements of retrieval systems for large corpora. [Limitations] The method still needs to be evaluated with a larger number of queries. [Conclusions] The proposed method can quickly retrieve lexical, dependency syntactic, and constituency syntactic information from large text corpora.

Key words: Dependency Syntax; Constituency Syntax; Corpus; Index; Retrieval
Received: 02 February 2022      Published: 13 January 2023
ZTFLH: TP393; G250
Fund: “The New Generation of Artificial Intelligence”, Major Project of National Science and Technology Innovation 2030 (2020AAA0109703); Commissioned Research Project for Year 2020, Affiliated to the 13th Five-Year Plan for Science and Research of National Language Commission (WT135-69); Major Project of the National Social Science Fund of China (21&ZD294)
Corresponding Authors: Yu Jingsong     E-mail: yjs@ss.pku.edu.cn

Cite this article:

Zhang Yongwei,Liu Ting,Liu Chang,Wu Bingxin,Yu Jingsong. Text Retrieval Based on Syntactic Information. Data Analysis and Knowledge Discovery, 2022, 6(11): 25-37.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0093     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I11/25

Schematic Diagram of Lexical Analysis, Dependency Parsing and Constituent Parsing
Multi-Field Inverted Index Structure
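The figure itself is not reproduced in this text, so the following is only a minimal sketch of what a multi-field inverted index can look like, not the authors' implementation. It assumes one posting set per field value, keyed by (sentence id, token position); the toy sentences, the tags, and the names `build_index` and `match` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus: each token is (word, pos).
SENTENCES = [
    [("包装", "NN"), ("很", "AD"), ("精美", "VA")],
    [("工人", "NN"), ("包装", "VV"), ("产品", "NN")],
]

def build_index(sentences):
    """Map each field name to {value: set of (sentence_id, position)}."""
    index = {"word": defaultdict(set), "pos": defaultdict(set)}
    for sent_id, tokens in enumerate(sentences):
        for position, (word, pos) in enumerate(tokens, start=1):
            index["word"][word].add((sent_id, position))
            index["pos"][pos].add((sent_id, position))
    return index

def match(index, **conditions):
    """Return the sorted token addresses that satisfy every field=value condition."""
    postings = [index[field][value] for field, value in conditions.items()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(SENTENCES)
print(match(index, word="包装"))            # every occurrence of the word form
print(match(index, word="包装", pos="NN"))  # only the noun reading
```

Because every field shares the same token address, a condition on several fields reduces to intersecting posting sets, which is the property the lexical queries in the evaluation (for example, the noun “包装”) rely on.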
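Field groups: position field (index); lexical information fields (word, pos); dependency syntactic information fields (gov_index, gov_pos, gov_word, rel).

| index (position) | word (word form) | pos (part of speech) | gov_index (governor position) | gov_pos (governor POS) | gov_word (governor word form) | rel (dependency relation) |
|---|---|---|---|---|---|---|
| 1 | | PN | 2 | VV | | nsubj |
| 2 | | VV | 0 | | | root |
| 3 | | AS | 2 | VV | | aux:asp |
| 4 | | CD | 6 | NN | 毛衣 | nummod |
| 5 | | M | 4 | CD | | mark:clf |
| 6 | 毛衣 | NN | 2 | VV | | dobj |
| 7 | | PU | 2 | VV | | punct |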
Linear Fields of Syntactic Dependency Information of Example Sentence
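As a rough illustration of how such linear fields could be derived from a dependency parse, the sketch below copies each governor's word form and part of speech onto the dependent's own row. The POS tags, governor positions, and relations are taken from the table; the word forms other than “毛衣” do not survive in this text and are assumed here purely for illustration, and the function name `linearize` is hypothetical.

```python
# Parsed tokens as (word, pos, governor position, relation); governor 0 is the root.
# ASSUMPTION: all word forms except "毛衣" are placeholders chosen for the example.
PARSE = [
    ("我", "PN", 2, "nsubj"),
    ("买", "VV", 0, "root"),
    ("了", "AS", 2, "aux:asp"),
    ("一", "CD", 6, "nummod"),
    ("件", "M", 4, "mark:clf"),
    ("毛衣", "NN", 2, "dobj"),
    ("。", "PU", 2, "punct"),
]

def linearize(parse):
    """Flatten a dependency parse into one row of fields per token."""
    rows = []
    for index, (word, pos, gov_index, rel) in enumerate(parse, start=1):
        if gov_index == 0:
            gov_word, gov_pos = "", ""  # the root has no governor
        else:
            gov_word, gov_pos = parse[gov_index - 1][0], parse[gov_index - 1][1]
        rows.append({
            "index": index, "word": word, "pos": pos,
            "gov_index": gov_index, "gov_pos": gov_pos,
            "gov_word": gov_word, "rel": rel,
        })
    return rows

rows = linearize(PARSE)
for row in rows:
    print(row)
```

Storing the governor's attributes redundantly on each dependent means a head-dependent condition can be checked on a single row, without walking the tree at query time.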
Match a Dependent
Match a Head
Match a Compound Condition
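The three matching figures are not available in this text. As a hedged sketch of what the three cases could amount to over linear dependency fields, the code below filters rows by the fields of each token. The rows reuse the tags and relations of the example sentence; the word forms other than “毛衣”, and the names `Token`, `match_dependents`, `match_heads`, and `match_compound`, are assumptions made for illustration.

```python
from collections import namedtuple

Token = namedtuple("Token", "index word pos gov_index gov_pos gov_word rel")

# ASSUMPTION: word forms other than "毛衣" are illustrative placeholders.
ROWS = [
    Token(1, "我", "PN", 2, "VV", "买", "nsubj"),
    Token(2, "买", "VV", 0, "", "", "root"),
    Token(3, "了", "AS", 2, "VV", "买", "aux:asp"),
    Token(4, "一", "CD", 6, "NN", "毛衣", "nummod"),
    Token(5, "件", "M", 4, "CD", "一", "mark:clf"),
    Token(6, "毛衣", "NN", 2, "VV", "买", "dobj"),
    Token(7, "。", "PU", 2, "VV", "买", "punct"),
]

def match_dependents(rows, gov_word, rel):
    """Dependents of gov_word under relation rel (e.g. direct objects of a verb)."""
    return [t.word for t in rows if t.gov_word == gov_word and t.rel == rel]

def match_heads(rows, word, rel):
    """Governors that take `word` as a dependent under relation rel."""
    return [t.gov_word for t in rows if t.word == word and t.rel == rel]

def match_compound(rows, gov_word, rels):
    """Positions of gov_word that have a dependent for every relation in rels."""
    found = {}
    for t in rows:
        if t.gov_word == gov_word and t.rel in rels:
            found.setdefault(t.gov_index, set()).add(t.rel)
    return sorted(i for i, seen in found.items() if seen == set(rels))

print(match_dependents(ROWS, "买", "dobj"))
print(match_heads(ROWS, "毛衣", "dobj"))
print(match_compound(ROWS, "买", ["nsubj", "dobj"]))
```

The compound case groups matching dependents by `gov_index`, so two conditions are only combined when they attach to the same governor token.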
Constituency Parse Tree with Position Information of Example Sentence
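Field groups: position field (index); lexical information fields (word, pos); constituency syntactic information fields (phr_b, phr_e, phr_o, phrb, phre, phro).

| index (position) | word (word form) | pos (part of speech) | phr_b (phrase start, with span) | phr_e (phrase end, with span) | phr_o (standalone phrase, with span) | phrb (phrase start, label only) | phre (phrase end, label only) | phro (standalone phrase, label only) |
|---|---|---|---|---|---|---|---|---|
| 1 | | PN | IP_1_7 | | NP_1_1 | IP | | NP |
| 2 | | VV | VP_2_6 | | | VP | | |
| 3 | | AS | | | | | | |
| 4 | | CD | QP_4_5\|NP_4_6 | | | QP, NP | | |
| 5 | | M | | QP_4_5 | CLP_5_5 | | QP | CLP |
| 6 | 毛衣 | NN | | NP_4_6\|VP_2_6 | NP_6_6 | | NP, VP | NP |
| 7 | | PU | | IP_1_7 | | | IP | |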
Linear Fields of Constituency Syntactic Information of Example Sentence
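One way to read the table is that each constituent is written as label_start_end and attached to its first token (phr_b), its last token (phr_e), or, when it spans a single token, to that token alone (phr_o). The sketch below derives the three position-bearing fields from the spans of the example tree under that reading; the function name and the shortest-span-first ordering are assumptions inferred from the table values, not a description of the authors' code.

```python
# Constituents of the example sentence as (label, start, end), 1-based and inclusive.
SPANS = [
    ("IP", 1, 7), ("NP", 1, 1), ("VP", 2, 6), ("NP", 4, 6),
    ("QP", 4, 5), ("CLP", 5, 5), ("NP", 6, 6),
]

def constituency_fields(spans, length):
    """Attach each constituent to its start token, end token, or single token."""
    fields = {i: {"phr_b": [], "phr_e": [], "phr_o": []} for i in range(1, length + 1)}
    # ASSUMPTION: shorter spans are listed first, which reproduces the order in the table.
    for label, begin, end in sorted(spans, key=lambda s: s[2] - s[1]):
        tag = f"{label}_{begin}_{end}"
        if begin == end:
            fields[begin]["phr_o"].append(tag)
        else:
            fields[begin]["phr_b"].append(tag)
            fields[end]["phr_e"].append(tag)
    # Multiple values in one field are joined with "|", as in the table.
    return {i: {name: "|".join(tags) for name, tags in f.items()} for i, f in fields.items()}

fields = constituency_fields(SPANS, 7)
for position, values in fields.items():
    print(position, values)
```

The label-only fields (phrb, phre, phro) would follow by dropping the `_start_end` suffix from each tag.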
Match a Noun Phrase Consisting of Multiple Words
Match the “有VP的NP” Pattern
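The matching figures are not reproduced here. For the multi-word noun phrase case, a minimal sketch is possible under the assumption that a phr_b value such as `NP_4_6` encodes the phrase label and its start and end positions: the full span can then be read off the start token alone. The word forms other than “毛衣” and the function name `multiword_phrases` are hypothetical, and the “有VP的NP” pattern is not attempted.

```python
# (word, phr_b) per token of the example sentence.
# ASSUMPTION: word forms other than "毛衣" are illustrative placeholders.
TOKENS = [
    ("我", "IP_1_7"),
    ("买", "VP_2_6"),
    ("了", ""),
    ("一", "QP_4_5|NP_4_6"),
    ("件", ""),
    ("毛衣", ""),
    ("。", ""),
]

def multiword_phrases(tokens, label, last_word=None):
    """Return multi-word phrases with the given label, optionally ending in last_word."""
    hits = []
    for _word, phr_b in tokens:
        for tag in filter(None, phr_b.split("|")):
            tag_label, begin, end = tag.rsplit("_", 2)
            words = [w for w, _ in tokens[int(begin) - 1:int(end)]]
            if tag_label == label and last_word in (None, words[-1]):
                hits.append("".join(words))
    return hits

print(multiword_phrases(TOKENS, "NP"))
print(multiword_phrases(TOKENS, "NP", last_word="毛衣"))
print(multiword_phrases(TOKENS, "VP"))
```

Single-word phrases never appear in phr_b under this encoding, so restricting the search to that field is what limits the result to phrases of two or more words.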
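| No. | Query | Results | System of Ma Luyao et al. [25], time (ms) | Odinson [16], time (ms) | Dep_Search [14], time (ms) | This system, time (ms) |
|---|---|---|---|---|---|---|
| 1 | “包装” | 19 905 | | 187.5 | 1 463.1 | 21.6 |
| 2 | The noun “包装” | 17 980 | | 350.0 | 1 567.1 | 120.5 |
| 3 | “锻炼身体” | 1 769 | | 279.0 | | 27.4 |
| 4 | “对” + any word(s) + “的侵略” | 11 412 | 45 450.0 | 1 997.5 | | 258.3 |
| 5 | “用” + any word(s) + “吃饭” | 664 | | 547.5 | | 112.4 |
| 6 | “打击” + any word(s) + “敌人” | 7 243 | | 998.0 | | 135.9 |
| Average | | 9 828.8 | 45 450.0 | 726.6 | 1 515.1 | 112.7 |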
Average Time of Lexical Information Retrieval
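Not every system reports a time for every lexical query, so each average appears to be taken over the available measurements only. The check below recomputes the averages on that assumption. The assignment of times to systems is inferred from the flattened table and is consistent with the reported averages; it should still be read as a reconstruction rather than as the authors' data file.

```python
from statistics import mean

# Times in ms for queries 1..6; None marks a cell with no reported measurement.
TIMES = {
    "Ma Luyao et al. [25]": [None, None, None, 45450.0, None, None],
    "Odinson [16]": [187.5, 350.0, 279.0, 1997.5, 547.5, 998.0],
    "Dep_Search [14]": [1463.1, 1567.1, None, None, None, None],
    "This system": [21.6, 120.5, 27.4, 258.3, 112.4, 135.9],
}

def column_average(values):
    """Average over the measured cells only, rounded to one decimal place."""
    measured = [v for v in values if v is not None]
    return round(mean(measured), 1)

averages = {system: column_average(values) for system, values in TIMES.items()}
for system, value in averages.items():
    print(f"{system}: {value} ms")
```

Under this reading the averages are not directly comparable across systems: the figure for the system of Ma Luyao et al. rests on a single query and the Dep_Search figure on two.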
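| No. | Query | Results | Odinson, time (ms) | Dep_Search, time (ms) | This system, time (ms) |
|---|---|---|---|---|---|
| 1 | Direct object of the verb “买” | 77 901 | 1 530.5 | 3 518.7 | 246.1 |
| 2 | Adjectives modifying “企业” | 257 694 | 2 147.5 | 6 418.9 | 176.6 |
| 3 | Direct object of “有” | 2 407 046 | 9 030.5 | 25 449.8 | 988.5 |
| 4 | Verbs taking “衣服” as direct object | 12 783 | 955.0 | 2 388.3 | 74.2 |
| 5 | Verbs taking “问题” as direct object | 681 439 | 6 175.5 | 17 251.5 | 412.4 |
| 6 | Words modified by the adjective “新” | 1 442 072 | 7 361.5 | 20 623.0 | 345.0 |
| 7 | Nominal subject + “买” + direct object | 21 875 | 1 319.0 | 3 357.8 | 290.2 |
| 8 | “中国” modified by “新”, as nominal subject | 24 638 | 1 472.5 | 7 281.2 | 2 223.8 |
| 9 | Nominal subject + “有” + direct object | 700 045 | 8 115.0 | 37 130.8 | 4 928.5 |
| Average | | 625 054.8 | 4 234.1 | 13 713.3 | 1 076.1 |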
Average Time of Dependency Syntactic Information Retrieval
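| No. | Query | Results | Odinson, time (ms) | This system, time (ms) |
|---|---|---|---|---|
| 1 | Two-word noun phrases ending with “灯” | 5 873 | 672.0 | 147.3 |
| 2 | Noun phrases ending with “灯” | 11 960 | 820.5 | 482.5 |
| 3 | Two-word verb phrases starting with “打击” | 14 857 | 963.5 | 464.0 |
| 4 | Verb phrases starting with “打击” | 57 976 | 771.0 | 2 759.2 |
| 5 | “打击” + single-word noun phrase | 19 625 | 1 178.5 | 265.8 |
| 6 | “打击” + multi-word noun phrase | 33 007 | 969.0 | 1 990.1 |
| 7 | “打击” + noun phrase | 52 612 | 1 345.0 | 2 417.1 |
| 8 | Single-word adjective phrase + “的” + “记忆” | 562 | 529.0 | 111.0 |
| 9 | Multi-word adjective phrase + “的” + “记忆” | 38 | 296.0 | 176.6 |
| 10 | Adjective phrase + “的” + “记忆” | 600 | 685.0 | 197.1 |
| 11 | “以” + any word(s) + “为主的” + noun phrase | 11 623 | 5 840.0 | 1 496.0 |
| Average | | 18 975.7 | 1 279.0 | 955.2 |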
Average Time of Constituency Syntactic Information Retrieval
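The abstract reports an average of 802.6 milliseconds over 26 queries. As a quick cross-check, the short calculation below weights the three per-category averages reported for the authors' system by the number of queries in each table (6 lexical, 9 dependency, 11 constituency, as counted from the tables). The variable names are illustrative.

```python
# (number of queries, reported average time in ms for the authors' system)
GROUPS = {
    "lexical": (6, 112.7),
    "dependency": (9, 1076.1),
    "constituency": (11, 955.2),
}

total_queries = sum(n for n, _ in GROUPS.values())
overall = sum(n * avg for n, avg in GROUPS.values()) / total_queries

print(total_queries, round(overall, 1))
```

The result matches the figure in the abstract, which suggests the overall average is the per-query mean across all three tables rather than a mean of the three category averages.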
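| Type | Query | Results |
|---|---|---|
| Lexical information | The noun “包装” | 17 980 |
| Lexical information | “锻炼身体” | 1 769 |
| Lexical information | “对” + any word(s) + “的侵略” | 11 412 |
| Dependency syntactic information | Direct object of the verb “买” | 77 901 |
| Dependency syntactic information | Verbs taking “衣服” as direct object | 12 783 |
| Dependency syntactic information | Nominal subject + “买” + direct object | 21 875 |
| Constituency syntactic information | Noun phrases ending with “起源地” | 71 |
| Constituency syntactic information | Adjective phrase + “的” + “记忆” | 600 |
| Constituency syntactic information | “以” + any word(s) + “为主的” + noun phrase | 11 623 |
| Total | | 156 014 |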
Results of Performance Retrieval
[1] Huang Shuiqing, Wang Dongbo. Review of Corpus Research in China[J]. Journal of Information Resources Management, 2021, 11(3): 4-17. (in Chinese)
[2] Che W X, Feng Y L, Qin L B, et al. N-LTP: An Open-Source Neural Language Technology Platform for Chinese[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing:System Demonstrations. 2021: 42-49.
[3] Straka M, Straková J. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe[C]// Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. 2017: 88-99.
[4] Manning C, Surdeanu M, Bauer J, et al. The Stanford CoreNLP Natural Language Processing Toolkit[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics:System Demonstrations. 2014: 55-60.
[5] Bird S, Klein E, Loper E. Natural Language Processing with Python[M]. California: O’Reilly Media, Inc., 2009.
[6] Hardie A. CQPweb—Combining Power, Flexibility and Usability in a Corpus Analysis Tool[J]. International Journal of Corpus Linguistics, 2012, 17(3): 380-409.
doi: 10.1075/ijcl.17.3.04har
[7] Davies M. Corpus of Global Web-Based English(GloWbE)[EB/OL]. [2021-10-01]. https://www.english-corpora.org/glowbe/.
[8] Davies M. The iWeb Corpus[EB/OL]. [2021-10-01]. https://www.english-corpora.org/iWeb/.
[9] Kilgarriff A, Baisa V, Bušta J, et al. The Sketch Engine: Ten Years on[J]. Lexicography, 2014, 1(1): 7-36.
doi: 10.1007/s40607-014-0009-9
[10] Zhan Weidong, Guo Rui, Chang Baobao, et al. The Building of the CCL Corpus: Its Design and Implementation[J]. Corpus Linguistics, 2019, 6(1): 71-86, 116. (in Chinese)
[11] Xun Endong, Rao Gaoqi, Xiao Xiaoyue, et al. The Construction of the BCC Corpus in the Age of Big Data[J]. Corpus Linguistics, 2016, 3(1): 93-109, 118. (in Chinese)
[12] Xiao Hang. On the Construction and Application of Contemporary Chinese Corpus[J]. Journal of Chinese World, 2010(106): 24-29. (in Chinese)
[13] Luotolahti J, Kanerva J, Pyysalo S, et al. SETS: Scalable and Efficient Tree Search in Dependency Graphs[C]// Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics:Demonstrations. 2015: 51-55.
[14] Luotolahti J, Kanerva J, Ginter F. Dep_Search: Efficient Search Tool for Large Dependency Parsebanks[C]// Proceedings of the 21st Nordic Conference on Computational Linguistics. 2017: 255-258.
[15] Valenzuela-Escárcega M A, Hahn-Powell G, Surdeanu M. Odin’s Runes: A Rule Language for Information Extraction[C]// Proceedings of the 10th International Conference on Language Resources and Evaluation. 2016: 322-329.
[16] Valenzuela-Escárcega M A, Hahn-Powell G, Bell D. Odinson: A Fast Rule-Based Information Extraction Framework[C]// Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 2183-2191.
[17] Shlain M, Taub-Tabib H, Sadde S, et al. Syntactic Search by Example[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics:System Demonstrations. 2020: 17-23.
[18] Petersen U. Querying Both Parallel and Treebank Corpora: Evaluation of a Corpus Query System[C]// Proceedings of the 5th International Conference on Language Resources and Evaluation. 2006: 2457-2459.
[19] Petersen U. Emdros: A Text Database Engine for Analyzed or Annotated Text[C]// Proceedings of the 20th International Conference on Computational Linguistics. 2004: 1190-1193.
[20] Augustinus L, Vandeghinste V, van Eynde F. Example-Based Treebank Querying[C]// Proceedings of the 8th International Conference on Language Resources and Evaluation. 2012: 3161-3167.
[21] Augustinus L, Vandeghinste V, Schuurman I, et al. GrETEL: A Tool for Example-Based Treebank Mining[A]// Odijk J, van Hessen A. CLARIN in the Low Countries[M]. London: Ubiquity Press, 2017: 269-280.
[22] Brants S, Dipper S, Eisenberg P, et al. TIGER: Linguistic Interpretation of a German Corpus[J]. Research on Language and Computation, 2004, 2(4): 597-620.
doi: 10.1007/s11168-004-7431-3
[23] Maryns H, Kepser S. MonaSearch—A Tool for Querying Linguistic Treebanks[C]// Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories. 2009: 29-40.
[24] Mírovský J. Netgraph—A Tool for Searching in the Prague Dependency Treebank 2.0[D]. Prague: Charles University, 2008.
[25] Ma Luyao, Xia Bo, Xiao Ye, et al. Structural Retrieval on Chinese Syntax Tree Corpus[J]. Acta Electronica Sinica, 2020, 48(5): 833-839. (in Chinese)
[26] Petersen U. Evaluating Corpus Query Systems on Functionality and Speed: TIGERSearch and Emdros[C]// Proceedings of the 2005 International Conference Recent Advances in Natural Language Processing. 2005: 387-391.
[27] Evert S, The CWB Development Team. CQP Interface and Query Language Manual[EB/OL]. [2021-12-02]. https://cwb.sourceforge.io/files/CQP_Tutorial.pdf.
[28] Machálek T. KonText: Advanced and Flexible Corpus Query Interface[C]// Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 7003-7008.
[29] de Does J, Niestadt J, Depuydt K. Creating Research Environments with BlackLab[A]// Odijk J, van Hessen A. CLARIN in the Low Countries[M]. London: Ubiquity Press, 2017: 245-257.
[30] Rohde D L T. Tgrep2 User Manual[OL]. https://web.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html.
[31] Levy R, Andrew G. Tregex and Tsurgeon: Tools for Querying and Manipulating Tree Data Structures[C]// Proceedings of the 5th International Conference on Language Resources and Evaluation. 2006: 2231-2234.
[32] Lai C, Bird S. Querying Linguistic Trees[J]. Journal of Logic, Language and Information, 2009, 19(1): 53-73.
doi: 10.1007/s10849-009-9086-9
[33] Lai C, Bird S. Querying and Updating Treebanks: A Critical Survey and Requirements Analysis[C]// Proceedings of the 2004 Australasian Language Technology Workshop. 2004: 139-146.