Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 20-33    DOI: 10.11925/infotech.2096-3467.2021.0606
Mining Policy Text Relevance with Syntactic Structure and Semantic Information
Wu Kaibiao, Lang Yuxiang, Dong Yu
National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper proposes a new method for analyzing policy text relevance that retrieves deeper semantic information. [Methods] First, we built a new algorithm combining dependency parsing with a word embedding model. Then, we analyzed the semantic relevance of policy texts at both the sentence level and the word-meaning level. The method exploits the linguistic characteristics of policy texts to establish extraction rules for dependency syntax. [Results] On a test dataset with a relatively low degree of policy text association, the new algorithm's F1 value reached 0.857, which was 22.78% higher than that of the algorithm fusing TF-IDF and cosine similarity. The method also characterized policy text relevance through subtle differences in wording. [Limitations] For semantic information mining, more research is needed to train word vector models for specific policy domains and further improve their accuracy. For sentence information mining, the accuracy of existing dependency parsing tools could be improved. [Conclusions] The proposed algorithm effectively reveals associations between policy texts and offers new perspectives and tools for quantitative research on policy texts.
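The baseline named in the results, TF-IDF fused with cosine similarity, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the toy pre-tokenised documents, the smoothed IDF variant, and the function names `tfidf_vectors` and `cosine` are all assumptions. The value 0.265 is the optimal similarity value reported for the baseline; treating it as a decision threshold is also an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenised, non-empty document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; other formulas exist).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical, already-segmented policy sentences (illustrative only).
docs = [
    ["talent", "training", "artificial", "intelligence"],
    ["talent", "introduction", "artificial", "intelligence"],
    ["rural", "land", "reform"],
]
vecs = tfidf_vectors(docs)

THRESHOLD = 0.265  # optimal similarity value reported for the baseline
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
is_related = sim_related >= THRESHOLD
```

Because this baseline compares only surface word overlap, two sentences that share vocabulary but differ in syntactic role score the same as true paraphrases, which is the gap the dependency-parsing and word-embedding approach is meant to address.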

Key words: Policy Text Relevance; Dependency Parsing; Word Embedding
Received: 20 June 2021      Published: 21 June 2022
ZTFLH: D630; TP391
Fund: Project of Literature and Information Capacity Building, Chinese Academy of Sciences (Y9290002)
Corresponding Author: Dong Yu, ORCID: 0000-0001-9006-5462, E-mail: dongy@mail.las.ac.cn

Cite this article:

Wu Kaibiao, Lang Yuxiang, Dong Yu. Mining Policy Text Relevance with Syntactic Structure and Semantic Information. Data Analysis and Knowledge Discovery, 2022, 6(5): 20-33.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0606     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I5/20

Experimental Design
Schematic Diagram of Dependency Parsing Structure
Classification of Test Sets for Policy Text Relevance Verification
Schematic Diagram of Policy Text Relevance Mining Process
Heat Map of Test Dataset Similarity
Algorithm Performance Trend with Similarity Threshold
Relevance computation method                    Optimal similarity value   P       R       F1
Proposed method                                 0.345                      0.875   0.840   0.857
Method based on TF-IDF and cosine similarity    0.265                      0.633   0.780   0.698
Experimental Results of the Proposed Method Compared with the Conventional Topic Model
Schematic Diagram of the Test Dataset Related with Word “Talent”
Schematic Diagram of Policy Text Association Mining Based on Comparison of Word Forms
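
The F1 values and the 22.78% improvement quoted in the abstract can be checked from the precision (P) and recall (R) figures in the results table. The sketch below is only an arithmetic check; the function name `f1_score` is an assumption, and the inputs are the rounded values as printed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded P and R values as printed in the results table.
proposed_f1 = f1_score(0.875, 0.840)
baseline_f1 = f1_score(0.633, 0.780)

# Relative improvement computed from the reported F1 values (0.857 vs 0.698).
relative_gain_pct = (0.857 - 0.698) / 0.698 * 100
```

With these inputs the proposed method's F1 rounds to the reported 0.857. The recomputed baseline differs from the reported 0.698 only in the third decimal place, presumably because the printed P and R are themselves rounded.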