Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 20-33     https://doi.org/10.11925/infotech.2096-3467.2021.0606
Research Paper
Mining Policy Text Relevance with Syntactic Structure and Semantic Information*
Wu Kaibiao, Lang Yuxiang, Dong Yu
National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper explores a method for mining relevance between policy texts, aiming to capture deeper semantic associations. [Methods] The method combines dependency parsing with a word embedding model to mine deep semantic associations in policy text content from two angles: sentence-level information and word-meaning information. The dependency extraction rules are designed with the wording conventions of policy texts in mind. [Results] On a test dataset in which the degrees of policy text relevance are relatively hard to tell apart, the proposed method reaches an F1 value of 0.857, a 22.78% improvement over an algorithm combining TF-IDF and cosine similarity. Functionally, the method can characterize policy text relevance through subtle differences in wording. [Limitations] For semantic information mining, the method currently uses an open-source model; training word vector models for specific policy domains could further improve accuracy. For sentence information mining, the method depends on the accuracy of existing dependency parsing tools. [Conclusions] The proposed method performs well, can effectively reveal the degree of relevance between policy texts, and offers a new perspective and tool for quantitative research on policy texts.

Keywords: Policy Text Relevance; Dependency Parsing; Word Embedding
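The baseline named in the abstract, TF-IDF combined with cosine similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothed IDF formula, the function names `tfidf_vectors` and `cosine`, and the toy pre-segmented clauses are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, pre-segmented clauses (hypothetical; not taken from the paper's dataset).
docs = [
    ["strengthen", "talent", "training"],
    ["strengthen", "talent", "recruitment"],
    ["improve", "infrastructure", "construction"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares two surface forms
sim_unrelated = cosine(vecs[0], vecs[2])  # shares none
```

A baseline of this kind scores only exact surface-form overlap, so "training" and "recruitment" contribute nothing to the similarity even though they are semantically close. That is plausibly the kind of gap a word embedding component is meant to narrow, though the paper's actual scoring scheme is not given on this page.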
Received: 2021-06-20      Published: 2022-06-21
CLC number (ZTFLH): D630; TP391
Funding: *This work is one of the research outcomes of the Literature and Information Capacity Building Project of the Chinese Academy of Sciences (Y9290002).
Corresponding author: Dong Yu, ORCID: 0000-0001-9006-5462     E-mail: dongy@mail.las.ac.cn
Cite this article:
Wu Kaibiao, Lang Yuxiang, Dong Yu. Mining Policy Text Relevance with Syntactic Structure and Semantic Information. Data Analysis and Knowledge Discovery, 2022, 6(5): 20-33.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0606      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/20
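Fig.1  Overall experimental design
Fig.2  Schematic of the dependency parsing structure
Fig.3  Category division of the test set used to validate policy text relevance
Fig.4  Schematic of the policy text relevance mining process
Fig.5  Similarity heat map of the test dataset
Fig.6  Algorithm performance as a function of the similarity threshold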
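Fig. 6 concerns how performance changes with the similarity threshold. The sketch below shows one common way to run such a sweep: treat every pair scoring at or above a threshold as related, compute precision, recall, and F1, and keep the threshold with the highest F1. The scores, labels, candidate grid, and function names are hypothetical and are not the paper's data or procedure.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat pairs scoring at or above the threshold as predicted-related."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def best_threshold(scores, labels, candidates):
    """Return (threshold, f1) with the highest F1; ties keep the earliest candidate."""
    best = (None, -1.0)
    for t in candidates:
        f1 = precision_recall_f1(scores, labels, t)[2]
        if f1 > best[1]:
            best = (t, f1)
    return best


# Hypothetical similarity scores and gold labels (True = related pair).
scores = [0.10, 0.20, 0.30, 0.40, 0.55, 0.70]
labels = [False, True, False, True, True, True]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, f1 = best_threshold(scores, labels, candidates)
```

On this toy data the sweep settles on a low threshold, because admitting one false positive costs less F1 than missing a true pair. A real evaluation would use a held-out set to avoid tuning the threshold on the test data.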
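Relevance calculation method | Optimal similarity value | P | R | F1
Proposed method | 0.345 | 0.875 | 0.840 | 0.857
Method based on TF-IDF and cosine similarity | 0.265 | 0.633 | 0.780 | 0.698
Table 1  Experimental comparison of the proposed method with the conventional topic model method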
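The figures in Table 1 can be cross-checked with the standard definition F1 = 2PR / (P + R), and the 22.78% improvement quoted in the abstract follows from the two reported F1 values as a relative gain. The short check below uses only the numbers reported in the table; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in Table 1.
proposed = {"P": 0.875, "R": 0.840, "F1": 0.857}
baseline = {"P": 0.633, "R": 0.780, "F1": 0.698}

# Recompute F1 from the rounded P and R values.
f1_proposed = f1_score(proposed["P"], proposed["R"])
f1_baseline = f1_score(baseline["P"], baseline["R"])

# Relative improvement of the reported F1 values, in percent.
relative_gain = (proposed["F1"] - baseline["F1"]) / baseline["F1"] * 100
```

The recomputed baseline F1 comes out slightly above the reported 0.698 (closer to 0.699), a gap small enough to be explained by P and R themselves being rounded to three decimals.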
Fig.7  Policy text relevance in the test dataset through the word "talent" (人才)
Fig.8  Policy text relevance mining based on identical word-form comparison