Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 95-103     https://doi.org/10.11925/infotech.2096-3467.2021.0023
Research Paper
Generating AND-OR Logical Expressions for Semantic Features of Categorical Documents
Xu Zheng, Le Xiaoqiu
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper represents the category units of categorical documents as AND-OR logical expressions over semantic features. This gives categorical documents a semantic representation and supplies semantic data for applications such as category semantic matching and semantic retrieval. [Methods] Starting from AND-OR logical semantic annotations of category-unit description and annotation text, we built a Seq2Seq generation model on UniLM. The model learns part-of-speech features and explicit AND-OR logical text-description features, and decodes with an improved Beam Search ranking strategy, to generate the AND-OR logical expression of the semantic features inside a category unit. We then fuse contextual hierarchical semantics to extend the external semantics of the category unit. [Results] In experiments on manually annotated International Patent Classification data, the method achieved an evaluation score of 87.2 points, 11.5 points higher than the baseline model (BiLSTM-Attention). [Limitations] The method is tailored to the characteristics of category data in the International Patent Classification; its generalization needs further verification on data from other domains. [Conclusions] The proposed category-unit semantic representation method performs well on the International Patent Classification and can effectively generate AND-OR logical expressions that fuse the internal semantic features of a category unit with the semantic features of its contextual hierarchy.

Keywords: Semantic Representation; Semantic Parsing; AND-OR Logic; Categorical Document
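The Methods summary mentions decoding with Beam Search. This page does not describe the paper's improved ranking strategy, so the following is only a minimal sketch of plain beam search. The scorer `toy_step` and its probabilities are hypothetical stand-ins for a trained Seq2Seq model, and the default width of 3 assumes that the "Beam Search 3" entry in Table 2 is the beam width.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: maps a partial token sequence to next-token
# log-probabilities. A real system would query the Seq2Seq model here.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(step: StepFn, beam_width: int = 3, max_len: int = 8,
                eos: str = "<eos>") -> List[Tuple[Tuple[str, ...], float]]:
    """Plain beam search: keep the beam_width best candidates at each step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step(seq).items():
                candidates.append((seq + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences ending in eos leave the beam; the rest keep growing.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_step(seq: Tuple[str, ...]) -> Dict[str, float]:
    """Toy next-token distribution (an assumption, for illustration only)."""
    table = {
        (): {"A": math.log(0.6), "B": math.log(0.4)},
        ("A",): {"OR": math.log(0.7), "AND": math.log(0.3)},
        ("B",): {"<eos>": math.log(0.6), "AND": math.log(0.4)},
        ("A", "OR"): {"B": math.log(0.9), "<eos>": math.log(0.1)},
    }
    return table.get(seq, {"<eos>": 0.0})


best_sequence, best_score = beam_search(toy_step)[0]
```

An improved ranking strategy would replace the plain log-probability sort with a different candidate score; how the paper does this is not stated here.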
Received: 2021-01-10      Published: 2021-05-27
Chinese Library Classification (ZTFLH):  TP391
Corresponding author: Le Xiaoqiu     E-mail: lexq@mail.las.ac.cn
Cite this article:
Xu Zheng, Le Xiaoqiu. Generating AND-OR Logical Expressions for Semantic Features of Categorical Documents[J]. Data Analysis and Knowledge Discovery, 2021, 5(5): 95-103.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0023      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I5/95
Fig.1  Logical expression of the semantic features of a category unit
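Category | Annotation | AND-OR logical combination feature
E01C 21/02 | 现场熔化、煅烧或焙烧土壤 (melting, calcining, or roasting soil in situ) | 现场 AND (熔化 OR 煅烧 OR 焙烧) AND 土壤
E02B 7/16 | 固定堰;其上部结构或闸板 (fixed weirs; their superstructures or gates) | 固定堰 OR (固定堰 AND (上部结构 OR 闸板))
Table 1  AND-OR logical combination features of category sentences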
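To make the two expressions in Table 1 concrete, the sketch below parses an AND-OR expression into a tree and expands it into a flat list of alternative term combinations (disjunctive normal form). It is an illustration, not the paper's method. The function names are hypothetical, and two conventions are assumed: AND binds tighter than OR when no parentheses are given, and no term contains the literal strings `AND` or `OR`.

```python
import re
from itertools import product


def tokenize(expr):
    """Split an expression into terms, operators, and parentheses."""
    parts = re.split(r"(AND|OR|\(|\))", expr)
    return [p.strip() for p in parts if p.strip()]


def parse(tokens):
    """Recursive-descent parse into a term string or an (op, children) tuple."""
    pos = 0

    def parse_or():
        nonlocal pos
        items = [parse_and()]
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            items.append(parse_and())
        return items[0] if len(items) == 1 else ("OR", items)

    def parse_and():
        nonlocal pos
        items = [parse_atom()]
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            items.append(parse_atom())
        return items[0] if len(items) == 1 else ("AND", items)

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        if tok == "(":
            pos += 1
            node = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok in ("AND", "OR", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        pos += 1
        return tok

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def to_dnf(node):
    """Expand a tree into a list of conjunctions, each a tuple of terms."""
    if isinstance(node, str):
        return [(node,)]
    op, children = node
    expanded = [to_dnf(child) for child in children]
    if op == "OR":
        return [conj for alternatives in expanded for conj in alternatives]
    # AND: combine one conjunction from every child.
    return [tuple(t for conj in combo for t in conj)
            for combo in product(*expanded)]


dnf_soil = to_dnf(parse(tokenize("现场AND(熔化OR煅烧OR焙烧)AND土壤")))
dnf_weir = to_dnf(parse(tokenize("固定堰OR(固定堰AND(上部结构OR闸板))")))
```

Each resulting tuple is one combination of terms that satisfies the expression, which is one possible form for downstream matching or retrieval.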
Fig.2  Technical approach
Fig.3  Seq2Seq mask training mechanism of the UniLM model[20]
Fig.4  Feature fusion
Fig.5  Results of constructing hierarchical-relation semantics
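Parameter | Value
Batch Size | 8
Learning Rate | 10^-5
hidden_act | GELU
Hidden layer units | 768
hidden_dropout_prob | 0.1
Text truncation length | 128
Character vector dimension | 768
Part-of-speech vector dimension | 768/2
Explicit syntactic-logic feature vector dimension | 768/2
Beam Search | 3
Table 2  Experimental parameter configuration of the model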
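The values in Table 2 can be collected into a small configuration object, which also makes the dimension arithmetic explicit: the two auxiliary feature vectors are each 768/2 = 384 wide, so together they match the 768-dimensional character vector. The sketch below rests on several assumptions. The field names are hypothetical, the learning rate is read as 1e-5, "Beam Search 3" is taken as the beam width, and `fuse` shows only one plausible fusion scheme (concatenate, then add); the page gives Fig. 4 only as a caption, so the paper's actual scheme may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values from Table 2; field names are hypothetical."""
    batch_size: int = 8
    learning_rate: float = 1e-5      # assumes "10-5" means 10^-5
    hidden_act: str = "gelu"
    hidden_size: int = 768
    hidden_dropout_prob: float = 0.1
    max_seq_len: int = 128
    char_dim: int = 768
    beam_width: int = 3              # assumes "Beam Search 3" is the beam width

    @property
    def pos_dim(self) -> int:
        return self.hidden_size // 2   # 768/2

    @property
    def logic_dim(self) -> int:
        return self.hidden_size // 2   # 768/2


def fuse(char_vec, pos_vec, logic_vec):
    """ASSUMPTION: concatenate the POS and logic vectors, then add the result
    element-wise to the character vector. Not confirmed by the source."""
    side = list(pos_vec) + list(logic_vec)
    if len(side) != len(char_vec):
        raise ValueError("auxiliary features must match the character dimension")
    return [c + s for c, s in zip(char_vec, side)]


cfg = ExperimentConfig()
rng = random.Random(0)  # random placeholders standing in for learned embeddings
char_vec = [rng.random() for _ in range(cfg.char_dim)]
pos_vec = [rng.random() for _ in range(cfg.pos_dim)]
logic_vec = [rng.random() for _ in range(cfg.logic_dim)]
fused = fuse(char_vec, pos_vec, logic_vec)
```

With these sizes the fused vector keeps the 768-dimensional width of the hidden layer, so it could feed the encoder without a projection.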
Model | Score
BiLSTM+Attention | 75.7
BiLSTM+CNN | 76.1
BERT-Seq2Seq | 83.4
Proposed model | 87.2
Table 3  Model scores
Category annotation | BiLSTM+Attention | BiLSTM+CNN | BERT-Seq2Seq | Proposed model
缘饰;装修条 (edgings; trim strips) | 缘饰 OR 装修条 | 缘饰 OR 装修条 | 缘 饰 OR 装 修 条 | ( 缘 饰 OR 装 修 条 )
装纳公用管线用的 (for accommodating utility lines) | 装纳 AND 公用 AND 管道 | 装纳 AND 公用 AND 管道 | 装 卸 AND 公 用 AND 管 道 | 装 纳 AND 公 用 AND 管 道
清除道碴;所用设备 (removing ballast; equipment used for this) | ( 清除 OR 道碴 ) AND ( ( 清除 OR 测量 ) AND ( AND | ( 清除 AND 道碴) OR ( ( 清除 AND 道碴 ) AND ) ) | ( 清 除 AND 道碴) OR ( ( 清 除 AND 道 碴 ) AND 设 备 ) ) | ( 清 除 AND 道碴) OR ( ( 清 除 AND 道 碴 ) AND 设 备 ) )
Table 4  Examples of model-generated results
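Several generated strings in Table 4 are not valid expressions: some have a dangling operator or an unmatched parenthesis. A simple structural check can flag such outputs before they are used. The sketch below is a hypothetical helper, not part of the paper's evaluation. It treats any run of characters between operators and parentheses as one term, so character-spaced output such as `缘 饰` counts as a single operand.

```python
import re


def is_well_formed(expr: str) -> bool:
    """Check parenthesis balance and operand/operator alternation."""
    tokens = [t.strip() for t in re.split(r"(AND|OR|\(|\))", expr) if t.strip()]
    depth = 0
    expect_operand = True  # True at the start and after "(" or an operator
    for tok in tokens:
        if tok == "(":
            if not expect_operand:
                return False
            depth += 1
        elif tok == ")":
            # A ")" must follow an operand and close an open "(".
            if expect_operand or depth == 0:
                return False
            depth -= 1
        elif tok in ("AND", "OR"):
            if expect_operand:
                return False
            expect_operand = True
        else:  # a term
            if not expect_operand:
                return False
            expect_operand = False
    return depth == 0 and not expect_operand


ok_first_row = is_well_formed("( 缘 饰 OR 装 修 条 )")
ok_second_row = is_well_formed("装 纳 AND 公 用 AND 管 道")
# A string from the third row in which an AND has no right-hand operand.
ok_dangling = is_well_formed("( 清除 AND 道碴) OR ( ( 清除 AND 道碴 ) AND ) )")
```

The check is purely syntactic: it accepts `装 卸 AND 公 用 AND 管 道` even though the first term differs from the source annotation, so it complements a content-based score and does not replace one.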
[1] Wang Lijie. Research on Chinese Semantic Dependency Analysis[D]. Harbin: Harbin Institute of Technology, 2010.
[2] Qiao Xiuming. Research on Transfer of Dependency Parsing Based on Lexical-level Knowledge[D]. Harbin: Harbin Institute of Technology, 2020.
[3] Robertson S. Understanding Inverse Document Frequency: On Theoretical Arguments for IDF[J]. Journal of Documentation, 2004, 60(5): 503-520.
[4] Gao Jieyun, Zhao Fengyu, Liu Ya. Text Classification of Modified Hybrid Feature Selection Based on Semantic Enhancement[J]. Computer Technology and Development, 2021, 31(1): 24-29.
[5] Zhang Aimin, Jia Junzhi, Hao Qianqian. The Study on Automatic Mapping of Category Between Chinese Library Classification and DDC[J]. New Technology of Library and Information Service, 2014(7): 17-23.
[6] Cheng Jinxiang, Zhang Zhongyue, Cao Miao, et al. Taxonomy Construction and Machine Indexing Strategies of Fishery Patent Literature[J]. Journal of Library and Information Science in Agriculture, 2020, 32(7): 63-72.
[7] Yuan Man, Ouyang Yuanxin, Xiong Zhang, et al. Short Text Feature Extension Method Based on Frequent Term Sets[J]. Journal of Southeast University (Natural Science Edition), 2014, 44(2): 256-260.
[8] Jiang Dapeng. Research on Short Text Classification Based on Word Distributed Representation[D]. Hangzhou: Zhejiang University, 2015.
[9] Fang Donghao. Study and Implementation of Microblog's Short Text Classification Based on LDA[D]. Shenyang: Northeastern University, 2011.
[10] Tian Chuang, Zhao Yajuan. A Similarity-based Model for Mapping Between Patent and Industrial Classifications: Mapping Between the International Patent Classification and the Industrial Classification for National Economic Activities[J]. Library and Information Service, 2016, 60(20): 123-131.
[11] Ma Xiaomeng, Xu Feng, Liu Qingmin, et al. Doc2vec-based Study on Mapping Between Patented and Industrial Categories[J]. Information Research, 2020(6): 67-74.
[12] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment Classification Using Machine Learning Techniques[C]// Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing. 2002: 79-86.
[13] Sundararaman D, Subramanian V, Wang G Y, et al. Syntax-Infused Transformer and BERT Models for Machine Translation and Natural Language Understanding[OL]. arXiv Preprint, arXiv:1911.06156.
[14] Shang Hai, Luo Senlin, Han Lei, et al. Research on Short Text Representation Based on Sentential Semantic Components[J]. Netinfo Security, 2016(5): 64-70.
[15] Mnih V, Heess N, Graves A, et al. Recurrent Models of Visual Attention[OL]. arXiv Preprint, arXiv:1406.6247.
[16] Yue Yongzheng. Research on Classification Method on Chinese Short Texts with Few Words Based on Feature Representation[D]. Hefei: Hefei University of Technology, 2020.
[17] Zhang Hongke, Fu Zhenxin, Ren Qianping, et al. Automated ICD Coding Based on Word Embedding with Entry Embedding and Attention Mechanism[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, 56(1): 1-8.
[18] Dong L, Lapata M. Language to Logical Form with Neural Attention[OL]. arXiv Preprint, arXiv:1601.01280.
[19] Zhang Qiang. Chinese Semantic Parsing Based on Machine Translation[D]. Nanjing: Southeast University, 2015.
[20] Dong L, Yang N, Wang W H, et al. Unified Language Model Pre-training for Natural Language Understanding and Generation[OL]. arXiv Preprint, arXiv:1905.03197.
[21] Sundararaman D, Subramanian V, Wang G Y, et al. Syntax-Infused Transformer and BERT Models for Machine Translation and Natural Language Understanding[OL]. arXiv Preprint, arXiv:1911.06156.
[22] Papineni K, Roukos S, Ward T, et al. BLEU: A Method for Automatic Evaluation of Machine Translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002: 311-318.