Generating AND-OR Logical Expressions for Semantic Features of Categorical Documents
Xu Zheng, Le Xiaoqiu
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper represents each category unit of a categorical document as an AND-OR logical expression over its semantic features, providing data for category-level semantic matching and retrieval. [Methods] Based on AND-OR logical semantic annotations of category unit descriptions, we constructed a seq2seq generation model using UniLM. The model learns part-of-speech features and explicit AND-OR logical text features in order to improve the candidate-ranking strategy of beam search. The proposed method generates an AND-OR logical expression of the semantic features within a category unit. By integrating context-level semantics, we also extended the external semantics of each category unit. [Results] We evaluated the method on manually annotated International Patent Classification data. It achieved an evaluation score of 87.2 points, 11.5 points higher than the baseline model (BiLSTM-Attention). [Limitations] More research is needed to examine the model's performance on other datasets. [Conclusions] The proposed semantic representation method can effectively generate AND-OR logical expressions for patent data, integrating the internal semantic features of each category unit with semantic features at the context level.
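As a rough illustration of what an AND-OR logical expression over semantic features might look like as data, and of how it could support category matching, the following is a minimal Python sketch. It is not the authors' implementation: the class names (`Feature`, `And`, `Or`), the helper functions (`matches`, `to_text`), and the example category unit are all hypothetical and are not taken from the paper or from the International Patent Classification.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class Feature:
    """A single semantic feature (hypothetical representation)."""
    name: str


@dataclass(frozen=True)
class And:
    """All operands must hold."""
    operands: Tuple["Expr", ...]


@dataclass(frozen=True)
class Or:
    """At least one operand must hold."""
    operands: Tuple["Expr", ...]


Expr = Union[Feature, And, Or]


def matches(expr: Expr, features: Set[str]) -> bool:
    """Check whether a set of feature names satisfies the expression."""
    if isinstance(expr, Feature):
        return expr.name in features
    if isinstance(expr, And):
        return all(matches(op, features) for op in expr.operands)
    if isinstance(expr, Or):
        return any(matches(op, features) for op in expr.operands)
    raise TypeError(f"unsupported expression node: {expr!r}")


def to_text(expr: Expr) -> str:
    """Serialize the expression as a flat string with explicit AND/OR."""
    if isinstance(expr, Feature):
        return expr.name
    joiner = " AND " if isinstance(expr, And) else " OR "
    return "(" + joiner.join(to_text(op) for op in expr.operands) + ")"


# Invented category unit: "bicycle frames made of metal or plastic".
example = And((
    Feature("bicycle frame"),
    Or((Feature("metal"), Feature("plastic"))),
))

print(to_text(example))
print(matches(example, {"bicycle frame", "plastic"}))
print(matches(example, {"metal"}))
```

In this sketch, a document whose extracted features include "bicycle frame" and "plastic" satisfies the expression, while one that mentions only "metal" does not. A flat string form such as the one produced by `to_text` is one plausible target format for a seq2seq generator, though the paper's actual output format is not specified in the abstract.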