Extracting Value Elements and Constructing Index System for Calligraphy Works Based on Hyperplane-BERT-Louvain Optimized LDA Model*

doi:10.11925/infotech.2096-3467.2022.0915

Data Analysis and Knowledge Discovery

2023, Vol. 7, Issue 10: 109-118. https://doi.org/10.11925/infotech.2096-3467.2022.0915

Research Article

Pan Xiaoyu¹, Ni Yuan²,³, Jin Chunhua², Zhang Jian²,³

¹Computer School, Beijing Information Science & Technology University, Beijing 100192, China
²School of Economics & Management, Beijing Information Science & Technology University, Beijing 100192, China
³Beijing Key Laboratory of Big Data Decision Making for Green Development, Beijing 100192, China


Abstract

[Objective] Valuations of calligraphy works diverge widely and lack agreed standards. This paper uses big data and artificial intelligence methods to identify the value elements of calligraphy works efficiently and accurately, providing technical support for calligraphy trading activities. [Methods] First, we combine a hyperplane algorithm with the BERT model to remove stop words from calligraphy literature and expand it semantically, producing an optimized, highly discriminative corpus. Second, we construct a complex semantic network of the calligraphy literature and apply the Louvain algorithm, determining the optimal number of topics by maximizing the modularity of the community network. Finally, we propose a new method based on Hyperplane-BERT-Louvain-LDA (HBL-LDA) to construct an evaluation index system for calligraphy value. [Results] Compared with LDA, HBL-LDA improves topic-recognition precision and F value by 45.00 and 29.46 percentage points, respectively. The average topic quality rate decreases by 0.96, and more high-quality topics are identified. Validating the evaluation index system on representative calligraphy works with several regression models gives an accuracy of up to 84.00%. [Limitations] The evaluation index system was built only for calligraphy works and is difficult to adapt to data on other artworks. In addition, the BERT model lacks topic-level semantic information, which limits the expansion of similar feature words. [Conclusions] This paper proposes a new approach to constructing a calligraphy value evaluation index system with an LDA model optimized by the Hyperplane-BERT-Louvain combination, offering a new direction for building index systems in other fields.

Keywords: Value Evaluation Index System, LDA, Domain Stop Words, Louvain, BERT
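The modularity criterion named in the Methods can be illustrated with a minimal sketch. It only scores given partitions of a small word graph and takes the community count of the best-scoring partition as the topic number. It does not implement the Louvain search itself, and it is not the authors' semantic network: the word graph, the edge weights, and the candidate partitions are all hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, weighted graph.

    edges: list of (u, v, weight) tuples, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    total = 0.0                     # m: sum of all edge weights
    internal = defaultdict(float)   # edge weight inside each community
    degree = defaultdict(float)     # summed weighted degree of each community
    for u, v, w in edges:
        total += w
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    # Q = sum over communities c of  L_c / m - (d_c / 2m)^2
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


# Hypothetical co-occurrence graph: two tightly linked word groups, one bridge.
edges = [("ink", "brush", 1.0), ("brush", "stroke", 1.0), ("ink", "stroke", 1.0),
         ("price", "auction", 1.0), ("auction", "market", 1.0),
         ("price", "market", 1.0), ("stroke", "price", 1.0)]
nodes = sorted({n for u, v, _ in edges for n in (u, v)})

# Hypothetical candidate partitions; Louvain would search this space greedily.
candidates = {
    "one community": {n: 0 for n in nodes},
    "two communities": {n: 0 if n in ("ink", "brush", "stroke") else 1
                        for n in nodes},
    "singletons": {n: i for i, n in enumerate(nodes)},
}

best_name = max(candidates, key=lambda name: modularity(edges, candidates[name]))
n_topics = len(set(candidates[best_name].values()))
print(best_name, n_topics)  # prints "two communities 2"
```

In this toy case the two-group partition scores Q = 5/14, the single community scores 0, and the singleton partition scores below 0, so the sketch would pass 2 to LDA as the topic number.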

Received: 2022-08-29 Published: 2023-12-20

CLC Number (ZTFLH): TP391; G353

Funding: *National Key R&D Program of China, Young Scientists Project (2021YFF0900200)

Corresponding author: Ni Yuan, ORCID: 0000-0002-0600-2619, E-mail: niyuan230@163.com

Cite this article:

Pan Xiaoyu, Ni Yuan, Jin Chunhua, Zhang Jian. Extracting Value Elements and Constructing Index System for Calligraphy Works Based on Hyperplane-BERT-Louvain Optimized LDA Model[J]. Data Analysis and Knowledge Discovery, 2023, 7(10): 109-118.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0915 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I10/109

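Table 1 Summary of calligraphy value evaluation index systems

Table 2 Studies on topic number determination, stop word filtering, and semantic enhancement

Fig. 1 Overall framework for constructing and validating the calligraphy value evaluation index system

Fig. 2 Schematic diagram of similar feature word expansion

Table 3 Data distribution of the target set and the auxiliary set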

Table 4 Topics extracted by different models

Model	Number of topics	Domain stop word filtering and similar feature word expansion	$T_{extract}$	$T_{correct}$	$T_{standard}$	Precision/%	Recall/%	F value/%
LDA	55	No	32	8	11	25.00	72.72	37.20
HB-LDA	55	Yes	34	9	11	26.47	81.81	39.99
L-LDA	10	No	10	6	11	60.00	54.54	57.13
HBL-LDA	10	Yes	10	7	11	70.00	63.63	66.66
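The precision, recall, and F value columns of Table 4 can be reproduced from the three topic counts. The sketch below assumes the standard definitions (precision = correct / extracted, recall = correct / standard, F = harmonic mean of the two). The page does not state these formulas, but they are consistent with the reported figures. The function name is illustrative only.

```python
def precision_recall_f(t_correct, t_extract, t_standard):
    """Return precision, recall and F value in percent (assumed definitions)."""
    precision = 100.0 * t_correct / t_extract
    recall = 100.0 * t_correct / t_standard
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# (T_extract, T_correct, T_standard) as reported in Table 4.
counts = {
    "LDA": (32, 8, 11),
    "HB-LDA": (34, 9, 11),
    "L-LDA": (10, 6, 11),
    "HBL-LDA": (10, 7, 11),
}

scores = {model: precision_recall_f(correct, extract, standard)
          for model, (extract, correct, standard) in counts.items()}

for model, (p, r, f) in scores.items():
    print(f"{model}: P={p:.2f} R={r:.2f} F={f:.2f}")

# Gains of HBL-LDA over plain LDA, in percentage points.
gain_p = scores["HBL-LDA"][0] - scores["LDA"][0]
gain_f = scores["HBL-LDA"][2] - scores["LDA"][2]
print(f"gain: precision +{gain_p:.2f}, F +{gain_f:.2f} percentage points")
```

The gains come out as 45.00 and 29.46 percentage points, matching the abstract. Individual recall and F values can differ from the table by 0.01 in the last digit, because the table appears to truncate rather than round.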

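Table 5 Top 10 topic words and topic quality rates extracted by the LDA model

Table 6 Top 10 topic words and topic quality rates extracted by the HBL-LDA model

Table 7 Calligraphy value evaluation index system

Table 8 Prediction accuracy of the regression models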

