Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning
Ye Guanghui, Li Songye, Song Xiaoying
School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract [Objective] This study uses machine learning to derive multiple labels for long social media texts, aiming to provide new ideas for urban portrait text analysis and related studies. It addresses problems facing urban portrait data analysis: the relevant texts are unstructured, vary in length, and cover more than one topic. [Methods] We retrieved social media texts on urban impressions from the Zhihu platform and applied sentence segmentation and noise reduction to them. We then manually annotated a subset of the texts using the existing urban portrait annotation framework. Next, we trained support vector classification, convolutional neural network, and Naive Bayes models and comprehensively evaluated their performance. We used the best-performing model to obtain all labels for the long texts, and used the ML-kNN multi-label learning model to train a multi-label social text classifier. [Results] Among the single-label text classification models, the support vector classification model had the best overall performance, with an accuracy of 0.6900 for short-text labeling. The multi-label text classification model built with ML-kNN reached a highest accuracy of 0.8103, with an average Hamming loss of 0.0353. [Limitations] The impact of textual context on topic classification still needs to be fully considered. [Conclusions] Based on long social text data from the Zhihu platform, the proposed multi-label classification model can effectively identify multiple labels for long social texts on urban portraits.
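As a rough illustration of the evaluation described in the abstract, the sketch below merges sentence-level labels into a label set for a long text and computes the Hamming loss against a reference annotation. The label names, the union-based merging rule, and the helper functions `aggregate_labels`, `to_indicator`, and `hamming_loss` are assumptions made for this example; the abstract does not specify the label inventory or how sentence labels are combined.

```python
from typing import Iterable, List, Sequence, Set

# Hypothetical label inventory; the paper's actual annotation
# framework is not listed in the abstract.
LABELS = ["economy", "culture", "transport", "environment"]


def aggregate_labels(sentence_labels: Iterable[str]) -> Set[str]:
    """Assumed merging rule: a long text receives the union of the
    labels predicted for its individual sentences."""
    return set(sentence_labels)


def to_indicator(label_set: Set[str],
                 labels: Sequence[str] = LABELS) -> List[int]:
    """Encode a label set as a 0/1 vector over the label inventory."""
    return [1 if name in label_set else 0 for name in labels]


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of (text, label) cells where prediction and truth differ."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("rows must have the same number of labels")
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    if total == 0:
        raise ValueError("no labels to compare")
    return wrong / total


# Two toy long texts: reference label sets and per-sentence predictions.
true_sets = [{"economy", "culture"}, {"transport"}]
sentence_predictions = [["economy", "economy", "transport"], ["transport"]]

y_true = [to_indicator(s) for s in true_sets]
y_pred = [to_indicator(aggregate_labels(p)) for p in sentence_predictions]

print(hamming_loss(y_true, y_pred))  # 2 mismatched cells out of 8 -> 0.25
```

A lower Hamming loss means fewer individual label decisions are wrong, so it complements accuracy when each text can carry several labels.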
Received: 30 June 2022
Published: 09 November 2022
Fund: National Natural Science Foundation of China (71804055)
Corresponding Author: Ye Guanghui, ORCID: 0000-0001-8111-5034, E-mail: 3879-4081@163.com