[Objective] This study uses machine learning to assign multiple labels to long social media texts. It addresses the difficulties that urban portrait analysis faces with texts that are unstructured, vary in length, and cover more than one topic, and aims to offer new ideas for urban portrait text analysis and related studies. [Methods] We retrieved social media texts on urban impressions from the Zhihu platform, then segmented them into sentences and removed noise. We manually annotated a subset of the texts using an existing urban portrait annotation framework. Next, we trained support vector classification, convolutional neural network, and Naive Bayes models and comprehensively evaluated their performance. We used the best-performing model to obtain all labels for the long texts, and then used the ML-kNN multi-label learning algorithm to train a multi-label classification model for social texts. [Results] Among the single-label text classification models, support vector classification had the best overall performance, with an accuracy of 0.6900 on short-text labeling. The multi-label text classification model built with ML-kNN reached a highest accuracy of 0.8103, with an average Hamming loss of 0.0353. [Limitations] The impact of textual context on topic classification has not yet been fully considered. [Conclusions] Based on long social text data from the Zhihu platform, the proposed multi-label classification model can effectively identify multiple urban portrait labels for long social texts.
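The labeling and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in the abstract: that a long text's label set is the union of the labels predicted for its sentences, that the label names in LABELS are placeholders rather than the paper's annotation framework, and that toy_classifier is a keyword stand-in for the trained support vector classifier. The Hamming loss function follows the standard definition (the fraction of label positions predicted incorrectly).

```python
import re
from typing import Callable, List, Sequence, Set

# Hypothetical label names; the paper's annotation framework is not given here.
LABELS = ["food", "transport", "culture", "economy"]


def split_sentences(text: str) -> List[str]:
    """Naive segmentation on common Chinese and English sentence terminators."""
    parts = re.split(r"[。！？!?.]+", text)
    return [p.strip() for p in parts if p.strip()]


def toy_classifier(sentence: str) -> str:
    """Keyword stand-in for a trained single-label classifier (assumption)."""
    keywords = {"noodles": "food", "subway": "transport", "museum": "culture"}
    for word, label in keywords.items():
        if word in sentence.lower():
            return label
    return "economy"


def labels_for_long_text(text: str, classify: Callable[[str], str]) -> Set[str]:
    """Union of sentence-level labels, used as the long text's label set."""
    return {classify(s) for s in split_sentences(text)}


def to_indicator(label_set: Set[str], labels: Sequence[str] = LABELS) -> List[int]:
    """Encode a label set as a 0/1 vector in the order of `labels`."""
    return [1 if name in label_set else 0 for name in labels]


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of label positions where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


if __name__ == "__main__":
    text = "The noodles are great. The subway is crowded!"
    label_set = labels_for_long_text(text, toy_classifier)
    print(sorted(label_set), to_indicator(label_set))
    print(hamming_loss([[1, 1, 0, 0], [0, 0, 1, 0]],
                       [[1, 0, 0, 0], [0, 0, 1, 1]]))
```

In a fuller pipeline, the indicator vectors would serve as training targets for a multi-label learner such as ML-kNN, and the Hamming loss would be averaged over evaluation folds.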
Ye Guanghui, Li Songye, Song Xiaoying. Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning[J]. Data Analysis and Knowledge Discovery, 2023, 7(5): 60-70.