Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 60-70    DOI: 10.11925/infotech.2096-3467.2022.0673
Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning
Ye Guanghui, Li Songye, Song Xiaoying
School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract  

[Objective] This study uses machine learning to assign multiple labels to long social media texts, offering a new approach to text analysis for urban portraits and related research. It addresses problems that urban portrait analysis faces with its source texts: they are unstructured, vary in length, and cover more than one topic. [Methods] We retrieved social media texts on city impressions from the Zhihu platform, split them into sentences, and removed noise. We then manually annotated a subset of the texts using an existing urban portrait annotation framework. Next, we trained support vector classification (SVC), convolutional neural network (CNN), and Naive Bayes models and compared their performance. We used the best-performing model to obtain all labels for each long text, and then trained a multi-label social text classifier with the ML-kNN multi-label learning model. [Results] Among the single-label text classification models, the SVC model had the best overall performance, with an accuracy of 0.6900 for short-text labeling. The multi-label text classification model built with ML-kNN reached a highest accuracy of 0.8103, with an average Hamming loss of 0.0353. [Limitations] The impact of textual context on topic classification has not been fully considered. [Conclusions] Based on long social text data from the Zhihu platform, the proposed multi-label classification model can effectively identify multiple labels for long social texts about urban portraits.

Keywords: Multi-Label; City Image; Social Text; Text Classification; ML-kNN
Received: 30 June 2022      Published: 09 November 2022
CLC Number (ZTFLH): G350
Fund: National Natural Science Foundation of China (71804055)
Corresponding Author: Ye Guanghui, ORCID: 0000-0001-8111-5034, E-mail: 3879-4081@163.com

Cite this article:

Ye Guanghui, Li Songye, Song Xiaoying. Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning. Data Analysis and Knowledge Discovery, 2023, 7(5): 60-70.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0673     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I5/60

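Level 1 | Level 2 | Level 3
Ecology | Climate conditions | Climate change, season length, temperature, humidity
Ecology | Pollution conditions | Air quality, water quality
Ecology | Geographical environment | Geographic location, natural scenery, natural disasters, natural resources
Culture | History and culture | Historical status, cultural heritage, scenic spots and historic sites, cultural industry
Culture | Language characteristics | Dialect features, Mandarin features
Culture | Food characteristics | Local dishes, eating habits, food flavors, distinctive restaurants, food prices
Culture | Celebrities | /
Culture | Folk customs | /
Culture | Local specialties | /
Society | Education | Prestigious schools, education level, research level
Society | Entertainment | Tourism resources, entertainment venues, attitudes toward entertainment, city attractions
Society | Population | Population characteristics, composition, quality, distribution, size
Society | Transportation | Transport facilities, driver characteristics, traffic conditions
Society | Living experience | Pace of life, livability, atmosphere of daily life, happiness index
Society | Public convenience services | /
Society | Medical care level | /
Society | City appearance | /
Overall | Urban development | Development speed, development status, development prospects
Overall | Urban planning | Green space planning, street layout, scenic area planning, commercial district planning
Overall | Attractiveness | /
Overall | Sentiment | /
Overall | Overall evaluation | /
Overall | Government services | /
Overall | City status | /
Overall | Land area | /
Overall | Inclusiveness | /
Economy | Economic level | /
Economy | Prices | /
Economy | Housing prices | /
Economy | Income | /
Economy | Employment | /
Economy | Wealth gap | /
City Portrait Annotation Framework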
Research Framework
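Index | Original index | Sentence after segmentation
0 | 1 | After four years in Wuhan, here are my impressions of the city.
1 | 1 | The first impression is that it is big. As the saying goes, "great rivers, great Wuhan"; I remember once seeing a ranking of Chinese cities by area in which Wuhan ranked first in the country.
2 | 1 | (I still remember that when I had just started university, I spent two or three days visiting most of the well-known sights, and by the end I just wanted to cry.)
3 | 1 | The second impression is that it is crowded.
Results After Sentence Segmentation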
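The segmentation step illustrated by the table above can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the splitting rule (break after 。！？ and keep a trailing closing bracket or quote with its sentence) and the names `split_sentences` and `segment` are assumptions made for illustration.

```python
import re

# One sentence = a run of non-terminator characters, the terminator(s),
# and any closing quotes or brackets that immediately follow.
# (Assumed rule; the source does not state how sentences were split.)
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*[”’）)」』]*")


def split_sentences(text):
    """Split Chinese text into sentences at 。！？ (and ASCII ! ?)."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def segment(answers):
    """Turn (orig_index, text) pairs into rows with a running sentence index."""
    rows = []
    for orig_index, text in answers:
        for sentence in split_sentences(text):
            rows.append(
                {"index": len(rows), "orig_index": orig_index, "sentence": sentence}
            )
    return rows


if __name__ == "__main__":
    sample = [(1, "在武汉四年,谈谈对武汉的印象。第二个印象是挤。")]
    for row in segment(sample):
        print(row["index"], row["orig_index"], row["sentence"])
```

Keeping the original index on every row makes it possible to regroup sentence-level labels back into the long text they came from.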
Method | Accuracy | Class | Precision | Recall | F1-Score | Support
SVC | 0.7405 | Noise | 0.7324 | 0.5686 | 0.6402 | 2849
SVC | | Non-noise | 0.7442 | 0.8579 | 0.7970 | 4167
SVC | | macro avg | 0.7383 | 0.7133 | 0.7186 | 7016
SVC | | weighted avg | 0.7394 | 0.7405 | 0.7333 | 7016
CNN | 0.7265 | Noise | 0.6687 | 0.6469 | 0.6576 | 2849
CNN | | Non-noise | 0.7638 | 0.7809 | 0.7723 | 4167
CNN | | macro avg | 0.7163 | 0.7139 | 0.7150 | 7016
CNN | | weighted avg | 0.7252 | 0.7265 | 0.7257 | 7016
Naive Bayes | 0.6582 | Noise | 0.6233 | 0.4001 | 0.4874 | 2849
Naive Bayes | | Non-noise | 0.6705 | 0.8347 | 0.7436 | 4167
Naive Bayes | | macro avg | 0.6469 | 0.6174 | 0.6155 | 7016
Naive Bayes | | weighted avg | 0.6513 | 0.6582 | 0.6396 | 7016
Comparison of the Noise Reduction Performance of the Three Models
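The "macro avg" and "weighted avg" rows in the table above follow from the per-class rows. The sketch below recomputes them for the SVC model from the reported per-class precision, recall, F1, and support. The function names and the dictionary layout are hypothetical, and because the inputs are already rounded to four decimals the results can differ from the table in the last digit.

```python
def macro_avg(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean over classes, weighted by the number of samples in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


def summarize(per_class):
    """per_class maps class name -> (precision, recall, f1, support)."""
    rows = list(per_class.values())
    supports = [r[3] for r in rows]
    summary = {"macro": {}, "weighted": {}}
    for pos, metric in enumerate(("precision", "recall", "f1")):
        values = [r[pos] for r in rows]
        summary["macro"][metric] = macro_avg(values)
        summary["weighted"][metric] = weighted_avg(values, supports)
    return summary


# Per-class SVC figures copied from the table above.
svc = {
    "noise": (0.7324, 0.5686, 0.6402, 2849),
    "non_noise": (0.7442, 0.8579, 0.7970, 4167),
}

if __name__ == "__main__":
    for kind, metrics in summarize(svc).items():
        print(kind, {name: round(value, 4) for name, value in metrics.items()})
```

Note that the weighted-average recall equals the overall accuracy, which is why the SVC row reports 0.7405 for both.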
Training Results of Neural Network Model
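After conversion | Before conversion
Transportation | Transportation
Entertainment | Entertainment
Overall | Geographical environment, Population, Overall evaluation, Sentiment, City status, Living experience, Urban development, Land area, Attractiveness, Inclusiveness, Urban planning
Culture | History and culture, Celebrities, Folk customs, Local specialties
Services | Government services, Public convenience services, Medical care level, City appearance, Education
Climate | Climate conditions
Pollution | Pollution
Economy | Economic level, Prices, Housing prices, Income, Employment, Wealth gap
Language | Language characteristics
Food | Food culture
Conversion Relationship of Label Content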
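The label conversion in the table above amounts to a many-to-one lookup from the original labels to ten coarser ones. A minimal Python sketch follows; the label names are English renderings of the Chinese labels, and `COARSE_TO_FINE`, `FINE_TO_COARSE`, and `convert_labels` are hypothetical names, not identifiers from the paper.

```python
# Converted label -> original labels, following the conversion table.
COARSE_TO_FINE = {
    "Transportation": ["Transportation"],
    "Entertainment": ["Entertainment"],
    "Overall": [
        "Geographical environment", "Population", "Overall evaluation",
        "Sentiment", "City status", "Living experience", "Urban development",
        "Land area", "Attractiveness", "Inclusiveness", "Urban planning",
    ],
    "Culture": ["History and culture", "Celebrities", "Folk customs", "Local specialties"],
    "Services": [
        "Government services", "Public convenience services",
        "Medical care level", "City appearance", "Education",
    ],
    "Climate": ["Climate conditions"],
    "Pollution": ["Pollution"],
    "Economy": [
        "Economic level", "Prices", "Housing prices",
        "Income", "Employment", "Wealth gap",
    ],
    "Language": ["Language characteristics"],
    "Food": ["Food culture"],
}

# Inverted lookup: original label -> converted label.
FINE_TO_COARSE = {
    fine: coarse for coarse, fines in COARSE_TO_FINE.items() for fine in fines
}


def convert_labels(fine_labels):
    """Map a text's original labels to its sorted set of converted labels."""
    return sorted({FINE_TO_COARSE[label] for label in fine_labels})


if __name__ == "__main__":
    print(convert_labels(["Housing prices", "Income", "Education"]))
```

Because several original labels collapse into one converted label, a long text that touches both "Housing prices" and "Income" ends up with the single label "Economy".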
Training Results of Convolutional Neural Networks
Training Results of Naive Bayes
Training Results of Support Vector Classification Model
Evaluation with Different Nearest Neighbors
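The abstract names ML-kNN as the multi-label learner and reports Hamming loss, and the caption above refers to an evaluation over different numbers of nearest neighbors. The following is a minimal pure-Python sketch of ML-kNN as described by Zhang and Zhou, plus a Hamming loss function. It is not the authors' code. It assumes numeric feature vectors (how the texts were vectorized is not stated here), Euclidean distance, and a smoothing constant `s = 1`; the class and function names are illustrative.

```python
import math


def _neighbors(X, x, k, skip=None):
    """Indices of the k training points closest to x (Euclidean distance)."""
    candidates = (i for i in range(len(X)) if i != skip)
    return sorted(candidates, key=lambda i: math.dist(X[i], x))[:k]


class MLkNN:
    def __init__(self, k=10, s=1.0):
        self.k, self.s = k, s  # k neighbors, Laplace smoothing s (assumed 1)

    def fit(self, X, Y):
        self.X, self.Y = X, Y
        m, q, k, s = len(X), len(Y[0]), self.k, self.s
        # Smoothed prior probability that each label is present.
        self.prior = [(s + sum(y[l] for y in Y)) / (2 * s + m) for l in range(q)]
        # c1[l][j]: training points WITH label l whose neighbors carry l j times;
        # c0[l][j]: the same count for training points WITHOUT label l.
        c1 = [[0] * (k + 1) for _ in range(q)]
        c0 = [[0] * (k + 1) for _ in range(q)]
        for i in range(m):
            nn = _neighbors(X, X[i], k, skip=i)
            for l in range(q):
                j = sum(Y[n][l] for n in nn)
                (c1 if Y[i][l] else c0)[l][j] += 1
        # Smoothed likelihoods of seeing j label-carrying neighbors.
        self.like1 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c1]
        self.like0 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c0]
        return self

    def predict(self, x):
        nn = _neighbors(self.X, x, self.k)
        labels = []
        for l, p1 in enumerate(self.prior):
            j = sum(self.Y[n][l] for n in nn)
            # MAP rule: assign the label if "present" is the likelier hypothesis.
            labels.append(int(p1 * self.like1[l][j] > (1 - p1) * self.like0[l][j]))
        return labels


def hamming_loss(Y_true, Y_pred):
    """Fraction of (sample, label) positions that are predicted wrongly."""
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (len(Y_true) * len(Y_true[0]))


if __name__ == "__main__":
    X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    Y = [[1, 0]] * 4 + [[0, 1]] * 4
    model = MLkNN(k=2).fit(X, Y)
    print(model.predict((0.1, 0.1)))
```

To compare different neighbor counts, one would refit the model for each candidate `k` and compute `hamming_loss` on held-out texts; lower values mean fewer wrongly assigned or missed labels.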