
## 基于文本分类的政府网站信箱自动转递方法研究*

1南京大学信息管理学院 南京 210023

2南京大学政务数据资源研究所 南京 210023

## Automatic Transferring Government Website E-Mails Based on Text Classification

Wang Sidi1,2, Hu Guangwei1,2, Yang Siyu1,2, Shi Yun1

1School of Information Management, Nanjing University, Nanjing 210023, China

2Government Data Resources Institution of Nanjing University, Nanjing 210023, China

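Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "Empirical Research on the Value Co-creation Mechanism and Implementation Model of E-Government Services" (No. 71573117).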

【目的】 为改善政府网站领导信箱传统人工转递方式存在的人力、时间成本较高以及工作人员负担较重等问题,研究网站来信的自动转递方法。【方法】 选择较有代表性的分类算法,包括朴素贝叶斯、决策树、随机森林以及多层神经网络,对北京、合肥和深圳的市长信箱文本数据进行对比实验,进而设计一套基于文本分类的政府网站信箱自动转递方法,并给出相应的应用建议。【结果】 神经网络算法在市长信箱文本的分类表现最优,宏平均精确度和召回率均达0.85以上,且所有微平均指标均达0.93以上;朴素贝叶斯算法次之;随机森林算法的宏平均精确度很高,但召回率较差;决策树算法的精确度和召回率都较一般。【局限】 未能兼顾来信数量不均衡对结果的影响,且实验时剔除了数据量过小的部门的来信数据,这在实际应用中可能会存在一定偏差。【结论】 本文设计的政府网站信箱自动转递方法能够优化领导信箱运作机制,对提升线上政民互动效率,降低人力及行政成本具有积极意义。

Abstract

[Objective] This research proposes a method to automatically transfer e-mails received by government websites, aiming to reduce the labor costs of managing public e-mail boxes. [Methods] First, we chose four representative classification algorithms, namely Naïve Bayes, Decision Tree, Random Forest and Multi-Layer Perceptron, and compared their classification results on e-mails received by the websites of the Mayor’s Offices in Beijing, Hefei and Shenzhen. Then, we designed a method for automatically transferring these e-mails. Finally, we gave suggestions on applying our method in real-world settings. [Results] Multi-Layer Perceptron yielded the best performance in our study, with macro-average precision and recall both above 0.85 and all micro-average indicators above 0.93. Naïve Bayes took second place. Random Forest had a high macro-average precision but poor recall. Decision Tree achieved only mediocre precision and recall. [Limitations] We did not examine the impact of the skewed distribution of received e-mails, and we excluded departments that received few e-mails, which may introduce some bias in practical applications. [Conclusions] The proposed method optimizes the operation of public e-mail boxes, which improves the efficiency of online government-citizen interaction and reduces labor and administrative costs.

Keywords: Leader’s Mailbox; Automatic Transfer; Text Classification; Multi-Layer Perceptron; Process Optimization

Wang Sidi. Automatic Transferring Government Website E-Mails Based on Text Classification. Data Analysis and Knowledge Discovery[J], 2020, 4(6): 51-59 doi:10.11925/infotech.2096-3467.2019.1182

(1) Feature selection and text representation methods

(2) Classification algorithms

## 4 Experiments and Result Analysis

Table 1  Dataset

### Figure 1

Fig.1   Experimental Procedure

### 4.3 Experimental Results and Analysis

(1) Evaluation metrics for classification performance

$Precision=\frac{TP}{TP+FP}$
$Recall=\frac{TP}{TP+FN}$
$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$

$FPR=\frac{FP}{TN+FP}$
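The four metrics above can be checked with a short calculation. The following is a minimal sketch in Python, not the authors' code: the function name `binary_metrics` and the example counts are assumptions made for illustration, with one department treated as the positive class and all others as negative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and false positive rate for one class.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives. Zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (tn + fp) if (tn + fp) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts for a single department (not from the paper's data).
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=890)
```

With these hypothetical counts, precision is 80/100 and recall is 80/90, so F1 falls between the two, while the false positive rate stays low because true negatives dominate.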

(2) Analysis of overall classification results

Table 2  Classification Performance

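| Algorithm | Metric | Macro-average (Beijing) | Macro-average (Hefei) | Macro-average (Shenzhen) | Micro-average (Beijing) | Micro-average (Hefei) | Micro-average (Shenzhen) |
|---|---|---|---|---|---|---|---|
| NB | Precision | 0.9085 | 0.8762 | 0.8470 | 0.9514 | 0.8985 | 0.9228 |
| NB | Recall | 0.9048 | 0.8368 | 0.8260 | 0.9514 | 0.8985 | 0.9228 |
| NB | F1 | 0.9035 | 0.8527 | 0.8323 | 0.9514 | 0.8985 | 0.9228 |
| NB | AUC | 0.9952 | 0.9890 | 0.9852 | 0.9967 | 0.9946 | 0.9941 |
| DT | Precision | 0.8227 | 0.7222 | 0.7383 | 0.9052 | 0.8386 | 0.8697 |
| DT | Recall | 0.8037 | 0.7045 | 0.7017 | 0.9052 | 0.8386 | 0.8697 |
| DT | F1 | 0.8103 | 0.7112 | 0.7163 | 0.9052 | 0.8386 | 0.8697 |
| DT | AUC | 0.8985 | 0.8490 | 0.8487 | 0.9494 | 0.9162 | 0.9328 |
| RF | Precision | 0.9621 | 0.9484 | 0.9204 | 0.9393 | 0.8590 | 0.9104 |
| RF | Recall | 0.7844 | 0.5880 | 0.6755 | 0.9393 | 0.8590 | 0.9104 |
| RF | F1 | 0.8396 | 0.6659 | 0.7463 | 0.9393 | 0.8590 | 0.9104 |
| RF | AUC | 0.9975 | 0.9886 | 0.9912 | 0.9969 | 0.9918 | 0.9958 |
| MLP | Precision | 0.9367 | 0.9133 | 0.8828 | 0.9650 | 0.9347 | 0.9440 |
| MLP | Recall | 0.9184 | 0.8893 | 0.8574 | 0.9650 | 0.9347 | 0.9440 |
| MLP | F1 | 0.9256 | 0.8999 | 0.8679 | 0.9650 | 0.9347 | 0.9440 |
| MLP | AUC | 0.9990 | 0.9950 | 0.9940 | 0.9995 | 0.9970 | 0.9975 |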
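In Table 2, three of the six value columns show identical precision, recall and F1 within each algorithm. That is the expected behavior of micro-averaging when every letter has exactly one true department and one predicted department. The sketch below illustrates the difference between macro- and micro-averaging on toy labels. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `macro_micro` and the label lists are hypothetical.

```python
from collections import Counter


def macro_micro(y_true, y_pred):
    """Macro- and micro-averaged precision/recall for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted department gains a false positive
            fn[truth] += 1  # the true department gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    # Macro: average per-class scores, so a small class counts as much as a large one.
    macro_p = sum(ratio(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    macro_r = sum(ratio(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)

    # Micro: pool the counts over all classes before dividing.
    total_tp = sum(tp.values())
    micro_p = ratio(total_tp, total_tp + sum(fp.values()))
    micro_r = ratio(total_tp, total_tp + sum(fn.values()))
    return {"macro_p": macro_p, "macro_r": macro_r,
            "micro_p": micro_p, "micro_r": micro_r}


# Hypothetical labels: department "A" is large, department "B" is small.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
scores = macro_micro(y_true, y_pred)
```

Each misclassified letter adds one false positive and one false negative, so the pooled totals match and micro precision equals micro recall. The macro scores diverge because the single error halves the recall of the small department.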

### Figure 2

Fig.2   ROC Curve of Four Algorithms

(3) Analysis of classification results by department, taking Beijing as an example

### Figure 3

Fig.3   Classification Result of Four Algorithms

Table 3  Correlation Analysis Between the Number of Samples and Classification Result

| | Precision | Recall | F1 |
|---|---|---|---|
| Precision | 1.0000 | 0.6010** | 0.8197*** |
| Recall | | 1.0000 | 0.9501*** |
| F1 | | | 1.0000 |

(Note: *** indicates P<0.01 (two-tailed); ** indicates P<0.05 (two-tailed); * indicates P<0.1 (two-tailed).)
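A correlation coefficient of the kind reported in Table 3 can be computed as sketched below. The sketch assumes the Pearson coefficient, which is an assumption because the text here does not name the coefficient used. The function name `pearson_r` and the per-department values are hypothetical and are not taken from the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-department sample counts and F1 scores (illustration only).
sample_counts = [120, 300, 560, 900, 1500]
f1_scores = [0.62, 0.71, 0.80, 0.86, 0.93]
r = pearson_r(sample_counts, f1_scores)
```

A value of `r` close to 1 would indicate that departments with more letters tend to score higher. Judging significance, as the asterisks in the table do, additionally requires a p-value from a statistics library.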

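## 5 Automatic Transfer Method for Government Website Mailboxes

### Figure 4

Fig.4   Automatic Transfer Process of the Mailbox on Government Website

### 5.2 Recommendations for Applying the Automatic Transfer Method

(1) For departments with few historical letters, department-specific feature words can be built from the department's statement of responsibilities. The experiments show a correlation between the number of letter samples a department has and its classification performance: departments with more samples tend to be classified more accurately, because the algorithm has more text features to learn from. For departments with too few letters, feature words can be extracted from their responsibility descriptions to enrich and reinforce the small-sample features.

(2) For letters that concern more than one department, a classification probability threshold can be set so that the letter is forwarded to several departments. The experiments show that some departments are classified poorly because their responsibilities overlap with those of other departments, which raises the misclassification rate. In practice, some letters are also answered by several departments. Therefore, before an automatic transfer system for the leader's mailbox is deployed, the responsibilities of each department should be reviewed to clarify the boundaries of authority and accountability. In addition, because a classification algorithm can output the probability that a sample belongs to each category, the system can set a threshold: when the difference between the probabilities of two or more categories for an incoming letter is too small, the letter is forwarded to all of those departments, reducing the chance of misrouting.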
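The threshold rule in recommendation (2) can be sketched as follows. This is one possible reading under stated assumptions, not the authors' implementation: the function name `route_letter`, the `margin` parameter with its default of 0.1, and the department names are all hypothetical. The sketch forwards a letter to every department whose predicted probability lies within `margin` of the highest one.

```python
def route_letter(probabilities, margin=0.1):
    """Return the departments a letter should be forwarded to.

    probabilities: mapping of department name -> predicted class probability.
    margin: hypothetical threshold; a department is included when its
            probability is within this distance of the top probability.
    """
    if not probabilities:
        return []
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top_prob = ranked[0][1]
    return [dept for dept, prob in ranked if top_prob - prob <= margin]


# Hypothetical classifier outputs for two incoming letters.
clear_case = route_letter({"Transport": 0.82, "Housing": 0.10, "Education": 0.08})
ambiguous_case = route_letter({"Transport": 0.46, "Housing": 0.41, "Education": 0.13})
```

In the first call only one department is returned because the top probability is far ahead. In the second, the two leading departments differ by about 0.05, so both receive the letter. A suitable margin would need to be tuned on held-out letters, since a wide margin sends more letters to multiple departments and increases their workload.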

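## 6 Conclusion

(1) The neural network algorithm is the best suited to automatically classifying letters sent to the leader's mailbox, with micro-average precision and recall both above 0.9. This high accuracy shows that using machine learning algorithms to automatically identify the department responsible for a letter is feasible.

(2) Departments with less data are classified less accurately than departments with more data. This indicates that classification accuracy improves as the sample size grows.

(3) One of the main reasons for the low classification accuracy between certain departments is that their responsibilities overlap to some extent.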

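## Supporting Data:

[1] Wang Sidi. 北京市长信箱数据.xlsx. Raw data from the Beijing mayor's mailbox.

[2] Wang Sidi. 合肥市长信箱数据.xlsx. Raw data from the Hefei mayor's mailbox.

[3] Wang Sidi. 深圳市长信箱数据.xlsx. Raw data from the Shenzhen mayor's mailbox.

[4] Wang Sidi. beijing.xlsx. Preprocessed data from the Beijing mayor's mailbox.

[5] Wang Sidi. hefei.xlsx. Preprocessed data from the Hefei mayor's mailbox.

[6] Wang Sidi. shenzhen.xlsx. Preprocessed data from the Shenzhen mayor's mailbox.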

