Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (3): 1-8    DOI: 10.11925/infotech.2096-3467.2017.0849
Identifying Potential Customers Based on User-Generated Contents
Jiang Cuiqing, Song Kailun, Ding Yong, Liu Yao
School of Management, Hefei University of Technology, Hefei 230009, China
[Objective] This paper aims to identify potential customers by analyzing user-generated content from product-specific online forums. [Methods] First, we converted the imbalanced dataset into multiple balanced subsets. Then, we used the Stacking classification algorithm to construct the identification model. Finally, we compared the results of the proposed method with those of five baseline algorithms. [Results] Compared with BayesNet, Logistic, C4.5, SMO and Naive Bayes, the F-measure of our method was higher by 17.4, 26.5, 24.1, 29.3 and 40.9 percentage points, respectively. Compared with the Stacking, Bagging and Boosting ensemble methods, it was higher by 10.1, 5.9 and 13.1 percentage points, respectively. [Limitations] We examined the performance of the proposed method only on data from the automotive industry. [Conclusions] The proposed method can effectively identify potential customers based on user-generated content.
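The first step in [Methods], turning one imbalanced dataset into several balanced subsets, can be sketched as follows. The abstract does not say how the subsets are built, so this is only one plausible reading: every subset reuses all minority-class samples and pairs them with a disjoint, equally sized slice of the shuffled majority class. The function name `make_balanced_subsets`, the 0/1 label encoding and the `seed` parameter are hypothetical choices for illustration.

```python
import random


def make_balanced_subsets(samples, labels, seed=0):
    """Split an imbalanced binary dataset into balanced subsets.

    Assumption (not stated in the source): each subset contains all
    minority-class samples plus a disjoint slice of the majority class
    with the same number of samples.
    """
    by_class = {0: [], 1: []}
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    minority_label = min(by_class, key=lambda c: len(by_class[c]))
    majority_label = 1 - minority_label
    minority = by_class[minority_label]
    majority = list(by_class[majority_label])
    if not minority:
        raise ValueError("both classes must be present")

    # Shuffle so that each slice is a random sample of the majority class.
    random.Random(seed).shuffle(majority)

    size = len(minority)
    subsets = []
    # Majority samples that cannot fill a whole slice are dropped.
    for start in range(0, len(majority) - size + 1, size):
        chunk = majority[start:start + size]
        subset = [(s, minority_label) for s in minority]
        subset += [(s, majority_label) for s in chunk]
        subsets.append(subset)
    return subsets


# Toy data: 2 potential customers (label 1) among 14 forum users.
users = [f"user{i}" for i in range(14)]
flags = [1, 1] + [0] * 12
balanced = make_balanced_subsets(users, flags)
```

In a Stacking setup, each balanced subset would typically train one base classifier whose outputs feed a meta-classifier; that part is left out here because the abstract gives no detail on it.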

Key words: User-Generated Content; Potential Customer Identification; Stacking Classification Algorithm; Imbalanced Datasets
Received: 22 August 2017      Published: 03 April 2018
ZTFLH:  C931  

Cite this article:

Jiang Cuiqing, Song Kailun, Ding Yong, Liu Yao. Identifying Potential Customers Based on User-Generated Contents. Data Analysis and Knowledge Discovery, 2018, 2(3): 1-8.

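Feature category | No. | Description | Notes
Demographic features | F1-F14 | Whether the user belongs to a given region | 1 if yes, 0 if no
 | F15 | How long the user has been registered | Time elapsed from registration to the present
 | F16 | Number of followers the user has on the forum |
 | F17 | Number of accounts the user follows on the forum |
 | F18 | Number of the user's posts marked as featured on the forum |
Stylistic features | F19 | Total number of characters in the comment |
 | F20-F26 | Time words, verbs, adjectives, adverbs, ... in the comment |
 | F27-F29 | Frequency of periods, question marks and exclamation marks in the comment | Consistent with the Chinese part-of-speech tag set of the NLPIR Chinese word segmentation package [30]
Sentiment features | F30 | Whether the sentiment of the comment is positive | Based on the Chinese sentiment polarity dictionary NTUSD [23]; 1 if yes, 0 if no
 | F31 | Whether the sentiment of the comment is negative | Based on the Chinese sentiment polarity dictionary NTUSD [23]; 1 if yes, 0 if no
Behavioral features | F32 | Whether the user is a verified owner of a given car model | 1 if yes, 0 if no
 | F33 | Whether the user follows a given car model | 1 if yes, 0 if no
 | F34 | Whether the user belongs to a group for a given car model | 1 if yes, 0 if no
 | F35 | Total number of comments by the user |
 | F36 | Total number of posts by the user |
 | F37 | User reply duration | Time difference between registration time and reply time
Keyword features | F38-F508 | Term frequency of each keyword |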
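As an illustration of the simplest rows in the feature table, the sketch below computes F19 and F27-F29 for one comment string. Several details are assumptions rather than the authors' definitions: whitespace is excluded from the character count, "frequency" is taken as the punctuation count divided by that total, and F27, F28 and F29 are mapped to periods, question marks and exclamation marks in the order the table lists them. The function name `stylistic_features` is hypothetical.

```python
def stylistic_features(comment):
    """Compute F19 and F27-F29 style features for one comment.

    Assumptions (not confirmed by the source): whitespace is ignored,
    and frequency means count divided by the total character count.
    """
    chars = [ch for ch in comment if not ch.isspace()]
    total = len(chars)

    # Full-width marks, plus ASCII variants for question and exclamation.
    marks = {
        "F27_period_freq": "。",
        "F28_question_freq": "？?",
        "F29_exclamation_freq": "！!",
    }

    features = {"F19_total_chars": total}
    for name, variants in marks.items():
        count = sum(1 for ch in chars if ch in variants)
        features[name] = count / total if total else 0.0
    return features


# Toy comment: one question, one exclamation and one full stop.
example = stylistic_features("这车油耗高吗？求推荐！谢谢。")
```

The part-of-speech features (F20-F26) and sentiment features (F30-F31) depend on the NLPIR tagger and the NTUSD dictionary, so they are not reproduced here.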
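Algorithm | Precision | Recall | F-measure
Proposed method | 72.2% | 70.3% | 71.2%
BayesNet | 67.8% | 44.5% | 53.8%
Logistic regression | 76.0% | 31.7% | 44.7%
Decision tree (C4.5) | 55.3% | 41.0% | 47.1%
SMO | 82.6% | 28.1% | 41.9%
Naive Bayes | 18.9% | 76.2% | 30.3%

Algorithm | Precision | Recall | F-measure
Proposed method | 72.2% | 70.3% | 71.2%
Stacking ensemble learning | 57.8% | 64.9% | 61.1%
Bagging ensemble learning | 65.8% | 64.9% | 65.3%
Boosting ensemble learning | 55.6% | 60.8% | 58.1%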
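The third metric column in the two tables above matches the harmonic mean of the first two, which is the standard F-measure of precision and recall. The sketch below recomputes it from the tabulated values and derives the gaps quoted in the abstract as differences between reported F-measures; the dictionary keys are English labels chosen here for readability.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the two tables.
results = {
    "Proposed method": (72.2, 70.3, 71.2),
    "BayesNet": (67.8, 44.5, 53.8),
    "Logistic regression": (76.0, 31.7, 44.7),
    "C4.5": (55.3, 41.0, 47.1),
    "SMO": (82.6, 28.1, 41.9),
    "Naive Bayes": (18.9, 76.2, 30.3),
    "Stacking": (57.8, 64.9, 61.1),
    "Bagging": (65.8, 64.9, 65.3),
    "Boosting": (55.6, 60.8, 58.1),
}

# Recomputed F-measure for every row.
recomputed = {name: f_measure(p, r) for name, (p, r, _) in results.items()}

# Gap in percentage points between the proposed method and each baseline.
ours = results["Proposed method"][2]
gains = {
    name: round(ours - reported, 1)
    for name, (_, _, reported) in results.items()
    if name != "Proposed method"
}
```

Each recomputed value agrees with the reported one to within 0.1, the size of the rounding in the tables. The gaps are absolute differences in percentage points, not relative improvements.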
