Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 51-65     https://doi.org/10.11925/infotech.2096-3467.2020.1186
Research Paper
Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost
Cao Rui1, Liao Bin1, Li Min1,2, Sun Ruina1,3,4
1College of Statistics and Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China
2School of Information Science and Engineering, Xinjiang University, Urumqi 830008, China
3Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
4School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] This paper addresses the lack of reasonable pricing suggestions for online short-term rental listings with different characteristics. [Methods] Using real operational data from the Airbnb platform, we propose an XGBoost-based model for predicting prices and analyzing features in the online short-term rental market. Lasso is first applied to the raw data for feature extraction and dimensionality reduction. The selected features are then fed to XGBoost, which is trained iteratively to obtain the best prediction model. Finally, SHAP values are used to interpret the model's features. [Results] After hyperparameter tuning, the proposed model achieved an RMSE of 0.091, an MAE of 0.065 and an R-squared of 0.798, outperforming the four main comparison models. [Limitations] Owing to data source constraints, the training data could not be combined with features from real-time online business data streams, which may weaken the model's ability to adapt in real time. [Conclusions] SHAP improves the interpretability of the model. Combining the feature-importance rankings from XGBoost and RandomForest identifies the key factors affecting listing prices, giving hosts a reference for improving service quality and increasing revenue.

Keywords: Machine Learning; Pricing Model; Online Short-Term Rental; XGBoost Model; SHAP Value
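Received: 2020-11-29      Published: 2021-07-06
CLC number (ZTFLH): TP391
Funding: National Natural Science Foundation of China (61562078); Xinjiang Tianshan Youth Program (2018Q073)
Corresponding author: Liao Bin     E-mail: liaobin665@163.com
Cite this article: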
Cao Rui, Liao Bin, Li Min, Sun Ruina. Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost. Data Analysis and Knowledge Discovery, 2021, 5(6): 51-65.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1186      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I6/51
Fig.1  XGBoost modeling workflow
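Variable                     Description                                      Type
price                        Listing price                                    Numeric
host_is_superhost            Whether the host is a superhost                  Boolean
host_listings_count          Number of listings the host has on Airbnb.com    Numeric
latitude                     Latitude                                         Numeric
longitude                    Longitude                                        Numeric
accommodates                 Number of guests the listing can accommodate     Numeric
bathrooms                    Number of bathrooms in the listing               Numeric
bedrooms                     Number of bedrooms in the listing                Numeric
beds                         Number of beds in the listing                    Numeric
security_deposit             Security deposit                                 Numeric
cleaning_fee                 Cleaning fee                                     Numeric
guests_included              Actual number of guests for the listing          Numeric
minimum_nights               Minimum number of nights required by the host    Numeric
maximum_nights               Maximum number of nights allowed by the host     Numeric
availability_365             Number of days available out of 365              Numeric
number_of_reviews            Number of reviews                                Numeric
review_scores_rating         Overall review rating score                      Numeric
review_scores_accuracy       Accuracy-of-description score                    Numeric
review_scores_cleanliness    Cleanliness score                                Numeric
review_scores_checkin        Check-in score                                   Numeric
review_scores_communication  Communication score                              Numeric
review_scores_location       Location convenience score                       Numeric
review_scores_value          Value-for-money score                            Numeric
instant_bookable             Whether the listing can be booked instantly      Numeric
reviews_per_month            Average number of reviews per month              Numeric
extra_people_fee             Extra fee                                        Numeric
host_response_rate           Host response rate                               Numeric
host_acceptance_rate         Host acceptance rate                             Numeric
amenities                    Amenities                                        String
host_verifications           Host identity verification information           String
cancellation_policy          Cancellation policy                              String
Table 1  Basic feature attributes of the Airbnb data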
Variable                     count       mean     std        min     25%     75%        max
price                        37,048.000  227.916  685.160    0.000   69.000  185.000    25,000.000
host_is_superhost            37,048.000  0.324    0.468      0.000   0.000   1.000      1.000
accommodates                 37,048.000  3.646    2.689      0.000   2.000   4.000      24.000
bathrooms                    37,013.000  1.475    1.014      0.000   1.000   2.000      16.000
bedrooms                     36,924.000  1.444    1.138      0.000   1.000   2.000      13.000
beds                         36,667.000  1.969    1.679      0.000   1.000   2.000      50.000
security_deposit             37,048.000  372.586  2,231.724  0.000   0.000   300.000    250,000.000
cleaning_fee                 37,048.000  83.825   100.025    0.000   20.000  109.000    2,500.000
guests_included              37,048.000  1.917    1.770      1.000   1.000   2.000      24.000
minimum_nights               37,048.000  12.715   26.759     1.000   1.000   30.000     1,125.000
maximum_nights               37,048.000  658.116  525.576    1.000   40.000  1,125.000  10,004.000
availability_365             37,048.000  168.061  142.799    0.000   5.000   336.000    365.000
number_of_reviews            37,048.000  35.201   64.277     0.000   1.000   40.000     822.000
review_scores_rating         28,962.000  94.272   9.110      20.000  93.000  100.000    100.000
review_scores_accuracy       28,914.000  9.610    0.897      2.000   9.000   10.000     10.000
review_scores_cleanliness    28,915.000  9.418    1.011      2.000   9.000   10.000     10.000
review_scores_checkin        28,902.000  9.475    0.786      2.000   10.000  10.000     10.000
review_scores_communication  28,913.000  9.714    0.838      2.000   10.000  10.000     10.000
review_scores_location       28,898.000  9.707    0.730      2.000   9.000   10.000     10.000
review_scores_value          28,894.000  9.429    0.943      2.000   9.000   10.000     10.000
instant_bookable             37,048.000  0.432    0.495      0.000   0.000   1.000      1.000
reviews_per_month            29,413.000  1.605    1.750      0.010   0.300   2.410      17.230
extra_people_fee             37,048.000  0.507    0.499      0.000   0.000   1.000      1.000
host_response_rate           27,937.000  93.513   18.156     0.000   99.000  100.000    100.000
host_acceptance_rate         31,024.000  86.172   23.168     0.000   84.000  100.000    100.000
Table 2  Descriptive statistics of the Airbnb data
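Several count values in Table 2 fall below 37,048, which implies missing values in those columns. The following minimal sketch derives the missing share from a handful of counts copied from the table. The names `non_missing` and `missing_rate` are illustrative, and the sketch assumes 37,048 is the total number of listings.

```python
TOTAL_ROWS = 37048  # assumed total, taken from the fully populated columns (e.g. price)

# Non-missing counts copied from Table 2 (a subset, for illustration).
non_missing = {
    "price": 37048,
    "bathrooms": 37013,
    "beds": 36667,
    "review_scores_rating": 28962,
    "reviews_per_month": 29413,
    "host_response_rate": 27937,
    "host_acceptance_rate": 31024,
}


def missing_rate(count, total=TOTAL_ROWS):
    """Share of rows that have no value for a column."""
    return 1 - count / total


rates = {name: missing_rate(count) for name, count in non_missing.items()}

# Print columns from most to least incomplete.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {rate:6.1%}")
```

Among these columns, host_response_rate has the largest gap: 9,111 of 37,048 rows, roughly a quarter.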
Fig.2  Distribution of listing prices
Fig.3  Heatmap of selected variables and the target variable (price)
Fig.4  Missing data overview
Fig.5  Lasso feature selection
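Lasso selects features because its L1 penalty drives weak coefficients to exactly zero. The sketch below shows this with a small coordinate-descent solver on toy data. It is not the authors' implementation: the data, the `alpha` value, and the function names are assumptions, and in practice a library solver such as scikit-learn's `Lasso` would normally be used instead.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return exactly 0.0 inside [-alpha, alpha]."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual computed without feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Toy, centred data (hypothetical): y = 3 * x0 + 0.05 * x1, and x2 is irrelevant.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [3.05, -2.95, 2.95, -3.05]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print(weights, selected)
```

With this penalty only feature 0 keeps a non-zero weight, so it is the only feature that would be passed on to the downstream model.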
Algorithm          Parameter configuration
XGBoost            n_estimators=300, learning_rate=0.08, gamma=0, subsample=0.75, colsample_bytree=1, max_depth=7, tree_method='approx'
LinearRegression   normalize=False
Neural Network     hidden_layer_sizes=(100,), activation='relu', solver='adam'
DecisionTree       criterion='mse', min_samples_split=2
KNN                weights='uniform'
RandomForest       n_estimators=300, criterion='mse', max_depth=7
LightGBM           objective='regression', n_estimators=300
SVR                kernel='linear', gamma=0.1
ExtraTrees         criterion='mse', min_samples_split=2
AdaBoost           n_estimators=300, random_state=0
GBR                n_estimators=300, learning_rate=0.08
Table 3  Core hyperparameter configuration of the algorithms
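To make the roles of `n_estimators` and `learning_rate` in Table 3 concrete, here is a minimal sketch of plain gradient boosting with one-split trees (stumps) under squared error. It is a simplification rather than XGBoost itself, which adds a regularised objective, second-order gradient information, and row and column subsampling. The toy data and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators, learning_rate):
    """Additive model: mean of y plus shrunken stumps fitted to successive residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy data with a single feature, not the Airbnb data.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 3.1, 3.0, 5.2, 5.0]

for n in (0, 10, 300):
    model = boost(x, y, n_estimators=n, learning_rate=0.08)
    print(n, round(mse(model, x, y), 4))
```

A small learning rate means each tree corrects only a fraction of the remaining error, so more trees are needed. This trade-off is why the two parameters are usually tuned together.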
Algorithm      RMSE   MAE    R-squared
XGBoost        0.092  0.066  0.793
RandomForest   0.110  0.083  0.702
LightGBM       0.092  0.067  0.790
SVR            0.116  0.087  0.669
ExtraTrees     0.096  0.069  0.773
Table 4  Prediction performance of the proposed method compared with existing methods
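The three metrics reported in these tables can be computed as in the following sketch, which uses only the standard library. The sample values are hypothetical and are not taken from the paper's data. Lower RMSE and MAE are better, while a higher R-squared is better.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]

print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```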
Algorithm          RMSE   MAE    R-squared
XGBoost            0.092  0.066  0.793
LinearRegression   0.115  0.087  0.672
Neural Network     0.114  0.084  0.680
DecisionTree       0.137  0.097  0.535
KNN                0.120  0.088  0.646
AdaBoost           0.129  0.100  0.590
GBR                0.098  0.072  0.765
Table 5  Prediction evaluation metrics by algorithm
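Parameter         Category                 Meaning                                                         Search space                           Tuned value
learning_rate     Booster parameter        Shrinkage step size used in each update                         [0.07, 0.075, 0.08, 0.085, 0.09]       0.075
n_estimators      Learning task parameter  Number of weak learners                                         [450, 500, 550, 600, 650]              650
max_depth         Booster parameter        Maximum tree depth                                              [6-10]                                 8
subsample         Booster parameter        Fraction of rows sampled at random for each tree                [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9] 0.900
colsample_bytree  Booster parameter        Fraction of features sampled at random when building each tree  [0.8, 0.85, 0.9, 0.95, 1]              0.850
Table 6  XGBoost parameter tuning results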
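The search spaces in Table 6 can be enumerated as a grid, as sketched below. This page does not say whether the parameters were tuned jointly or one at a time, so the joint grid here is an assumption, as is reading "[6-10]" as the integers 6 to 10. `toy_score` is a hypothetical placeholder: a real run would train a model for each configuration and return a validation error such as RMSE.

```python
from itertools import product

# Search spaces copied from Table 6.
search_space = {
    "learning_rate": [0.07, 0.075, 0.08, 0.085, 0.09],
    "n_estimators": [450, 500, 550, 600, 650],
    "max_depth": [6, 7, 8, 9, 10],
    "subsample": [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9],
    "colsample_bytree": [0.8, 0.85, 0.9, 0.95, 1],
}

# Tuned values reported in Table 6.
tuned = {
    "learning_rate": 0.075,
    "n_estimators": 650,
    "max_depth": 8,
    "subsample": 0.9,
    "colsample_bytree": 0.85,
}


def iter_grid(space):
    """Yield every combination of parameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(space, score):
    """Return the configuration with the lowest score."""
    return min(iter_grid(space), key=score)


def toy_score(params):
    """Placeholder score: number of parameters that differ from the reported optimum."""
    return sum(params[name] != tuned[name] for name in tuned)


n_combinations = sum(1 for _ in iter_grid(search_space))
best = grid_search(search_space, toy_score)
print(n_combinations, best)
```

A joint grid over these spaces contains 5 × 5 × 5 × 7 × 5 = 4,375 configurations, which shows why the cost of such a search grows quickly with each added parameter.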
Fig.6  Learning curves of XGBoost compared with the other models
Fig.7  SHAP feature analysis
Fig.8  SHAP feature dependence analysis
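SHAP values are Shapley values: each feature's contribution is its average marginal effect over all orders in which features can be added. The brute-force sketch below computes them exactly for a two-feature toy model. Several parts are assumptions made for illustration: the model, the instance, and the choice to represent "absent" features by a single baseline listing. The SHAP library uses far more efficient algorithms for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    names = list(x)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance's value, the rest the baseline's.
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


def toy_price_model(f):
    """Hypothetical price model with an interaction term."""
    return (100 + 50 * f["entire_home"] + 20 * f["bedrooms"]
            + 10 * f["entire_home"] * f["bedrooms"])


x = {"entire_home": 1, "bedrooms": 2}          # listing being explained
baseline = {"entire_home": 0, "bedrooms": 1}   # reference listing

phi = shapley_values(toy_price_model, x, baseline)
print(phi, toy_price_model(x) - toy_price_model(baseline))
```

The contributions sum to the difference between the prediction for the listing and the prediction for the baseline. This additivity is what makes SHAP summary and dependence plots interpretable.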
Rank  XGBoost feature               Value  RandomForest feature          Value  SHAP feature               Value
1     room_type_Entire home/apt     0.330  room_type_Entire home/apt     0.507  room_type_Entire home/apt  0.058
2     bedrooms                      0.093  bathrooms                     0.233  accommodates               0.034
3     room_type_Shared room         0.093  room_type_Shared room         0.050  longitude                  0.029
4     property_type_Boutique hotel  0.049  longitude                     0.041  bedrooms                   0.026
5     bathrooms                     0.047  cleaning_fee                  0.036  bathrooms                  0.019
6     room_type_Private room        0.040  accommodates                  0.029  cleaning_fee               0.018
7     accommodates                  0.039  host_listings_count           0.020  latitude                   0.017
8     room_type_Hotel room          0.019  property_type_Boutique hotel  0.018  minimum_nights             0.016
9     property_type_villa           0.018  bedrooms                      0.016  availability_365           0.010
10    property_type_Campsite        0.014  latitude                      0.011  room_type_Shared room      0.008
Table 7  Comparison of feature importance from XGBoost, RandomForest, and SHAP
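One simple way to read Table 7 is to keep the features that appear in all three top-10 lists and order them by mean rank. The sketch below does that with the rankings copied from the table. The aggregation rule is an assumption made for illustration; this page does not state how the authors combined the rankings.

```python
# Top-10 features per method, in rank order, copied from Table 7.
top10 = {
    "XGBoost": [
        "room_type_Entire home/apt", "bedrooms", "room_type_Shared room",
        "property_type_Boutique hotel", "bathrooms", "room_type_Private room",
        "accommodates", "room_type_Hotel room", "property_type_villa",
        "property_type_Campsite",
    ],
    "RandomForest": [
        "room_type_Entire home/apt", "bathrooms", "room_type_Shared room",
        "longitude", "cleaning_fee", "accommodates", "host_listings_count",
        "property_type_Boutique hotel", "bedrooms", "latitude",
    ],
    "SHAP": [
        "room_type_Entire home/apt", "accommodates", "longitude", "bedrooms",
        "bathrooms", "cleaning_fee", "latitude", "minimum_nights",
        "availability_365", "room_type_Shared room",
    ],
}

# Rank of each feature within each method (1 = most important).
ranks = {
    method: {feature: position for position, feature in enumerate(features, start=1)}
    for method, features in top10.items()
}

# Features that every method places in its top 10.
common = set.intersection(*(set(features) for features in top10.values()))

mean_rank = {
    feature: sum(method_ranks[feature] for method_ranks in ranks.values()) / len(ranks)
    for feature in common
}

# Order by mean rank, breaking ties alphabetically for a stable result.
consensus = sorted(common, key=lambda feature: (mean_rank[feature], feature))
for feature in consensus:
    print(f"{feature:<28} {mean_rank[feature]:.2f}")
```

Under this rule, five features are shared by all three lists, and room_type_Entire home/apt ranks first in every method.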