Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 51-65    DOI: 10.11925/infotech.2096-3467.2020.1186
Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost
Cao Rui1, Liao Bin1, Li Min1,2, Sun Ruina1,3,4
1College of Statistics and Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China
2School of Information Science and Engineering, Xinjiang University, Urumqi 830008, China
3Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
4School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

[Objective] This paper proposes an XGBoost-based model for predicting prices and analyzing features of online short-term rentals, addressing the lack of a reasonable pricing suggestion mechanism for listings with different characteristics. [Methods] We collected data from the Airbnb platform and used Lasso to select features from the raw data and reduce its dimensionality. We then fed the selected features to XGBoost and iteratively trained the prediction model. Finally, we used SHAP values to interpret the model's features. [Results] After hyperparameter tuning, the RMSE, MAE and R-squared values of the proposed model were 0.091, 0.065 and 0.798 respectively, which were better than those of the four existing models. [Limitations] The model does not incorporate real-time online business data, which limits its prediction accuracy. [Conclusions] The proposed model has good interpretability and can identify the key factors affecting listing prices, which helps landlords improve their services.

Key words: Machine Learning; Pricing Model; Online Short-Term Rental; XGBoost Model; SHAP Value
Received: 29 November 2020      Published: 06 July 2021
ZTFLH:  TP391  
Fund: National Natural Science Foundation of China (61562078); Tianshan Youth Program of Xinjiang (2018Q073)
Corresponding Authors: Liao Bin     E-mail: liaobin665@163.com

Cite this article:

Cao Rui, Liao Bin, Li Min, Sun Ruina. Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost. Data Analysis and Knowledge Discovery, 2021, 5(6): 51-65.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1186     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I6/51

The Process of XGBoost Modeling
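Variable | Description | Type
price | Listing price | Numeric
host_is_superhost | Whether the host is a superhost | Boolean
host_listings_count | Number of listings the host has on Airbnb.com | Numeric
latitude | Latitude | Numeric
longitude | Longitude | Numeric
accommodates | Number of guests the listing can accommodate | Numeric
bathrooms | Number of bathrooms in the listing | Numeric
bedrooms | Number of bedrooms in the listing | Numeric
beds | Number of beds in the listing | Numeric
security_deposit | Security deposit | Numeric
cleaning_fee | Cleaning fee | Numeric
guests_included | Number of guests actually staying in the listing | Numeric
minimum_nights | Minimum number of nights required by the host | Numeric
maximum_nights | Maximum number of nights allowed by the host | Numeric
availability_365 | Number of days available out of 365 | Numeric
number_of_reviews | Number of reviews | Numeric
review_scores_rating | Overall review rating score | Numeric
review_scores_accuracy | Accuracy-of-description score | Numeric
review_scores_cleanliness | Cleanliness score | Numeric
review_scores_checkin | Smooth check-in score | Numeric
review_scores_communication | Communication score | Numeric
review_scores_location | Location convenience score | Numeric
review_scores_value | Value-for-money score | Numeric
instant_bookable | Whether instant booking is available | Numeric
reviews_per_month | Average number of reviews per month | Numeric
extra_people_fee | Extra fee | Numeric
host_response_rate | Host response rate | Numeric
host_acceptance_rate | Host acceptance rate | Numeric
amenities | Amenities | String
host_verifications | Host identity verifications | String
cancellation_policy | Cancellation policy | String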
Basic Characteristics of Airbnb Data
Variable | count | mean | std | min | 25% | 75% | max
price | 37 048.000 | 227.916 | 685.160 | 0.000 | 69.000 | 185.000 | 25 000.000
host_is_superhost | 37 048.000 | 0.324 | 0.468 | 0.000 | 0.000 | 1.000 | 1.000
accommodates | 37 048.000 | 3.646 | 2.689 | 0.000 | 2.000 | 4.000 | 24.000
bathrooms | 37 013.000 | 1.475 | 1.014 | 0.000 | 1.000 | 2.000 | 16.000
bedrooms | 36 924.000 | 1.444 | 1.138 | 0.000 | 1.000 | 2.000 | 13.000
beds | 36 667.000 | 1.969 | 1.679 | 0.000 | 1.000 | 2.000 | 50.000
security_deposit | 37 048.000 | 372.586 | 2 231.724 | 0.000 | 0.000 | 300.000 | 250 000.000
cleaning_fee | 37 048.000 | 83.825 | 100.025 | 0.000 | 20.000 | 109.000 | 2 500.000
guests_included | 37 048.000 | 1.917 | 1.770 | 1.000 | 1.000 | 2.000 | 24.000
minimum_nights | 37 048.000 | 12.715 | 26.759 | 1.000 | 1.000 | 30.000 | 1 125.000
maximum_nights | 37 048.000 | 658.116 | 525.576 | 1.000 | 40.000 | 1 125.000 | 10 004.000
availability_365 | 37 048.000 | 168.061 | 142.799 | 0.000 | 5.000 | 336.000 | 365.000
number_of_reviews | 37 048.000 | 35.201 | 64.277 | 0.000 | 1.000 | 40.000 | 822.000
review_scores_rating | 28 962.000 | 94.272 | 9.110 | 20.000 | 93.000 | 100.000 | 100.000
review_scores_accuracy | 28 914.000 | 9.610 | 0.897 | 2.000 | 9.000 | 10.000 | 10.000
review_scores_cleanliness | 28 915.000 | 9.418 | 1.011 | 2.000 | 9.000 | 10.000 | 10.000
review_scores_checkin | 28 902.000 | 9.475 | 0.786 | 2.000 | 10.000 | 10.000 | 10.000
review_scores_communication | 28 913.000 | 9.714 | 0.838 | 2.000 | 10.000 | 10.000 | 10.000
review_scores_location | 28 898.000 | 9.707 | 0.730 | 2.000 | 9.000 | 10.000 | 10.000
review_scores_value | 28 894.000 | 9.429 | 0.943 | 2.000 | 9.000 | 10.000 | 10.000
instant_bookable | 37 048.000 | 0.432 | 0.495 | 0.000 | 0.000 | 1.000 | 1.000
reviews_per_month | 29 413.000 | 1.605 | 1.750 | 0.010 | 0.300 | 2.410 | 17.230
extra_people_fee | 37 048.000 | 0.507 | 0.499 | 0.000 | 0.000 | 1.000 | 1.000
host_response_rate | 27 937.000 | 93.513 | 18.156 | 0.000 | 99.000 | 100.000 | 100.000
host_acceptance_rate | 31 024.000 | 86.172 | 23.168 | 0.000 | 84.000 | 100.000 | 100.000
Descriptive Statistics of Airbnb Data
The Distribution of Listing Price
Heatmap of Partial Variables and Target Variable
Missing Variables
Lasso Feature Selection
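The abstract states that Lasso was used to select features before training. As a rough illustration of why Lasso can act as a feature selector, the sketch below runs coordinate descent on the objective (1/2n)·||y − Xw||² + λ·||w||₁ and keeps only the features with non-zero coefficients. The toy data, the value of `lam`, and all function names are hypothetical and not taken from the paper; a real pipeline would more likely rely on an existing library implementation.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical toy data: three mutually orthogonal features.
# y depends strongly on feature 0, very weakly on feature 1, not at all on feature 2.
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
y = [3.0 * row[0] + 0.05 * row[1] for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print("coefficients:", weights)
print("selected feature indices:", selected)
```

The L1 penalty drives the weak and irrelevant coefficients exactly to zero, which is the property that makes Lasso usable for dimensionality reduction; the surviving coefficient is also shrunk by roughly `lam`.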
Algorithm | Parameter configuration
XGBoost | n_estimators=300, learning_rate=0.08, gamma=0, subsample=0.75, colsample_bytree=1, max_depth=7, tree_method='approx'
LinearRegression | normalize=False
Neural Network | hidden_layer_sizes=100, activation='relu', solver='adam'
DecisionTree | criterion='mse', min_samples_split=2
KNN | weights='uniform'
RandomForest | n_estimators=300, criterion='mse', max_depth=7
LightGBM | objective='regression', n_estimators=300
SVR | kernel='linear', gamma=0.1
ExtraTrees | criterion='mse', min_samples_split=2
AdaBoost | n_estimators=300, random_state=0
GBR | n_estimators=300, learning_rate=0.08
Hyperparameter Configuration for Algorithms
Algorithm | RMSE | MAE | R-squared
XGBoost | 0.092 | 0.066 | 0.793
RandomForest | 0.110 | 0.083 | 0.702
LightGBM | 0.092 | 0.067 | 0.790
SVR | 0.116 | 0.087 | 0.669
ExtraTrees | 0.096 | 0.069 | 0.773
Prediction Performance of Proposed Method with Existing Methods
Algorithm | RMSE | MAE | R-squared
XGBoost | 0.092 | 0.066 | 0.793
LinearRegression | 0.115 | 0.087 | 0.672
Neural Network | 0.114 | 0.084 | 0.680
DecisionTree | 0.137 | 0.097 | 0.535
KNN | 0.120 | 0.088 | 0.646
AdaBoost | 0.129 | 0.100 | 0.590
GBR | 0.098 | 0.072 | 0.765
Algorithm Prediction Results
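The comparison tables report RMSE, MAE and R-squared for each algorithm. For readers who want to reproduce such a comparison, the following minimal sketch computes the three metrics from their standard definitions. The sample values are hypothetical and unrelated to the Airbnb data, and the page does not state how the target variable was scaled before evaluation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R-squared) for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Hypothetical values, used only to exercise the function.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R-squared={r2:.3f}")
```

Lower RMSE and MAE and higher R-squared indicate a better fit, which is the sense in which the tables rank XGBoost ahead of the other algorithms.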
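Parameter | Category | Meaning | Search space | Tuned value
learning_rate | Booster parameter | Shrinkage step size applied during updates | [0.07, 0.075, 0.08, 0.085, 0.09] | 0.075
n_estimators | Learning task parameter | Number of weak learners | [450, 500, 550, 600, 650] | 650
max_depth | Booster parameter | Maximum tree depth | [6-10] | 8
subsample | Booster parameter | Fraction of samples randomly drawn for each tree | [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9] | 0.900
colsample_bytree | Booster parameter | Fraction of features randomly sampled when building each tree | [0.8, 0.85, 0.9, 0.95, 1] | 0.850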
XGBoost Parameter Tuning Results
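To give a sense of the size of the search space in the tuning table, the sketch below enumerates every combination of the listed values. Two assumptions are made here that the page does not confirm: that `[6-10]` means the integers 6 through 10, and that the search was an exhaustive grid over all parameters jointly (the authors may have tuned parameters one at a time or used another strategy). The helper name `iter_grid` is hypothetical.

```python
from itertools import product

# Search space copied from the tuning table; max_depth range is assumed to be 6..10.
search_space = {
    "learning_rate": [0.07, 0.075, 0.08, 0.085, 0.09],
    "n_estimators": [450, 500, 550, 600, 650],
    "max_depth": [6, 7, 8, 9, 10],
    "subsample": [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9],
    "colsample_bytree": [0.8, 0.85, 0.9, 0.95, 1],
}

# Tuned values as reported in the table.
tuned = {
    "learning_rate": 0.075,
    "n_estimators": 650,
    "max_depth": 8,
    "subsample": 0.9,
    "colsample_bytree": 0.85,
}


def iter_grid(space):
    """Yield one dict per combination of parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


n_candidates = sum(1 for _ in iter_grid(search_space))
tuned_in_grid = any(candidate == tuned for candidate in iter_grid(search_space))
print("candidate configurations:", n_candidates)
print("tuned configuration is in the grid:", tuned_in_grid)
```

Under these assumptions a full grid would require fitting several thousand models per cross-validation fold, which is one reason sequential or randomized searches are often preferred in practice.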
Performance of XGBoost with LightGBM, RandomForest and SVR
SHAP Feature Analysis
SHAP Feature Dependence Analysis
Rank | XGBoost feature | Score | RandomForest feature | Score | SHAP feature | Score
1 | room_type_Entire home/apt | 0.330 | room_type_Entire home/apt | 0.507 | room_type_Entire home/apt | 0.058
2 | bedrooms | 0.093 | bathrooms | 0.233 | accommodates | 0.034
3 | room_type_Shared room | 0.093 | room_type_Shared room | 0.050 | longitude | 0.029
4 | property_type_Boutique hotel | 0.049 | longitude | 0.041 | bedrooms | 0.026
5 | bathrooms | 0.047 | cleaning_fee | 0.036 | bathrooms | 0.019
6 | room_type_Private room | 0.040 | accommodates | 0.029 | cleaning_fee | 0.018
7 | accommodates | 0.039 | host_listings_count | 0.020 | latitude | 0.017
8 | room_type_Hotel room | 0.019 | property_type_Boutique hotel | 0.018 | minimum_nights | 0.016
9 | property_type_villa | 0.018 | bedrooms | 0.016 | availability_365 | 0.010
10 | property_type_Campsite | 0.014 | latitude | 0.011 | room_type_Shared room | 0.008
The Feature Importance of XGBoost, RandomForest and SHAP
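The SHAP column above rests on Shapley values, which attribute a prediction to individual features. The sketch below computes exact Shapley values by brute force for a hypothetical three-feature pricing function, replacing "absent" features with baseline values. It is meant only to illustrate the attribution idea: the toy model, its coefficients, and the baseline are invented, and practical SHAP tooling uses far more efficient tree-specific algorithms rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_price_model(x):
    """Hypothetical price function with an interaction term (not the paper's model)."""
    entire_home, bedrooms, bathrooms = x
    return 50.0 + 80.0 * entire_home + 30.0 * bedrooms + 10.0 * bedrooms * bathrooms


feature_names = ["entire_home", "bedrooms", "bathrooms"]
instance = [1, 2, 2]   # listing being explained
baseline = [0, 1, 1]   # reference listing
phis = shapley_values(toy_price_model, instance, baseline)

for name, phi in zip(feature_names, phis):
    print(f"{name}: {phi:+.1f}")
print("sum of contributions:", sum(phis))
print("prediction minus baseline:", toy_price_model(instance) - toy_price_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction between bedrooms and bathrooms is split evenly between the two features. This additivity is what allows per-feature contributions to be aggregated into a global ranking like the one in the table.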