Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 51-65    DOI: 10.11925/infotech.2096-3467.2020.1186
Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost
Cao Rui (1), Liao Bin (1), Li Min (1,2), Sun Ruina (1,3,4)
(1) College of Statistics and Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China
(2) School of Information Science and Engineering, Xinjiang University, Urumqi 830008, China
(3) Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
(4) School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
[Objective] This paper proposes an XGBoost-based model for predicting prices and analyzing features of online short-term rentals, addressing the lack of a reasonable pricing-suggestion mechanism for listings with different characteristics. [Methods] We collected data from the Airbnb platform and used Lasso to select features from the raw data and reduce its dimensionality. We then fed the selected features to XGBoost and iteratively trained the prediction model. Finally, we used SHAP values to interpret the model's features. [Results] After hyperparameter tuning, the proposed model achieved an RMSE of 0.091, an MAE of 0.065 and an R-squared of 0.798, outperforming the four existing models. [Limitations] The model does not incorporate features from real-time online business data, which limits its prediction accuracy. [Conclusions] The proposed model is highly interpretable and identifies the key factors affecting listing prices, which can help landlords improve their services.

Key words: Machine Learning; Pricing Model; Online Short-Term Rental; XGBoost Model; SHAP Value
Received: 29 November 2020      Published: 06 July 2021
Chinese Library Classification (ZTFLH): TP391
Fund: National Natural Science Foundation of China (61562078); Tianshan Youth Program of Xinjiang (2018Q073)
Corresponding Author: Liao Bin

Cao Rui, Liao Bin, Li Min, Sun Ruina. Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost. Data Analysis and Knowledge Discovery, 2021, 5(6): 51-65.

The XGBoost Modeling Process
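| Variable | Description | Type |
|---|---|---|
| price | Listing price | Numeric |
| host_is_superhost | Whether the host is a superhost | Boolean |
| host_listings_count | Number of listings the host has on Airbnb.com | Numeric |
| latitude | Latitude | Numeric |
| longitude | Longitude | Numeric |
| accommodates | Number of guests the listing can accommodate | Numeric |
| bathrooms | Number of bathrooms in the listing | Numeric |
| bedrooms | Number of bedrooms in the listing | Numeric |
| beds | Number of beds in the listing | Numeric |
| security_deposit | Security deposit | Numeric |
| cleaning_fee | Cleaning fee | Numeric |
| guests_included | Actual number of guests staying at the listing | Numeric |
| minimum_nights | Minimum number of nights the host requires | Numeric |
| maximum_nights | Maximum number of nights the host allows | Numeric |
| availability_365 | Number of days available out of 365 | Numeric |
| number_of_reviews | Number of reviews | Numeric |
| review_scores_rating | Overall review rating score | Numeric |
| review_scores_accuracy | Accuracy-of-description score | Numeric |
| review_scores_cleanliness | Cleanliness score | Numeric |
| review_scores_checkin | Check-in score | Numeric |
| review_scores_communication | Communication score | Numeric |
| review_scores_location | Location convenience score | Numeric |
| review_scores_value | Value-for-money score | Numeric |
| instant_bookable | Whether the listing can be booked instantly | Numeric |
| reviews_per_month | Average number of reviews per month | Numeric |
| extra_people_fee | Extra fee | Numeric |
| host_response_rate | Host response rate | Numeric |
| host_acceptance_rate | Host acceptance rate | Numeric |
| amenities | Amenities | String |
| host_verifications | Host identity verification information | String |
| cancellation_policy | Cancellation policy | String |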
Basic Characteristics of Airbnb Data
| Variable | count | mean | std | min | 25% | 75% | max |
|---|---|---|---|---|---|---|---|
| price | 37,048.000 | 227.916 | 685.160 | 0.000 | 69.000 | 185.000 | 25,000.000 |
| host_is_superhost | 37,048.000 | 0.324 | 0.468 | 0.000 | 0.000 | 1.000 | 1.000 |
| accommodates | 37,048.000 | 3.646 | 2.689 | 0.000 | 2.000 | 4.000 | 24.000 |
| bathrooms | 37,013.000 | 1.475 | 1.014 | 0.000 | 1.000 | 2.000 | 16.000 |
| bedrooms | 36,924.000 | 1.444 | 1.138 | 0.000 | 1.000 | 2.000 | 13.000 |
| beds | 36,667.000 | 1.969 | 1.679 | 0.000 | 1.000 | 2.000 | 50.000 |
| security_deposit | 37,048.000 | 372.586 | 2,231.724 | 0.000 | 0.000 | 300.000 | 250,000.000 |
| cleaning_fee | 37,048.000 | 83.825 | 100.025 | 0.000 | 20.000 | 109.000 | 2,500.000 |
| guests_included | 37,048.000 | 1.917 | 1.770 | 1.000 | 1.000 | 2.000 | 24.000 |
| minimum_nights | 37,048.000 | 12.715 | 26.759 | 1.000 | 1.000 | 30.000 | 1,125.000 |
| maximum_nights | 37,048.000 | 658.116 | 525.576 | 1.000 | 40.000 | 1,125.000 | 10,004.000 |
| availability_365 | 37,048.000 | 168.061 | 142.799 | 0.000 | 5.000 | 336.000 | 365.000 |
| number_of_reviews | 37,048.000 | 35.201 | 64.277 | 0.000 | 1.000 | 40.000 | 822.000 |
| review_scores_rating | 28,962.000 | 94.272 | 9.110 | 20.000 | 93.000 | 100.000 | 100.000 |
| review_scores_accuracy | 28,914.000 | 9.610 | 0.897 | 2.000 | 9.000 | 10.000 | 10.000 |
| review_scores_cleanliness | 28,915.000 | 9.418 | 1.011 | 2.000 | 9.000 | 10.000 | 10.000 |
| review_scores_checkin | 28,902.000 | 9.475 | 0.786 | 2.000 | 10.000 | 10.000 | 10.000 |
| review_scores_communication | 28,913.000 | 9.714 | 0.838 | 2.000 | 10.000 | 10.000 | 10.000 |
| review_scores_location | 28,898.000 | 9.707 | 0.730 | 2.000 | 9.000 | 10.000 | 10.000 |
| review_scores_value | 28,894.000 | 9.429 | 0.943 | 2.000 | 9.000 | 10.000 | 10.000 |
| instant_bookable | 37,048.000 | 0.432 | 0.495 | 0.000 | 0.000 | 1.000 | 1.000 |
| reviews_per_month | 29,413.000 | 1.605 | 1.750 | 0.010 | 0.300 | 2.410 | 17.230 |
| extra_people_fee | 37,048.000 | 0.507 | 0.499 | 0.000 | 0.000 | 1.000 | 1.000 |
| host_response_rate | 27,937.000 | 93.513 | 18.156 | 0.000 | 99.000 | 100.000 | 100.000 |
| host_acceptance_rate | 31,024.000 | 86.172 | 23.168 | 0.000 | 84.000 | 100.000 | 100.000 |
Descriptive Statistics of Airbnb Data
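
The count column of the descriptive statistics implies how many values are missing per variable. The sketch below derives missing counts and rates from those counts, assuming the full dataset has 37,048 rows (the largest count reported); the dictionary and variable names are illustrative, not taken from the paper's code.

```python
# Non-missing counts copied from the descriptive statistics table
# (only variables whose count is below the assumed total are listed).
TOTAL_ROWS = 37048  # assumption: the largest reported count is the dataset size

NON_MISSING = {
    "bathrooms": 37013,
    "bedrooms": 36924,
    "beds": 36667,
    "review_scores_rating": 28962,
    "review_scores_accuracy": 28914,
    "review_scores_cleanliness": 28915,
    "review_scores_checkin": 28902,
    "review_scores_communication": 28913,
    "review_scores_location": 28898,
    "review_scores_value": 28894,
    "reviews_per_month": 29413,
    "host_response_rate": 27937,
    "host_acceptance_rate": 31024,
}

# Missing count and missing rate per variable.
missing_count = {name: TOTAL_ROWS - n for name, n in NON_MISSING.items()}
missing_rate = {name: m / TOTAL_ROWS for name, m in missing_count.items()}

# Variable with the most missing values.
most_missing = max(missing_count, key=missing_count.get)

for name in sorted(missing_rate, key=missing_rate.get, reverse=True):
    print(f"{name:30s} {missing_count[name]:6d} {missing_rate[name]:7.2%}")
```

Under this assumption the review-score and host-rate variables are each missing in roughly a fifth to a quarter of the rows, while the room-count variables are almost complete.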
The Distribution of Listing Price
Heatmap of Partial Variables and Target Variable
Missing Variables
Lasso Feature Selection
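
The abstract states that Lasso was used to select features before training. As a rough illustration of why Lasso can act as a selector, the sketch below runs coordinate descent on the objective (1/(2n))·||y − Xβ||² + α·||β||₁ and keeps the features whose coefficients are non-zero. The toy data, the value of `alpha`, and the function names are all hypothetical; the data is assumed to be centered, so no intercept is fitted, and this is not the authors' implementation.

```python
def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; values within [-alpha, alpha] become 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n))*||y - X b||^2 + alpha*||b||_1 by cyclic coordinate descent.

    X is a list of rows; every column is assumed to be non-constant and centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return beta


# Toy data (made up): three orthogonal features, y = 3*f0 + 0.1*f1.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [3.1, -2.9, 2.9, -3.1]

beta = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

On this toy input the weak feature and the irrelevant feature are shrunk exactly to zero, so only the first feature is kept. In practice a library implementation would normally be used instead of hand-written loops.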
| Algorithm | Parameter configuration |
|---|---|
| XGBoost | n_estimators=300, learning_rate=0.08, gamma=0, subsample=0.75, colsample_bytree=1, max_depth=7, tree_method='approx' |
| LinearRegression | normalize=False |
| Neural Network | hidden_layer_sizes=100, activation='relu', solver='adam' |
| DecisionTree | criterion='mse', min_samples_split=2 |
| KNN | weights='uniform' |
| RandomForest | n_estimators=300, criterion='mse', max_depth=7 |
| LightGBM | objective='regression', n_estimators=300 |
| SVR | kernel='linear', gamma=0.1 |
| ExtraTrees | criterion='mse', min_samples_split=2 |
| AdaBoost | n_estimators=300, random_state=0 |
| GBR | n_estimators=300, learning_rate=0.08 |
Hyperparameter Configuration for Algorithms
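
For readers who want to reproduce the baseline, the XGBoost configuration listed above can be written as a parameter dictionary. The sketch assumes the scikit-learn style `XGBRegressor` wrapper from the `xgboost` package, which the source does not name explicitly; `build_baseline_model` is a hypothetical helper.

```python
# Values copied from the XGBoost row of the hyperparameter configuration table.
XGB_BASELINE_PARAMS = {
    "n_estimators": 300,
    "learning_rate": 0.08,
    "gamma": 0,
    "subsample": 0.75,
    "colsample_bytree": 1,
    "max_depth": 7,
    "tree_method": "approx",
}


def build_baseline_model(params=None):
    """Return an untrained regressor with the baseline settings.

    Hypothetical helper: assumes the xgboost package is installed and that
    the paper used the scikit-learn style XGBRegressor wrapper.
    """
    # Imported lazily so the dictionary above is usable without xgboost.
    from xgboost import XGBRegressor

    return XGBRegressor(**(params or XGB_BASELINE_PARAMS))
```

Calling `build_baseline_model().fit(X_train, y_train)` would then train on whatever feature matrix and target the reader supplies; those names are placeholders.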
| Algorithm | RMSE | MAE | R-squared |
|---|---|---|---|
| XGBoost | 0.092 | 0.066 | 0.793 |
| RandomForest | 0.110 | 0.083 | 0.702 |
| LightGBM | 0.092 | 0.067 | 0.790 |
| SVR | 0.116 | 0.087 | 0.669 |
| ExtraTrees | 0.096 | 0.069 | 0.773 |
Prediction Performance of the Proposed Method Compared with Existing Methods
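
The three metrics reported here follow their standard definitions: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch is shown below; the sample values are made up for illustration and are unrelated to the Airbnb data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R-squared) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Toy values (not from the paper).
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(f"RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

Lower RMSE and MAE and higher R-squared indicate a better fit, which is the sense in which the tables rank the algorithms.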
| Algorithm | RMSE | MAE | R-squared |
|---|---|---|---|
| XGBoost | 0.092 | 0.066 | 0.793 |
| LinearRegression | 0.115 | 0.087 | 0.672 |
| Neural Network | 0.114 | 0.084 | 0.680 |
| DecisionTree | 0.137 | 0.097 | 0.535 |
| KNN | 0.120 | 0.088 | 0.646 |
| AdaBoost | 0.129 | 0.100 | 0.590 |
| GBR | 0.098 | 0.072 | 0.765 |
Algorithm Prediction Results
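| Parameter | Category | Meaning | Search space | Tuned value |
|---|---|---|---|---|
| learning_rate | Booster parameter | Shrinkage step size applied at each update during learning | [0.07, 0.075, 0.08, 0.085, 0.09] | 0.075 |
| n_estimators | Learning task parameter | Number of weak learners | [450, 500, 550, 600, 650] | 650 |
| max_depth | Booster parameter | Maximum depth of a tree | [6-10] | 8 |
| subsample | Booster parameter | Fraction of samples randomly drawn for each tree | [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9] | 0.900 |
| colsample_bytree | Booster parameter | Fraction of features randomly sampled when building each tree | [0.8, 0.85, 0.9, 0.95, 1] | 0.850 |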
XGBoost Parameter Tuning Results
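
To give a sense of the size of this search space, the sketch below enumerates it as a full grid and checks which tuned values sit on the edge of their range. Two assumptions are made that the source does not confirm: that "[6-10]" means the integers 6 through 10, and that the parameters were searched jointly rather than one at a time.

```python
from itertools import product

# Search space and tuned values copied from the parameter tuning table.
SEARCH_SPACE = {
    "learning_rate": [0.07, 0.075, 0.08, 0.085, 0.09],
    "n_estimators": [450, 500, 550, 600, 650],
    "max_depth": [6, 7, 8, 9, 10],  # assumption: "[6-10]" = integers 6..10 inclusive
    "subsample": [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9],
    "colsample_bytree": [0.8, 0.85, 0.9, 0.95, 1],
}
TUNED = {
    "learning_rate": 0.075,
    "n_estimators": 650,
    "max_depth": 8,
    "subsample": 0.9,
    "colsample_bytree": 0.85,
}

# Number of candidate configurations if the grid were searched exhaustively.
n_combinations = sum(1 for _ in product(*SEARCH_SPACE.values()))

# Tuned values that coincide with the smallest or largest candidate.
on_boundary = [
    name
    for name, value in TUNED.items()
    if value in (min(SEARCH_SPACE[name]), max(SEARCH_SPACE[name]))
]

print(n_combinations, on_boundary)
```

A tuned value on the boundary of its range, as happens here for `n_estimators` and `subsample`, can be a hint that extending the range in that direction is worth testing, although the source does not report such an experiment.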
Performance of XGBoost Compared with LightGBM, RandomForest and SVR
SHAP Feature Analysis
SHAP Feature Dependence Analysis
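
SHAP values are based on Shapley values, which attribute a prediction to individual features by averaging each feature's marginal contribution over all feature subsets. The brute-force sketch below computes exact Shapley values for a tiny made-up model against a single baseline input. It is meant only to show the idea: the SHAP library uses much faster tree-specific algorithms, and `toy_model`, the instance, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to the baseline."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):  # subsets of the other n-1 features, sizes 0..n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(z):
    """Made-up model: two main effects, one interaction, third feature unused."""
    return 10 * z[0] + 5 * z[1] + 2 * z[0] * z[1]


x, baseline = [1, 2, 7], [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the unused third feature receives zero; the interaction term is split evenly between the two features that take part in it.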
| Rank | XGBoost feature | Importance | RandomForest feature | Importance | SHAP feature | Importance |
|---|---|---|---|---|---|---|
| 1 | room_type_Entire home/apt | 0.330 | room_type_Entire home/apt | 0.507 | room_type_Entire home/apt | 0.058 |
| 2 | bedrooms | 0.093 | bathrooms | 0.233 | accommodates | 0.034 |
| 3 | room_type_Shared room | 0.093 | room_type_Shared room | 0.050 | longitude | 0.029 |
| 4 | property_type_Boutique hotel | 0.049 | longitude | 0.041 | bedrooms | 0.026 |
| 5 | bathrooms | 0.047 | cleaning_fee | 0.036 | bathrooms | 0.019 |
| 6 | room_type_Private room | 0.040 | accommodates | 0.029 | cleaning_fee | 0.018 |
| 7 | accommodates | 0.039 | host_listings_count | 0.020 | latitude | 0.017 |
| 8 | room_type_Hotel room | 0.019 | property_type_Boutique hotel | 0.018 | minimum_nights | 0.016 |
| 9 | property_type_villa | 0.018 | bedrooms | 0.016 | availability_365 | 0.010 |
| 10 | property_type_Campsite | 0.014 | latitude | 0.011 | room_type_Shared room | 0.008 |
The Feature Importance of XGBoost, RandomForest and SHAP
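
One simple way to read the three rankings together is to look at which features appear in every top-10 list. The sketch below does this with set intersections over the feature names listed above; the variable names are illustrative and the analysis is not part of the source.

```python
# Top-10 feature names copied from the feature importance table, in rank order.
XGBOOST_TOP10 = [
    "room_type_Entire home/apt", "bedrooms", "room_type_Shared room",
    "property_type_Boutique hotel", "bathrooms", "room_type_Private room",
    "accommodates", "room_type_Hotel room", "property_type_villa",
    "property_type_Campsite",
]
RANDOM_FOREST_TOP10 = [
    "room_type_Entire home/apt", "bathrooms", "room_type_Shared room",
    "longitude", "cleaning_fee", "accommodates", "host_listings_count",
    "property_type_Boutique hotel", "bedrooms", "latitude",
]
SHAP_TOP10 = [
    "room_type_Entire home/apt", "accommodates", "longitude", "bedrooms",
    "bathrooms", "cleaning_fee", "latitude", "minimum_nights",
    "availability_365", "room_type_Shared room",
]

xgb, rf, shap_set = set(XGBOOST_TOP10), set(RANDOM_FOREST_TOP10), set(SHAP_TOP10)

# Features ranked in the top 10 by all three methods.
common_to_all = xgb & rf & shap_set

# Pairwise overlap sizes.
overlap = {
    "XGBoost vs RandomForest": len(xgb & rf),
    "XGBoost vs SHAP": len(xgb & shap_set),
    "RandomForest vs SHAP": len(rf & shap_set),
}

print(sorted(common_to_all))
print(overlap)
```

All three methods rank `room_type_Entire home/apt` first, and five features are shared by all three lists, while the XGBoost ranking leans more on one-hot room and property types than the other two.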
