Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (4): 152-166    DOI: 10.11925/infotech.2096-3467.2023.0389
Automatic Recognition of Exploratory and Lookup Intents Based on Berry Picking Model
Liu Jie1, Gui Sisi2, Zhang Xiaojuan3
1College of Computer and Information Science, Southwest University, Chongqing 400715, China
2College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
3School of Public Administration, Sichuan University, Chengdu 610065, China
[Objective] This paper selects several new classification features to improve the accuracy of automatically recognizing exploratory and lookup intents. [Methods] First, we collected 1,805 queries from the AOL search log and manually labelled them. Second, inspired by the Berry Picking model, we proposed classification features covering three aspects: query nature, search process, and information source. Third, we evaluated the proposed features with Naive Bayes, SVM, Decision Tree, Random Forest, and Neural Network classifiers. Finally, we examined the classification performance of individual features and feature sets. [Results] All three types of classification features effectively distinguish exploratory from lookup intents, with query nature-based features performing best. Among the five classification models, the neural network-based model performed best (Accuracy=0.8172, Precision=0.8494, Recall=0.7747, F1 Score=0.8103). [Limitations] We did not evaluate the newly proposed classification features on multiple datasets. User search behaviors need to be explored more fully to derive more effective classification features. Moreover, the dataset used for exploratory/lookup intent recognition was limited in size because manual labelling is time-consuming and labor-intensive. [Conclusions] The proposed features based on the Berry Picking model can effectively distinguish between exploratory and lookup intents.

Keywords: Query Intent Recognition; Exploratory Intent; Lookup Intent; Berry Picking Model
Received: 29 April 2023      Published: 13 September 2023
ZTFLH:  TP393  
Fund: National Social Science Fund of China (19CTQ023)
Corresponding Author: Zhang Xiaojuan, ORCID: 0000-0002-5889-5922

Liu Jie, Gui Sisi, Zhang Xiaojuan. Automatic Recognition of Exploratory and Lookup Intents Based on Berry Picking Model. Data Analysis and Knowledge Discovery, 2024, 8(4): 152-166.

Category System of Search Activities Defined by Marchionini
Berry Picking Model
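Query reformulation type | Description | Query example
New | Q_i and Q_{i+1} share no common terms | Q_i: “back to the future”; Q_{i+1}: “holiday mansion houseboat”
Add | Q_i is a subset of Q_{i+1}, i.e., Q_{i+1} contains at least as many terms as Q_i | Q_i: “select business services”; Q_{i+1}: “select business services title”
Replace | Q_i and Q_{i+1} share at least one common term and differ in at least one term | Q_i: “national real estate settlement services”; Q_{i+1}: “Pennsylvania real estate settlement services”
Remove | Q_i is a superset of Q_{i+1}, i.e., Q_i contains at least as many terms as Q_{i+1} | Q_i: “auto locator Pennsylvania”; Q_{i+1}: “auto locator”
Repeat | Q_i and Q_{i+1} contain exactly the same terms, possibly in a different order | Q_i: “coats tire equipment”; Q_{i+1}: “coats tire equipment”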
Definition and Examples of Query Reformulating Types
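The definitions in this table can be read as set relations between the terms of two adjacent queries. The following is a minimal sketch under stated assumptions, not the authors' implementation: terms are obtained by lowercasing and whitespace splitting, "repeat" is checked first so that identical term sets are not also counted as "add" or "remove", and the function name `reformulation_type` is hypothetical.

```python
def reformulation_type(q_prev: str, q_next: str) -> str:
    """Label the move from Q_i (q_prev) to Q_{i+1} (q_next) by comparing term sets.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    prev_terms = set(q_prev.lower().split())
    next_terms = set(q_next.lower().split())

    if prev_terms == next_terms:
        return "repeat"   # same terms, order ignored
    if not prev_terms & next_terms:
        return "new"      # no common terms
    if prev_terms < next_terms:
        return "add"      # Q_i is a proper subset of Q_{i+1}
    if prev_terms > next_terms:
        return "remove"   # Q_i is a proper superset of Q_{i+1}
    return "replace"      # some shared terms, some different terms


# The query pairs from the table above.
examples = [
    ("back to the future", "holiday mansion houseboat"),
    ("select business services", "select business services title"),
    ("national real estate settlement services",
     "Pennsylvania real estate settlement services"),
    ("auto locator Pennsylvania", "auto locator"),
    ("coats tire equipment", "coats tire equipment"),
]
labels = [reformulation_type(a, b) for a, b in examples]
```

Checking "repeat" before the subset tests is a design choice of this sketch: the table's "add" and "remove" definitions use "greater than or equal to", so they would otherwise overlap with "repeat".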
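Berry Picking model aspect | Classification feature in this paper | Description
Query nature | Query term diversity (DQT) | Average proportion of non-co-occurring terms between adjacent queries, over all sessions containing query q
Query nature | Query semantic diversity (DQS) | Average semantic diversity between the vectors of adjacent queries, over all sessions containing query q
Query nature | Query reformulation features: (1) average number of query reformulation types (AART) | Over the sessions containing query q, the ratio of the total number of query reformulation types to the number of sessions
Query nature | Query reformulation features: (2) average frequency of each query reformulation type (AFRT) | Over the sessions containing query q, the sum of the per-session ratios of a given reformulation type's count to the count of all reformulation types in that session, divided by the number of sessions
Search process | Average length of query reformulation paths (QRPL) | Average length of all query reformulation paths, over all sessions containing query q
Search process | Average time interval of query reformulation paths (AVTI) | Average time interval for completing all query reformulation paths in a session, over all sessions containing query q
Search process | Average number of reformulation path types (ANRPT) | Over the sessions containing query q, the ratio of the total number of query reformulation path types to the number of sessions
Information source | URL depth (UDP) | Average number of “/” characters in the paths of adjacent URL pairs that co-occur in the same session, over all sessions containing query q
Information source | URL diversity (UDV) | Average proportion of differing URL segments between adjacent URLs that co-occur in the same session, over all sessions containing query q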
Features of Multiple Categories Selected in This Paper
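Three of these features (DQT, UDV, UDP) depend only on comparing adjacent items within a session, so they can be illustrated compactly. The sketch below is one possible reading of the descriptions, not the authors' code. It assumes that the "proportion of non-co-occurring" items is the size of the symmetric difference divided by the size of the union, that URLs carry a scheme such as `http://`, and that each session is a time-ordered list. All function names are hypothetical.

```python
from statistics import mean
from urllib.parse import urlparse


def set_diversity(a: set, b: set) -> float:
    """Share of items found in only one of the two sets (assumed definition)."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def query_terms(query: str) -> set:
    """Lowercased, whitespace-separated terms of a query."""
    return set(query.lower().split())


def url_segments(url: str) -> set:
    """Host and path pieces of a URL, split on '/'."""
    parsed = urlparse(url)
    return {seg for seg in (parsed.netloc + parsed.path).split("/") if seg}


def url_depth(url: str) -> int:
    """Number of '/' characters in the URL path."""
    return urlparse(url).path.count("/")


def adjacent_pairs(sessions):
    """Yield (item_i, item_i+1) pairs within each session."""
    for session in sessions:
        yield from zip(session, session[1:])


def dqt(query_sessions) -> float:
    """Query term diversity averaged over adjacent query pairs."""
    ratios = [set_diversity(query_terms(a), query_terms(b))
              for a, b in adjacent_pairs(query_sessions)]
    return mean(ratios) if ratios else 0.0


def udv(url_sessions) -> float:
    """URL diversity averaged over adjacent URL pairs."""
    ratios = [set_diversity(url_segments(a), url_segments(b))
              for a, b in adjacent_pairs(url_sessions)]
    return mean(ratios) if ratios else 0.0


def udp(url_sessions) -> float:
    """URL depth averaged over adjacent URL pairs."""
    depths = [mean((url_depth(a), url_depth(b)))
              for a, b in adjacent_pairs(url_sessions)]
    return mean(depths) if depths else 0.0
```

For example, a session holding the queries “auto locator Pennsylvania” and “auto locator” contributes a term diversity of 1/3 under this reading, because one of the three distinct terms appears in only one query.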
AOL Dataset Format
Feature | Lookup Precision | Lookup Recall | Lookup F1 | Exploratory Precision | Exploratory Recall | Exploratory F1 | Overall Accuracy
Baseline | 0.7500 | 0.5363 | 0.6254 | 0.6438 | 0.8242 | 0.7229 | 0.6814
All new features | 0.8481 | 0.7486 | 0.7953 | 0.7783 | 0.8681 | 0.8208 | 0.8089
Baseline + query nature features | 0.8462 | 0.7374 | 0.7881 | 0.7707 | 0.8681 | 0.8165 | 0.8033
Baseline + search process features | 0.8116 | 0.6257 | 0.7067 | 0.6996 | 0.8571 | 0.7704 | 0.7424
Baseline + information source features | 0.7571 | 0.5922 | 0.6646 | 0.6697 | 0.8132 | 0.7345 | 0.7036
Baseline + all new features | 0.8395 | 0.7598 | 0.7977 | 0.7839 | 0.8571 | 0.8189 | 0.8089
Comparison and Analysis of Intention Recognition Results
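The F1 values in this table follow from the standard harmonic-mean definition, F1 = 2PR / (P + R). As a small worked check using only the Baseline row's reported precision and recall (the function name `f1_score` is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Baseline row: lookup intent (P=0.7500, R=0.5363),
# exploratory intent (P=0.6438, R=0.8242).
baseline_lookup_f1 = f1_score(0.7500, 0.5363)
baseline_exploratory_f1 = f1_score(0.6438, 0.8242)

print(f"{baseline_lookup_f1:.4f} {baseline_exploratory_f1:.4f}")
```

Both results agree with the reported Baseline F1 values (0.6254 and 0.7229) to four decimal places.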
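Classifier | Feature set | Accuracy | Precision | Recall | F1
Naive Bayes | ① Query nature feature set | 0.6066 | 0.6031 | 0.6429 | 0.6224
Naive Bayes | ② Search process feature set | 0.5651 | 0.5628 | 0.6154 | 0.5879
Naive Bayes | ③ Information source feature set | 0.5069 | 0.5112 | 0.5000 | 0.5055
Naive Bayes | ①+② | 0.7036 | 0.6847 | 0.7637 | 0.7220
Naive Bayes | ①+③ | 0.7008 | 0.8142 | 0.5140 | 0.6301
Naive Bayes | ②+③ | 0.7285 | 0.6810 | 0.8681 | 0.7633
Naive Bayes | ①+②+③ | 0.7729 | 0.7500 | 0.8242 | 0.7854
SVM | ① Query nature feature set | 0.7839 | 0.7737 | 0.8077 | 0.7903
SVM | ② Search process feature set | 0.7285 | 0.6842 | 0.8571 | 0.7610
SVM | ③ Information source feature set | 0.5789 | 0.5641 | 0.7253 | 0.6346
SVM | ①+② | 0.8033 | 0.8000 | 0.8132 | 0.8065
SVM | ①+③ | 0.8089 | 0.7958 | 0.8352 | 0.8150
SVM | ②+③ | 0.7285 | 0.6858 | 0.8516 | 0.7598
SVM | ①+②+③ | 0.8116 | 0.8065 | 0.8242 | 0.8153
Decision Tree | ① Query nature feature set | 0.7812 | 0.7488 | 0.8516 | 0.7969
Decision Tree | ② Search process feature set | 0.7396 | 0.6930 | 0.8681 | 0.7710
Decision Tree | ③ Information source feature set | 0.6593 | 0.6130 | 0.8791 | 0.7223
Decision Tree | ①+② | 0.7839 | 0.7574 | 0.8407 | 0.7969
Decision Tree | ①+③ | 0.7812 | 0.7488 | 0.8516 | 0.7969
Decision Tree | ②+③ | 0.7645 | 0.7622 | 0.7747 | 0.7684
Decision Tree | ①+②+③ | 0.7895 | 0.7677 | 0.8352 | 0.8000
Random Forest | ① Query nature feature set | 0.7978 | 0.7711 | 0.8516 | 0.8094
Random Forest | ② Search process feature set | 0.7590 | 0.7230 | 0.8462 | 0.7798
Random Forest | ③ Information source feature set | 0.6842 | 0.6349 | 0.8791 | 0.7373
Random Forest | ①+② | 0.8033 | 0.7707 | 0.8681 | 0.8165
Random Forest | ①+③ | 0.8061 | 0.7718 | 0.8736 | 0.8196
Random Forest | ②+③ | 0.7645 | 0.7256 | 0.8571 | 0.7859
Random Forest | ①+②+③ | 0.8089 | 0.7783 | 0.8681 | 0.8208
Neural Network | ① Query nature feature set | 0.7978 | 0.7824 | 0.8297 | 0.8054
Neural Network | ② Search process feature set | 0.7285 | 0.7360 | 0.7198 | 0.7278
Neural Network | ③ Information source feature set | 0.7036 | 0.6884 | 0.7527 | 0.7191
Neural Network | ①+② | 0.8116 | 0.7908 | 0.8516 | 0.8201
Neural Network | ①+③ | 0.8061 | 0.8294 | 0.7747 | 0.8011
Neural Network | ②+③ | 0.7479 | 0.7198 | 0.8187 | 0.7661
Neural Network | ①+②+③ | 0.8172 | 0.8494 | 0.7747 | 0.8103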
Classification Performance of Different Features on Different Classifiers
Classification Performance of Query Property Features
Classification Performance of Search Process Features
Classification Performance of Information Source Features
Classification Performance of Query Property and Search Process Features
Classification Performance of Query Property and Information Source Features
Classification Performance of Search Process and Information Source Features
Classification Performance of All Three Types of Features
A Matrix of Correlations Between Individual Features
Feature | Accuracy | Precision | Recall | F1
All new features | 0.8172 | 0.8494 | 0.7747 | 0.8103
Without DQT | 0.8172 | 0.8021 | 0.8462 | 0.8236
Without DQS | 0.8089 | 0.7897 | 0.8462 | 0.8170
Without AART | 0.8033 | 0.7789 | 0.8516 | 0.8136
Without AFRT_new | 0.8172 | 0.8537 | 0.7692 | 0.8093
Without AFRT_add | 0.8089 | 0.7989 | 0.8297 | 0.8140
Without AFRT_remove | 0.8089 | 0.7868 | 0.8516 | 0.8179
Without AFRT_replace | 0.7950 | 0.7727 | 0.8407 | 0.8053
Without AFRT_repeat | 0.8033 | 0.7789 | 0.8516 | 0.8136
Without QRPL | 0.8033 | 0.7789 | 0.8516 | 0.8136
Without AVTI | 0.8089 | 0.7868 | 0.8516 | 0.8179
Without ANRPT | 0.7978 | 0.7711 | 0.8516 | 0.8094
Without UDP | 0.8172 | 0.7959 | 0.8571 | 0.8254
Without UDV | 0.8144 | 0.8108 | 0.8242 | 0.8174
Classification Performance of Removing Individual Features
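One way to read this table is to compute how much accuracy falls when each feature is removed, relative to the full feature set. The sketch below does only that arithmetic on the reported accuracies; the variable names are illustrative, and small differences between rows may not be meaningful without repeated runs, which the source does not report.

```python
# Accuracy with all new features, as reported in the table.
full_accuracy = 0.8172

# Accuracy after removing each individual feature, as reported in the table.
accuracy_without = {
    "DQT": 0.8172, "DQS": 0.8089, "AART": 0.8033,
    "AFRT_new": 0.8172, "AFRT_add": 0.8089, "AFRT_remove": 0.8089,
    "AFRT_replace": 0.7950, "AFRT_repeat": 0.8033,
    "QRPL": 0.8033, "AVTI": 0.8089, "ANRPT": 0.7978,
    "UDP": 0.8172, "UDV": 0.8144,
}

# Accuracy drop caused by removing each feature (larger = more reliance on it).
accuracy_drop = {
    name: round(full_accuracy - acc, 4)
    for name, acc in accuracy_without.items()
}

# Features ordered from largest to smallest accuracy drop.
ranked = sorted(accuracy_drop.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name:14s} {drop:.4f}")
```

Under this reading, removing AFRT_replace produces the largest accuracy drop in the table, followed by ANRPT, while removing DQT, AFRT_new, or UDP leaves accuracy unchanged.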
