Choosing Stopwords for Patent Topic Analysis Based on Auxiliary Set
Yu Yan1,2, Zhao Naixuan1
1Information Service Department, Nanjing Tech University, Nanjing 210009, China
2Department of Computer Engineering, Southeast University Chengxian College, Nanjing 211816, China
[Objective] This paper proposes a method for automatically selecting domain-specific stopwords, aiming to improve the performance of patent topic analysis. [Methods] First, we introduced an auxiliary set and defined two indexes based on it: document frequency and entropy among categories. Then, we used these indexes to measure how words are distributed across the auxiliary set and selected the domain-specific stopwords automatically. [Results] The proposed method improved the quality of the identified patent topics. [Limitations] The types and members of the auxiliary set need further study. [Conclusions] The proposed stopword selection method measures the distributional characteristics of words, which helps identify domain-specific stopwords for patent analysis more effectively.
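The two indexes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas or selection rule, so everything below is an assumption made for illustration: the auxiliary set is modeled as tokenized documents grouped by category, document frequency is the share of auxiliary documents containing a word, entropy is the Shannon entropy of a word's document counts over categories (not normalized for category size), and a word is treated as a stopword candidate when both values are high. The function names and thresholds (`min_df`, `min_entropy_ratio`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_statistics(docs_by_category):
    """Return {word: (document_frequency, category_entropy)}.

    docs_by_category maps a category label to a list of tokenized documents.
    Both definitions are illustrative assumptions, not the paper's formulas.
    """
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    # word -> Counter of how many documents in each category contain it
    per_category = defaultdict(Counter)
    for category, docs in docs_by_category.items():
        for tokens in docs:
            for word in set(tokens):  # count each document at most once
                per_category[word][category] += 1

    stats = {}
    for word, counts in per_category.items():
        n = sum(counts.values())
        df = n / total_docs
        # Shannon entropy (base 2) of the word's spread over categories
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        stats[word] = (df, entropy)
    return stats


def candidate_stopwords(docs_by_category, min_df=0.5, min_entropy_ratio=0.9):
    """Words that are frequent and evenly spread over categories (assumed rule)."""
    max_entropy = math.log2(len(docs_by_category))
    return sorted(
        word
        for word, (df, entropy) in word_statistics(docs_by_category).items()
        if df >= min_df and entropy >= min_entropy_ratio * max_entropy
    )


# Toy auxiliary set with two hypothetical categories.
auxiliary = {
    "chemistry": [["method", "comprising", "polymer"],
                  ["method", "said", "catalyst"]],
    "electronics": [["method", "comprising", "circuit"],
                    ["method", "said", "antenna"]],
}
print(candidate_stopwords(auxiliary))
```

In this toy data, words such as "method" occur in every category with equal counts, so they score high on both indexes, while "polymer" occurs in a single category and has zero entropy. The thresholds would need tuning on a real auxiliary set.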