[Objective] To improve the classification performance for bibliographic information of books, journal articles, and similar materials. [Context] Classification performance under the traditional vector space model (VSM) is unsatisfactory, whereas the LDA model can effectively improve classification by mining latent semantic information. [Methods] The LDA model is used to represent each text by its latent topics, the optimal number of topics is determined from the classification results, and the SVM algorithm is then used for classification. [Results] Experiments show that Macro_F1 on the Fudan and Sogou corpora reaches 95.5% and 93.5% respectively, and Macro_F1 on real data from a catalogue and an electronic journal database reaches 77.4% and 87.6% respectively. [Conclusions] Compared with the VSM, classification performance on the real data improves by 10% and 3% respectively, reaching a practical level.
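The abstract reports Macro_F1 but does not define it. The following is a minimal Python sketch of one common definition, the unweighted mean of per-class F1 scores. It is an illustration, not the authors' implementation: the function name `macro_f1` and the toy labels are hypothetical, and the paper may instead compute F1 from macro-averaged precision and recall.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed definition of Macro_F1)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Per-class true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Average over every class seen in either the gold or predicted labels.
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical classes.
gold = ["book", "book", "journal", "journal"]
predicted = ["book", "journal", "journal", "journal"]
score = macro_f1(gold, predicted)
```

Because every class carries equal weight regardless of its size, this measure penalizes poor performance on small categories, which matters for unevenly distributed bibliographic classes.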
Li Xiangdong, Liao Xiangpeng, Huang Li. Research and Implementation of Bibliographic Information Classification System in LDA Model[J]. New Technology of Library and Information Service, 2014, 30(5): 18-25.