New Technology of Library and Information Service  2014, Vol. 30 Issue (5): 18-25    DOI: 10.11925/infotech.1003-3513.2014.05.03
DIGITAL LIBRARY
Research and Implementation of Bibliographic Information Classification System in LDA Model
Li Xiangdong¹, Liao Xiangpeng¹, Huang Li²
¹ School of Information Management, Wuhan University, Wuhan 430072, China;
² Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To improve the classification of bibliographic information for books, journal articles, and similar records. [Context] Classification performance under the traditional vector space model (VSM) is unsatisfactory; the LDA model can improve it by mining latent semantic information. [Methods] Each text is represented by its latent topics using the LDA model, the optimal number of topics is determined from the classification results, and the SVM algorithm is then used for classification. [Results] Experiments show that Macro_F1 reaches 95.5% and 93.5% on the Fudan and Sogou corpora, respectively, and 77.4% and 87.6% on real data from a library catalogue and an electronic journal database, respectively. [Conclusions] Compared with the VSM, classification performance on the real data increases by 10% and 3%, respectively, which reaches a practical level.
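
The selection step described under [Methods] can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes Macro_F1 is the unweighted mean of per-class F1 scores (the abstract does not give the exact definition), and the names `macro_f1`, `select_num_topics`, `predict_with_k`, and the toy predictions are hypothetical. In practice `predict_with_k` would train LDA with K topics, fit an SVM on the topic vectors, and return predicted labels for held-out records.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_num_topics(candidates, predict_with_k, y_true):
    """Return (best_k, best_score); the first K wins on ties."""
    best_k, best_score = None, -1.0
    for k in candidates:
        score = macro_f1(y_true, predict_with_k(k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy stand-in for "train LDA with K topics, classify with SVM, predict".
y_true = ["A", "A", "B", "B", "C", "C"]
toy_predictions = {
    20: ["A", "B", "B", "B", "C", "A"],
    50: ["A", "A", "B", "B", "C", "C"],
    100: ["A", "A", "B", "C", "C", "C"],
}
best_k, best_score = select_num_topics(
    [20, 50, 100], toy_predictions.__getitem__, y_true
)
print(best_k, round(best_score, 3))
```

Choosing K by the downstream classification score, rather than by a model-fit criterion, ties the topic count directly to the metric the abstract reports.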

Key words: Latent Dirichlet Allocation; Text categorization; Vector Space Model; Gibbs sampling; Support Vector Machine
Received: 02 January 2014      Published: 06 June 2014
CLC number: TP181

Cite this article:

Li Xiangdong, Liao Xiangpeng, Huang Li. Research and Implementation of Bibliographic Information Classification System in LDA Model. New Technology of Library and Information Service, 2014, 30(5): 18-25.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.05.03     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I5/18

