Please wait a minute...
New Technology of Library and Information Service  2014, Vol. 30 Issue (5): 18-25    DOI: 10.11925/infotech.1003-3513.2014.05.03
DIGITAL LIBRARY Current Issue | Archive | Adv Search |
Research and Implementation of Bibliographic Information Classification System in LDA Model
Li Xiangdong1, Liao Xiangpeng1, Huang Li2
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Wuhan University Library, Wuhan 430072, China
Download:
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] To improve the classification effect of bibliographic information of books and journal articles etc. [Context] The classification performance under the traditional vector space model is not satisfied, and LDA model can effectively improve the classification effect by mining the implied semantic information. [Methods] Using LDA model to represent each text with implied topics, the optimal number of topics is determined on the classification result.Then the SVM classification algorithm is used. [Results] Experiments show that the Macro_F1 in Fudan and Sogou corpus reach 95.5% and 93.5% respectively; the Macro_F1 on the real data from catalogue and electronic journal database reach 77.4% and 87.6% respectively. [Conclusions] The classification performance on real data is increased by 10% and 3% respectively compared to the VSM, that reaches the practical level.

Key wordsLatent Dirichlet Allocation      Text categorization      Vector Space Model      Gibbs sampling      Support Vector Machine     
Received: 02 January 2014      Published: 06 June 2014
:  TP181  

Cite this article:

Li Xiangdong, Liao Xiangpeng, Huang Li. Research and Implementation of Bibliographic Information Classification System in LDA Model. New Technology of Library and Information Service, 2014, 30(5): 18-25.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.05.03     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I5/18

[1] Deerwester S, Dumais S, Furnas G W, et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
[2] Hofmann T.Probabilistic Latent Semantic Indexing [C]. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, California, United States. New York: ACM, 1999: 50-57.
[3] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[4] 刁宇峰, 杨亮, 林鸿飞. 基于LDA模型的博客垃圾评论发现[J]. 中文信息学报, 2011, 25(1): 41-47. (Diao Yufeng, Yang Liang, Lin Hongfei. LDA-Based Opinion Spam Discovering[J]. Journal of Chinese Information Processing, 2011, 25(1): 41-47.)
[5] 黄小亮, 郁抒思, 关佶红. 基于LDA主题模型的软件缺陷分派方法[J]. 计算机工程, 2011, 37(21):46-48. (Huang Xiaoliang, Yu Shusi, Guan Jihong. Software Bug Triage Method Based on LDA Topic Model[J]. Computer Engineering, 2011, 37(21): 46-48.)
[6] 廖晓锋, 王永吉, 范修斌, 等. 基于LDA主题模型的安全漏洞分类[J]. 清华大学学报:自然科学版, 2012, 52(10): 1351-1355. (Liao Xiaofeng, Wang Yongji, Fan Xiubin, et al. National Security Vulnerability Database Classification Based on an LDA Topic Model[J]. Journal of Tsinghua University: Science and Technology, 2012, 52(10): 1351-1355.)
[7] 孙李斌, 马贤明, 赵明明. 基于LDA 主题模型的遥感图像表示与分类[J]. 科技视界, 2013(7): 58-59. (Sun Libin, Ma Xianming, Zhao Mingming. Remote Sensing Image Representation and Classification Based on LDA Topic Model[J]. Science & Technology Vision, 2013(7): 58-59.)
[8] 张志飞, 苗夺谦, 高灿. 基于LDA主体模型的短文本分类方法[J]. 计算机应用, 2013, 33(6): 1587-1590. (Zhang Zhifei, Miao Duoqian, Gao Can. Short Text Classification Using Latent Dirichlet Allocation[J]. Journal of Computer Applications, 2013, 33(6): 1587-1590.)
[9] Phan X, Nguyen M, Horiguchi S. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections [C]. In: Proceedings of the 17th Conference on World Wide Web. New York: ACM, 2008: 91-100.
[10] Dempster A P, Laird N M, Rubin D B. Maximum Likelihood from Incomplete Data via the EM Algorithm[J]. Journal of the Royal Statistical Society, 1977, 39(l): 1-38.
[11] Griffiths T L, Steyvers M. Finding Scientific Topics[J].PNAS, 2004, 101(S1): 5228-5235.
[12] Griffiths T. Gibbs Sampling in the Generative Model of Latent Dirichlet Allocation [R]. Stanford University, 2002.
[13] 姚全珠, 宋志理, 彭程. 基于LDA模型的文本分类研究[J].计算机工程与应用, 2011, 47(13): 150-153. (Yao Quanzhu, Song Zhili, Peng Cheng.Research on Text Categorization Based on LDA[J]. Computer Engineering and Applications, 2011, 47(13): 150-153.)
[14] 曹娟, 张勇东, 李锦涛, 等. 一种基于密度的自适应最优LDA模型选择方法[J]. 计算机学报, 2008, 31(10): 1780-1787. (Cao Juan, Zhang Yongdong, Li Jintao, et al. A Method of Adaptively Selecting Best LDA Model Based on Density[J]. Chinese Journal of Computers, 2008, 31(10): 1780-1787.)
[15] 孙世杰, 濮建忠. 基于LDA模型的Twitter中文微博热点主题词组发现[J]. 洛阳师范学院学报, 2012, 31(11): 60-64. (Sun Shijie, Pu Jianzhong. A Hot Topic Phrase Selection Based on LDA for Chinese Tweets[J]. Journal of Luoyang Normal University, 2012, 31(11): 60-64.)

[1] Wang Hongbin,Wang Jianxiong,Zhang Yafei,Yang Heng. Topic Recognition of News Reports with Imbalanced Contents[J]. 数据分析与知识发现, 2021, 5(3): 109-120.
[2] Feng Hao, Li Shuqing. Multi-layer Cascade Classifier for Credit Scoring with Multiple-Support Vector Machines[J]. 数据分析与知识发现, 2021, 5(10): 28-36.
[3] Ding Shengchun,Yu Fengyang,Li Zhen. Identifying Potential Trending Topics of Online Public Opinion[J]. 数据分析与知识发现, 2020, 4(2/3): 29-38.
[4] Hongfei Ling,Shiyan Ou. Review of Automatic Labeling for Topic Models[J]. 数据分析与知识发现, 2019, 3(9): 16-26.
[5] Heran Qin,Liu Liu,Bin Li,Dongbo Wang. Automatic Classification of Ancient Classics with Entity Features[J]. 数据分析与知识发现, 2019, 3(9): 68-76.
[6] Ruojia Wang,Lu Zhang,Jimin Wang. Automatic Triage of Online Doctor Services Based on Machine Learning[J]. 数据分析与知识发现, 2019, 3(9): 88-97.
[7] Mingzhu Sun,Jing Ma,Lingfei Qian. Extracting Keywords Based on Topic Structure and Word Diagram Iteration[J]. 数据分析与知识发现, 2019, 3(8): 68-76.
[8] Qingtian Zeng,Mingdi Dai,Chao Li,Hua Duan,Zhongying Zhao. Discovering Important Locations with User Representation and Trace Data[J]. 数据分析与知识发现, 2019, 3(6): 75-82.
[9] Cheng Zhou,Hongqin Wei. Evaluating and Classifying Patent Values Based on Self-Organizing Maps and Support Vector Machine[J]. 数据分析与知识发现, 2019, 3(5): 117-124.
[10] Zhanglu Tan,Zhaogang Wang,Han Hu. Study on a Method of Feature Classification Selection Based on χ2 Statistics[J]. 数据分析与知识发现, 2019, 3(2): 72-78.
[11] Zhixiong Zhang,Huan Liu,Liangping Ding,Pengmin Wu,Gaihong Yu. Identifying Moves of Research Abstracts with Deep Learning Methods[J]. 数据分析与知识发现, 2019, 3(12): 1-9.
[12] Liangping Ding,Zhixiong Zhang,Huan Liu. Factors Affecting Rhetorical Move Recognition with SVM Model[J]. 数据分析与知识发现, 2019, 3(11): 16-23.
[13] Li Xiangdong,Gao Fan,Li Youhai. Categorizing Documents Automatically within Common Semantic Space[J]. 数据分析与知识发现, 2018, 2(9): 66-73.
[14] Feng Guoming,Zhang Xiaodong,Liu Suhui. Classifying Chinese Texts with CapsNet[J]. 数据分析与知识发现, 2018, 2(12): 68-76.
[15] Huang Xiaoxi,Li Hanyu,Wang Rongbo,Wang Xiaohua,Chen Zhiqun. Recognizing Metaphor with Convolution Neural Network and SVM[J]. 数据分析与知识发现, 2018, 2(10): 77-83.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn