Current Issue
    Volume 2, Issue 3 (2018)
    Identifying Potential Customers Based on User-Generated Contents
    Jiang Cuiqing, Song Kailun, Ding Yong, Liu Yao
    2018, 2 (3): 1-8.  DOI: 10.11925/infotech.2096-3467.2017.0849
    Abstract

    [Objective] This paper aims to identify potential customers by analyzing user-generated content from product-specific online forums. [Methods] First, we converted the unbalanced dataset into multiple balanced subsets. Then, we employed the Stacking classification algorithm to construct the identification model. Finally, we compared the results of the proposed method with five baseline algorithms. [Results] Compared to the BayesNet, Logistic, C4.5, SMO, and Naive Bayes algorithms, the F-measure of our method improved by 17.4%, 26.5%, 24.1%, 29.3%, and 40.9%, respectively. Compared to the Stacking, Bagging, and Boosting ensemble methods, our F-measure increased by 10.1%, 5.9%, and 13.1%, respectively. [Limitations] We only examined the performance of the proposed method in the automotive industry. [Conclusions] The proposed method could effectively identify potential customers based on user-generated content.
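
    As a rough sketch of the two steps described above (balancing the data, then stacking), the snippet below undersamples the majority class into several balanced subsets, trains one base learner per subset, and combines their outputs with a meta-learner in scikit-learn. The choice of decision trees and logistic regression, and the helper names, are illustrative assumptions rather than the configuration reported in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def balanced_subsets(X, y, minority_label=1, random_state=0):
    """Pair all minority samples with successive chunks of the majority class."""
    rng = np.random.RandomState(random_state)
    minority = np.where(y == minority_label)[0]
    majority = rng.permutation(np.where(y != minority_label)[0])
    n_subsets = max(1, len(majority) // len(minority))
    for chunk in np.array_split(majority, n_subsets):
        idx = np.concatenate([minority, chunk])
        yield X[idx], y[idx]

def fit_stacked(X, y):
    # Level 1: one base classifier per balanced subset.
    base = [DecisionTreeClassifier(max_depth=5).fit(Xs, ys)
            for Xs, ys in balanced_subsets(X, y)]
    # Level 2: meta-learner over the base classifiers' probability outputs
    # (a full stacking setup would use out-of-fold predictions here).
    meta_X = np.column_stack([m.predict_proba(X)[:, 1] for m in base])
    meta = LogisticRegression().fit(meta_X, y)
    return base, meta

def predict(base, meta, X):
    meta_X = np.column_stack([m.predict_proba(X)[:, 1] for m in base])
    return meta.predict(meta_X)
```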

    Classification Recommendation Based on ESSVM
    Hou Jun, Liu Kui, Li Qianmu
    2018, 2 (3): 9-21.  DOI: 10.11925/infotech.2096-3467.2017.1123
    Abstract

    [Objective] Traditional point-of-interest recommendation methods are mostly based on simple context and can only recommend the objects that are most popular, cheapest, or closest to the user. This paper combines time and category information with users’ check-in records to make up for these shortcomings, model user preferences, and improve recommendation accuracy. [Methods] Point-of-interest recommendation is treated as a ranking problem. We propose ESSVM (Embedded Space ranking SVM), an embedded-space ranking support vector machine model that ranks points of interest according to different features. User preferences are captured from check-in data, and the machine learning model adjusts the importance of different attributes in the ranking. [Results] Compared with UserCF, VenueCF, PoV, NNR, and other recommendation methods, ESSVM not only captures heterogeneous individual preferences but also reduces the time spent training the model. [Limitations] Collecting and integrating contextual information from different location-based social networks (LBSNs) takes considerable work. In addition, if users reduce the granularity of time and category in ESSVM, they may need to address the problem of data sparseness. [Conclusions] This method accounts for the impact of time variation on user preferences, as well as the location categories that users visit at different times. Given useful contextual information and check-in records, it provides personalized recommendations.
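
    To make the "recommendation as ranking" idea concrete, here is a minimal pairwise ranking-SVM sketch over hand-crafted POI features (e.g. time-slot match, category preference, distance). The feature columns, relevance labels, and the use of scikit-learn's LinearSVC are illustrative assumptions; the sketch does not reproduce the ESSVM embedding itself.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, relevance):
    """Turn pointwise POI features into pairwise differences for ranking."""
    Xp, yp = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if relevance[i] > relevance[j]:
                Xp.append(X[i] - X[j]); yp.append(1)
                Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Each row: features of one candidate POI for a given user/time context
# (toy columns: time-slot match, category preference, normalized distance).
X = np.array([[0.9, 1.0, 0.2], [0.1, 0.0, 0.8], [0.5, 1.0, 0.5]])
relevance = np.array([2, 0, 1])          # e.g. derived from check-in counts

Xp, yp = pairwise_transform(X, relevance)
ranker = LinearSVC(C=1.0).fit(Xp, yp)
scores = X @ ranker.coef_.ravel()        # higher score = recommend earlier
print(np.argsort(-scores))               # candidate POIs in ranked order
```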

    Measuring Novelty of Scholarly Articles
    Lu Wanhui, Tan Zongying
    2018, 2 (3): 22-29.  DOI: 10.11925/infotech.2096-3467.2017.1012
    Abstract

    [Objective] This paper aims to construct a novelty index to evaluate academic achievements. [Methods] First, we proposed a model to calculate a content eigenfactor based on deep learning (Doc2Vec) and a Hidden Markov Model. Then, we built a topic novelty index. Finally, we examined the proposed method with academic papers published in three Chinese LIS journals in 2014. [Results] Compared with existing methods, the proposed model measured topic novelty more effectively. [Limitations] Our empirical research only examined the abstracts of the academic papers. [Conclusions] The proposed method could help us evaluate and monitor scholarly research.
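
    The sketch below shows one simple way to turn Doc2Vec document vectors into a novelty score: the further a new abstract lies from its nearest earlier abstract in the embedding space, the more novel its topic is taken to be. The toy corpus and the nearest-neighbour scoring rule are assumptions for illustration; the Hidden Markov Model step of the paper is omitted.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

# Tokenized abstracts from earlier years, plus one new abstract to score.
prior_abstracts = [["topic", "model", "evaluation"],
                   ["citation", "network", "analysis"]]
new_abstract = ["deep", "learning", "novelty", "measure"]

corpus = [TaggedDocument(words, [i]) for i, words in enumerate(prior_abstracts)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

new_vec = model.infer_vector(new_abstract)
# Novelty = cosine distance to the closest previously seen document vector.
novelty = min(cosine(new_vec, model.dv[i]) for i in range(len(prior_abstracts)))
print(f"topic novelty score: {novelty:.3f}")
```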

    Identifying Interdisciplinary Social Science Research Based on Article Classification
    Liu Liu, Wang Dongbo
    2018, 2 (3): 30-38.  DOI: 10.11925/infotech.2096-3467.2017.0822
    Abstract

    [Objective] This study aims to quantitatively examine interdisciplinary social science research with the help of automatic classification methods from machine learning. [Methods] We used the KNN algorithm to classify social science papers indexed by CNKI and then proposed a new method to calculate their degree of interdisciplinarity. [Results] There were significant differences among the classification results of the disciplines. We also found a significant correlation between the classification results and the interdisciplinarity of papers. [Limitations] More quantitative research is needed to expand the present study. [Conclusions] Machine learning could effectively identify interdisciplinary social science studies.
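
    As a hedged illustration, the snippet below classifies papers into disciplines with KNN over TF-IDF features and uses the entropy of the neighbour-vote distribution as a rough proxy for a paper's degree of interdisciplinarity. The toy texts, labels, and the entropy-based measure are assumptions and may differ from the measure proposed in the paper.

```python
from scipy.stats import entropy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["labor market policy", "educational psychology survey",
               "macro economics growth", "classroom learning behavior"]
train_labels = ["economics", "education", "economics", "education"]
test_texts = ["economics of education and school funding"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)

# Neighbor-vote shares over disciplines; a flatter distribution suggests
# the paper draws on more than one discipline.
proba = knn.predict_proba(vec.transform(test_texts))[0]
print("predicted discipline:", knn.classes_[proba.argmax()])
print("interdisciplinarity (vote entropy):", entropy(proba, base=2))
```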

    A Clustering Algorithm with Adaptive Cut-off Distance and Cluster Centers
    Yang Zhen, Wang Hongjun, Zhou Yu
    2018, 2 (3): 39-48.  DOI: 10.11925/infotech.2096-3467.2017.0889
    Abstract

    [Objective] This paper develops a new clustering algorithm that automatically calculates the cut-off distance and selects the cluster centers. [Methods] First, we proposed a new adaptive algorithm based on information entropy to determine the cut-off distance. Then, we extracted the cluster centers with the help of inflection points determined by the slope trend of the weights in the sorted chart. Finally, we compared the performance of the ADPC algorithm with those of the DBSCAN, DPC, DGCCD, and ACP algorithms on UCI and synthetic datasets. [Results] The ADPC algorithm automatically identified the cluster centers and significantly improved precision, F-measure, normalized mutual information, and runtime. [Limitations] The proposed algorithm’s performance on high-dimensional data, as well as its efficiency in processing large datasets, needs to be improved. [Conclusions] The proposed ADPC algorithm could effectively identify cluster centers and the cut-off distance on low-dimensional or arbitrary datasets.
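
    For readers unfamiliar with the underlying density-peak machinery, the sketch below computes the local density rho, the distance delta to the nearest denser point, and the weight gamma = rho * delta used to rank candidate cluster centers. The percentile heuristic standing in for the adaptive, entropy-based cut-off distance is an assumption, not the ADPC procedure itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_peaks(X, dc_percentile=2.0):
    d = cdist(X, X)
    dc = np.percentile(d[d > 0], dc_percentile)        # stand-in for adaptive dc
    rho = np.sum(np.exp(-(d / dc) ** 2), axis=1) - 1   # Gaussian kernel density
    order = np.argsort(-rho)                           # points by decreasing density
    delta = np.full(len(X), d.max())
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = d[i, order[:rank]].min()            # distance to nearest denser point
    return rho, delta, rho * delta                     # gamma ranks candidate centers

# Two toy Gaussian blobs; the two largest gamma values should mark their centers.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
rho, delta, gamma = density_peaks(X)
print("candidate centers:", np.argsort(-gamma)[:2])
```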

    Recommending Image Based on Feature Matching
    Liu Dongsu, Huo Chenhui
    2018, 2 (3): 49-59.  DOI: 10.11925/infotech.2096-3467.2017.1023
    Abstract

    [Objective] This paper develops an image recommendation model based on feature matching and the LSH algorithm, aiming to improve the accuracy of recommendations. [Methods] First, we extracted the images’ SIFT features as the matching criteria. Then, we modified the LSH algorithm to retrieve images in high-dimensional settings. Finally, we proposed an ICF-LSH algorithm based on collaborative filtering techniques to build the fusion recommendation model. [Results] We examined the proposed algorithm with various datasets and achieved better recall and precision rates for image recommendation. [Limitations] We only used SIFT features to represent the images; more research is needed to explore other matching features. [Conclusions] The proposed model improves the performance of image matching and recommendation systems.
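
    A minimal sketch of the feature-matching side is given below: SIFT descriptors are extracted with OpenCV and bucketed with a random-projection LSH so candidate matches can be found without exhaustive search. The class and parameter names are illustrative, and the collaborative-filtering fusion (ICF-LSH) is not reproduced here.

```python
import cv2
import numpy as np
from collections import defaultdict

def sift_descriptors(path):
    """Extract 128-dimensional SIFT descriptors from an image file."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc                                  # shape: (n_keypoints, 128)

class RandomProjectionLSH:
    def __init__(self, n_bits=16, dim=128, seed=0):
        self.planes = np.random.RandomState(seed).randn(n_bits, dim)
        self.buckets = defaultdict(list)

    def _hash(self, v):
        # Sign pattern against random hyperplanes = bucket key.
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, image_id, descriptors):
        for d in descriptors:
            self.buckets[self._hash(d)].append(image_id)

    def query(self, descriptors):
        votes = defaultdict(int)
        for d in descriptors:
            for image_id in self.buckets[self._hash(d)]:
                votes[image_id] += 1             # more shared buckets = better match
        return sorted(votes, key=votes.get, reverse=True)
```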

    A Label Propagation Algorithm Based on Speed Optimization and Community Preference
    Zhang Suqi, Gao Xing, Huo Shijie, Guo Jingjin, Gu Junhua
    2018, 2 (3): 60-69.  DOI: 10.11925/infotech.2096-3467.2017.0964
    Abstract

    [Objective] This paper aims to reduce unnecessary updates and improve the accuracy of the Label Propagation Algorithm. [Methods] First, we used a node information list to direct the update process and increase execution speed. Then, we proposed new updating rules based on node preference to improve the accuracy of community detection. [Results] Compared with the classic label propagation algorithm and two improved algorithms, the proposed one significantly reduced the number of iterations on large-scale social networks and improved the Normalized Mutual Information and F-measure values on the LFR benchmark network. [Limitations] The new algorithm’s updating sequence is random, which needs to be investigated in further studies. [Conclusions] The proposed SOCP_LPA algorithm improves both the accuracy of community detection and the processing speed.
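
    The snippet below sketches the baseline asynchronous label propagation that such improvements build on: every node repeatedly adopts the label most common among its neighbours until no label changes. The random update order shown here is exactly what the proposed node information list and preference rules are meant to replace; the example graph is just NetworkX's karate club.

```python
import random
from collections import Counter

import networkx as nx

def label_propagation(G, max_iter=100, seed=0):
    rng = random.Random(seed)
    labels = {node: node for node in G}          # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        nodes = list(G)
        rng.shuffle(nodes)                       # random update order (what SOCP_LPA refines)
        for node in nodes:
            neigh = [labels[n] for n in G[node]]
            if not neigh:
                continue
            best = Counter(neigh).most_common(1)[0][0]
            if labels[node] != best:
                labels[node], changed = best, True
        if not changed:                          # converged: no label updates this round
            break
    return labels

G = nx.karate_club_graph()
communities = label_propagation(G)
print(len(set(communities.values())), "communities found")
```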

    Mining News on Competitors with Sentiment Classification
    Wang Shuyi, Liao Huatao, Wu Chake
    2018, 2 (3): 70-78.  DOI: 10.11925/infotech.2096-3467.2017.0997
    Abstract

    [Objective] This paper aims to improve the efficiency of topic modeling from news reports and reduce the cost of competitive intelligence analysis. [Context] The proposed method could help competitive intelligence analysts accomplish environmental scanning tasks with the help of news reports. [Methods] First, we retrieved news stories with a web crawler. Then, we categorized these articles based on a sentiment analysis API. Third, we identified and visualized news topics with the Latent Dirichlet Allocation method. We used Python for data collection, cleaning, analysis, and visualization. [Results] We identified positive and negative sentiments as well as related keywords from news reports on the bike-sharing industry. [Conclusions] The proposed topic mining method based on sentiment analysis helps enterprises identify competitive advantages. It also improves the effectiveness of environmental scanning for competitive intelligence.
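
    A condensed sketch of such a pipeline is shown below: each news item is scored for sentiment (here with the open-source snownlp library as a stand-in for the commercial sentiment API), the corpus is split by polarity, and an LDA topic model is fitted to each half with gensim. The toy headlines and the 0.5 polarity threshold are assumptions.

```python
import jieba
from gensim import corpora, models
from snownlp import SnowNLP

# Two toy bike-sharing headlines: a deposit-refund complaint and a convenience praise.
news = ["共享单车押金难退引发用户不满", "共享单车方便了市民的短途出行"]

positive, negative = [], []
for text in news:
    tokens = list(jieba.cut(text))
    (positive if SnowNLP(text).sentiments >= 0.5 else negative).append(tokens)

# Fit a small LDA model per sentiment class and inspect its keywords.
for name, docs in [("positive", positive), ("negative", negative)]:
    if not docs:
        continue
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bow, num_topics=1, id2word=dictionary)
    print(name, lda.print_topics())
```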

    Using Text Mining to Discover Drug Side Effects: Case Study of PubMed
    Fan Xinyue, Cui Lei
    2018, 2 (3): 79-86.  DOI: 10.11925/infotech.2096-3467.2017.1047
    Abstract

    [Objective] This paper identifies potential side effects of drugs with the help of text mining, aiming to enrich existing databases and support early prediction of drug side effects. [Methods] A total of 100,873 articles covering roughly five years (2011-2016) were retrieved from the PubMed database. We generated a drug-side effect co-occurrence matrix using word segmentation in Perl and dictionary-based named entity recognition, and conducted bi-clustering analysis with gCLUTO and the R language. [Results] For one category of results, we found that the precision of the proposed method reached 75.65% and that 13.91% of the identified side effects were potential ones. [Limitations] We only used dictionary-based named entity recognition and did not consider grammatical or lexical factors, which yielded high false positive rates. [Conclusions] This paper proposes a new approach to detect unannounced drug side effects automatically and effectively.
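
    The snippet below is a toy illustration of the dictionary-based co-occurrence step in Python: each abstract is scanned for known drug names and side-effect terms, and joint appearances are counted into a matrix. The dictionaries and abstracts are made up, and the gCLUTO bi-clustering step is not shown.

```python
import numpy as np

drugs = ["aspirin", "metformin"]
effects = ["nausea", "headache"]
abstracts = [
    "aspirin was associated with nausea in several patients",
    "patients on metformin reported nausea and occasional headache",
]

matrix = np.zeros((len(drugs), len(effects)), dtype=int)
for text in abstracts:
    words = set(text.lower().split())
    for i, drug in enumerate(drugs):
        for j, effect in enumerate(effects):
            if drug in words and effect in words:
                matrix[i, j] += 1            # drug and effect co-occur in this abstract

print(matrix)   # rows: drugs, columns: side effects
```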

    Detecting Emerging Trends of Funds Based on DTM Model and Text Analytics: Case Study of NSF Graphene Field
    Xu Lulu, Wang Xiaoyue, Bai Rujiang, Zhou Yanting
    2018, 2 (3): 87-97.  DOI: 10.11925/infotech.2096-3467.2017.1085
    Abstract

    [Objective] This study tries to extract more semantic information from the science and technology literature, aiming to identify emerging trends from fund project documents. [Methods] First, we proposed a new trend detection method based on the DTM model and text analytics. Then, we identified the topic probability distributions of the fund projects and constructed a new theme detection formula based on the text features. Finally, we detected emerging trends in the NSF graphene field. [Results] The proposed method identified emerging trends of fund projects and provided information for technology innovation. [Limitations] We only examined the fund project documents from the perspectives of funding amount, document length, and theme. [Conclusions] The proposed method could effectively identify emerging trends of fund projects.
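
    As a rough sketch of the DTM step, the snippet below groups documents into yearly time slices and fits gensim's dynamic topic model (LdaSeqModel), so that a topic's word distribution can be followed across years and rising terms can hint at emerging trends. The documents and slice structure are toy placeholders, not NSF award records.

```python
from gensim import corpora
from gensim.models import LdaSeqModel

# Tokenized award abstracts grouped by year (toy data).
docs_by_year = {
    2015: [["graphene", "synthesis", "cvd"], ["graphene", "oxide", "film"]],
    2016: [["graphene", "sensor", "biosensor"], ["graphene", "battery", "electrode"]],
}

docs = [d for year in sorted(docs_by_year) for d in docs_by_year[year]]
time_slice = [len(docs_by_year[year]) for year in sorted(docs_by_year)]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=time_slice, num_topics=2)

# Word distribution of topic 0 in each time slice; rising terms hint at trends.
for t, year in enumerate(sorted(docs_by_year)):
    print(year, dtm.print_topic(topic=0, time=t, top_terms=3))
```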

    Visualizing Document Correlation Based on LDA Model
    Wang Li, Zou Lixue, Liu Xiwen
    2018, 2 (3): 98-106.  DOI: 10.11925/infotech.2096-3467.2017.1058
    Abstract

    [Objective] This paper tries to construct a data analysis model for scientific research topics based on machine learning. [Methods] First, we clustered the data with the Latent Dirichlet Allocation model. Then, we investigated the correlations among year, institution, and research type with the help of Python modules. Finally, we revealed and visualized the key research areas of each year and institution. [Results] We analyzed 101,813 papers and patents on graphene industry research. The proposed method finished the topic identification, correlation analysis, and visualization in about two minutes. [Limitations] More research is needed to explore network analysis issues. [Conclusions] Machine learning offers enormous potential for intelligence studies, especially large-scale text analytics and visualization.
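
    The snippet below sketches the correlation-and-visualization step: once each document carries an LDA-assigned topic, topics are cross-tabulated against metadata such as year and rendered as a heatmap with pandas and matplotlib. The small data frame is a made-up stand-in for the graphene papers and patents analysed in the paper.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Each row: one document with its publication year and LDA-assigned topic label.
records = pd.DataFrame({
    "year":  [2014, 2014, 2015, 2015, 2016, 2016],
    "topic": ["synthesis", "sensors", "synthesis", "energy", "energy", "sensors"],
})

# Rows: topics, columns: years, cells: number of documents.
table = pd.crosstab(records["topic"], records["year"])

plt.imshow(table.values, cmap="Blues", aspect="auto")
plt.xticks(range(len(table.columns)), table.columns)
plt.yticks(range(len(table.index)), table.index)
plt.colorbar(label="document count")
plt.title("Topic intensity by year")
plt.tight_layout()
plt.show()
```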
