[Objective] This paper aims to identify potential customers by analyzing user-generated content from product-specific online forums. [Methods] First, we converted the unbalanced dataset into multiple balanced subsets. Then, we employed the Stacking classification algorithm to construct the identification model. Finally, we compared the results of the proposed method with those of five baseline algorithms. [Results] Compared to the BayesNet, Logistic, C4.5, SMO and Naive Bayes algorithms, the F-measure of our method increased by 17.4%, 26.5%, 24.1%, 29.3%, and 40.9%, respectively. Compared to the plain Stacking, Bagging and Boosting methods, our F-measure increased by 10.1%, 5.9%, and 13.1%, respectively. [Limitations] We only examined the performance of the proposed method with data from the automotive industry. [Conclusions] The proposed method could effectively identify potential customers based on user-generated content.
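The first step above, turning one unbalanced dataset into several balanced subsets, can be sketched in plain Python. The function and the toy user records are illustrative assumptions, not the authors' code; the sketch simply partitions the majority class into minority-sized chunks (dropping any leftover samples) and pairs each chunk with the full minority class.

```python
import random

def balanced_subsets(majority, minority, seed=42):
    """Split an unbalanced dataset into several balanced subsets:
    the majority class is partitioned into chunks of the minority-class
    size, each paired with the full minority class. Leftover majority
    samples that do not fill a chunk are dropped."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    k = len(minority)
    subsets = []
    for i in range(0, len(shuffled) - k + 1, k):
        subsets.append(shuffled[i:i + k] + minority)
    return subsets

majority = [("user%d" % i, 0) for i in range(10)]   # non-customers
minority = [("buyer%d" % i, 1) for i in range(3)]   # potential customers
subsets = balanced_subsets(majority, minority)
```

Each of the resulting subsets can then be used to train one base classifier, with a Stacking meta-classifier combining their outputs.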
[Objective] Traditional point-of-interest recommendation methods are mostly based on simple context and can only recommend the most popular, cheapest, or closest points of interest. This paper combines time and category information with users' check-in records to address these shortcomings, capture users' preferences, and improve recommendation accuracy. [Methods] Point-of-interest recommendation is treated as a ranking problem. We propose ESSVM (Embedded Space ranking SVM), a ranking support vector machine model in an embedded space, to rank points of interest according to different features. User preferences are captured from check-in data, and machine learning models adjust the importance of different attributes in the ranking. [Results] Compared with UserCF, VenueCF, PoV, NNR and other recommendation methods, ESSVM not only captures individual heterogeneous preferences but also reduces model training time. [Limitations] Collecting and integrating contextual information from different location-based social networks (LBSNs) takes considerable work. In addition, if users reduce the granularity of time and category in ESSVM, they may need to address data sparseness. [Conclusions] The method accounts for the impact of time variation on user preferences, as well as the location categories that users visit at different times. Using contextual information and check-in records, it provides personalized recommendations.
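The core trick behind ranking SVMs is to reduce ranking to binary classification via a pairwise transform. The sketch below shows that generic transform only; the feature vectors and check-in-derived relevance scores are toy assumptions, and ESSVM's embedded space is not reproduced here.

```python
from itertools import combinations

def pairwise_transform(items):
    """Ranking-SVM style pairwise transform: each pair of points of
    interest with different relevance yields one binary sample whose
    features are the difference of the two feature vectors."""
    X, y = [], []
    for (f_a, r_a), (f_b, r_b) in combinations(items, 2):
        if r_a == r_b:
            continue  # equally relevant pairs carry no ranking signal
        X.append([a - b for a, b in zip(f_a, f_b)])
        y.append(1 if r_a > r_b else -1)
    return X, y

# (feature vector, relevance from check-in counts) -- toy POIs
pois = [([5.0, 1.0], 3), ([2.0, 0.5], 1), ([3.0, 2.0], 2)]
X, y = pairwise_transform(pois)
```

A linear SVM trained on (X, y) then yields a weight vector whose dot product with a POI's features serves as its ranking score.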
[Objective] This paper aims to construct a novelty index to evaluate academic achievements. [Methods] First, we proposed a model to calculate the content eigenfactor based on deep learning (Doc2Vec) and a Hidden Markov Model. Then, we built the topic novelty measure index. Finally, we examined the proposed method with academic papers published by three Chinese LIS journals in 2014. [Results] Compared with existing methods, the proposed model measured topic novelty more effectively. [Limitations] Our empirical research only examined the abstracts of the academic papers. [Conclusions] The proposed method could help us evaluate and monitor scholarly research.
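The paper's index combines Doc2Vec embeddings with a Hidden Markov Model; the sketch below is a much simpler embedding-based novelty signal (one minus the maximum cosine similarity to earlier papers' vectors), offered only as a hedged illustration of the idea, not the authors' model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_novelty(doc_vec, prior_vecs):
    """Novelty of a document embedding as 1 minus its maximum cosine
    similarity to earlier documents' embeddings: 0 if it duplicates
    an earlier topic, close to 1 if it resembles nothing seen before."""
    if not prior_vecs:
        return 1.0
    return 1.0 - max(cosine(doc_vec, p) for p in prior_vecs)

prior = [[1.0, 0.0], [0.0, 1.0]]            # toy embeddings of earlier papers
novelty_old = topic_novelty([1.0, 0.0], prior)  # duplicates an old topic
novelty_new = topic_novelty([1.0, 1.0], prior)  # partially new direction
```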
[Objective] This study aims to quantitatively examine interdisciplinary social science research with the help of automatic classification based on machine learning. [Methods] We used the KNN algorithm to classify social science papers indexed by CNKI and then proposed a new method to calculate their degree of interdisciplinarity. [Results] There were significant differences among the classification results of the disciplines. We also found a significant correlation between the classification results and the interdisciplinarity of papers. [Limitations] More quantitative research is needed to expand the present study. [Conclusions] Machine learning could effectively identify interdisciplinary social science studies.
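One plausible reading of the pipeline is: KNN assigns discipline labels, and a paper's degree of interdisciplinarity grows with the mixture of labels around it. The sketch below assumes a Shannon-entropy measure over the k nearest neighbours' labels; both the measure and the toy feature vectors are illustrative assumptions, not the paper's exact formula.

```python
import math
from collections import Counter

def knn_labels(vec, labeled, k=3):
    """Return the discipline labels of the k nearest labeled papers
    (squared Euclidean distance on the feature vectors)."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    nearest = sorted(labeled, key=lambda item: dist(vec, item[0]))[:k]
    return [lab for _, lab in nearest]

def interdisciplinarity(labels):
    """Shannon entropy of the discipline labels assigned to a paper:
    0 for a single-discipline paper, higher when labels are mixed."""
    total = len(labels)
    h = 0.0
    for c in Counter(labels).values():
        p = c / total
        h -= p * math.log2(p)
    return h

labeled = [([0, 0], "soc"), ([0, 1], "soc"), ([5, 5], "econ"), ([5, 6], "econ")]
labels = knn_labels([0, 0.5], labeled)
score = interdisciplinarity(labels)
```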
[Objective] This paper develops a new clustering algorithm aiming to automatically calculate the cut-off distance and select the cluster centers. [Methods] First, we proposed a new adaptive density peak clustering (ADPC) algorithm based on information entropy and the cut-off distance. Then, we extracted the cluster centers with the help of inflection points determined by the slope trend of the weights in the sorted chart. Finally, we compared the performance of the ADPC algorithm with those of the DBSCAN, DPC, DGCCD, and ACP algorithms on UCI and synthetic datasets. [Results] The ADPC algorithm automatically identified the cluster centers and significantly improved precision, F-measure, normalized mutual information, and runtime. [Limitations] The proposed algorithm's performance on high-dimensional data, as well as its efficiency on large datasets, needs to be improved. [Conclusions] The proposed ADPC algorithm could effectively identify cluster centers and the cut-off distance on low-dimensional or arbitrary datasets.
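ADPC builds on the classic density peak clustering quantities: each point's local density rho and its distance delta to the nearest higher-density point, with cluster centers having large rho * delta. The sketch below shows only these standard quantities with a fixed cut-off distance dc; ADPC's entropy-based adaptive choice of dc and the slope-based inflection detection are not reproduced.

```python
import math

def density_peaks(points, dc):
    """For each point compute its local density rho (number of
    neighbours within the cut-off distance dc) and delta (distance to
    the nearest point of higher density; for the densest points, the
    maximum distance). Candidate centres have large rho * delta."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta

# two toy clusters, each a centre surrounded by three satellites
pts = [(0, 0), (0, 1.5), (1.5, 0), (0, -1.5),
       (10, 10), (10, 11.5), (11.5, 10), (10, 8.5)]
rho, delta = density_peaks(pts, dc=2.0)
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(range(len(pts)), key=gamma.__getitem__, reverse=True)[:2]
```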
[Objective] This paper develops an image recommendation model based on feature matching and the LSH algorithm, aiming to improve the accuracy of recommendations. [Methods] First, we extracted the images' SIFT features as the matching criteria. Then, we modified the LSH algorithm to retrieve images in high-dimensional settings. Finally, we proposed an ICF-LSH algorithm based on collaborative filtering techniques to build the fusion recommendation model. [Results] We examined the proposed algorithm with various datasets and achieved better recall and precision rates for image recommendation. [Limitations] We only used SIFT to extract image features; more research is needed to explore other matching features. [Conclusions] The proposed model improves the performance of image matching and recommendation systems.
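The LSH idea that makes high-dimensional retrieval tractable can be illustrated with generic random-hyperplane hashing: similar feature vectors tend to fall into the same bucket, so candidate matches are found without comparing against every image. This is a textbook sketch, not the paper's modified LSH, and the toy vectors stand in for SIFT descriptors.

```python
import random

def make_hash(dim, n_planes, seed=0):
    """Random-hyperplane LSH: the signature records which side of each
    random hyperplane a vector falls on, so vectors with high cosine
    similarity tend to share a signature (bucket)."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    def signature(vec):
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )
    return signature

sig = make_hash(dim=4, n_planes=8)
buckets = {}
for name, vec in [("a", [1, 0, 2, 0]), ("b", [1.1, 0, 2.1, 0]), ("c", [-3, 5, 0, 1])]:
    buckets.setdefault(sig(vec), []).append(name)
```

At query time only the images in the query's bucket need full SIFT matching, which is what makes the scheme attractive in high dimensions.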
[Objective] This paper aims to reduce unnecessary updates and improve the accuracy of the Label Propagation Algorithm. [Methods] First, we used a node information list to direct the update process and increase execution speed. Then, we proposed new updating rules based on node preference to improve the accuracy of community detection. [Results] Compared with the classic label propagation algorithm and two improved algorithms, the proposed one significantly reduced the number of iterations on large-scale social networks and improved the Normalized Mutual Information and F-measure values on LFR benchmark networks. [Limitations] The new algorithm's updating sequence is random, which needs to be investigated in further studies. [Conclusions] The proposed SOCP_LPA algorithm improves both the accuracy of community detection and the processing speed.
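For context, the classic label propagation baseline that SOCP_LPA improves on can be sketched in a few lines: every node starts with its own label and repeatedly adopts the most frequent label among its neighbours. The node-information list and preference-based rules of the paper are not reproduced; ties are broken deterministically here (smallest label) purely to keep the toy example reproducible.

```python
from collections import Counter

def label_propagation(adj, max_iter=20):
    """Classic label propagation: every node starts with its own label
    and repeatedly adopts the most frequent label among its neighbours
    (smallest label on ties) until no label changes."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):           # fixed order for reproducibility
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

# two disconnected triangles: two communities
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
```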
[Objective] This paper aims to improve the efficiency of topic modeling from news reports and reduce the cost of competitive intelligence analysis. [Context] The proposed method could help competitive intelligence analysts accomplish environmental scanning tasks with the help of news reports. [Methods] First, we retrieved news stories with a web crawler. Then, we categorized these articles with a sentiment analysis API. Third, we identified and visualized news topics with the Latent Dirichlet Allocation method. We used Python for data collection, cleansing, analysis, and visualization. [Results] We identified positive and negative sentiments, as well as related keywords, from news reports on the bike-sharing industry. [Conclusions] The proposed topic mining method based on sentiment analysis helps enterprises identify competitive advantages. It also improves the effectiveness of environmental scanning for competitive intelligence.
[Objective] This paper identifies potential side effects of drugs with the help of text mining, aiming to enrich the contents of existing databases and support early prediction of drug side effects. [Methods] A total of 100,873 articles covering about five years (2011-2016) were retrieved from the PubMed database. We generated a drug-side-effect co-occurrence matrix and conducted gCLUTO bi-clustering analysis, using Perl-based segmentation, dictionary-based named entity recognition, and the R language. [Results] For one category of results, the precision of the proposed method reached 75.65%, and 13.91% of the identified side effects were potential ones. [Limitations] We only used the dictionary-based named entity recognition method and did not consider grammatical or lexical factors, which yielded high false positive rates. [Conclusions] This paper proposes a new approach to detect unannounced side effects of drugs automatically and effectively.
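The dictionary-based co-occurrence step can be sketched as follows. The paper used Perl segmentation plus R and gCLUTO; this Python sketch, with tiny toy dictionaries and sentences, only illustrates the counting idea: a drug and a side effect co-occur when they appear in the same sentence.

```python
from collections import defaultdict

# toy dictionaries; the paper's dictionaries are far larger
DRUGS = {"aspirin", "ibuprofen"}
EFFECTS = {"nausea", "headache", "rash"}

def cooccurrence(sentences):
    """Dictionary-based co-occurrence counting: increment the cell
    (drug, effect) whenever both terms appear in the same sentence."""
    matrix = defaultdict(int)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for drug in DRUGS & words:
            for effect in EFFECTS & words:
                matrix[(drug, effect)] += 1
    return matrix

sentences = [
    "Aspirin was associated with nausea in two patients",
    "Patients on aspirin reported nausea and headache",
    "Ibuprofen caused a mild rash",
]
matrix = cooccurrence(sentences)
```

The resulting matrix is exactly the kind of input a bi-clustering tool such as gCLUTO consumes; note the false-positive risk the Limitations mention, since bag-of-words matching ignores grammar (e.g. negated mentions still count).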
[Objective] This study tries to extract more semantic information from the science and technology literature, aiming to identify emerging trends from funded project documents. [Methods] First, we proposed a new trend detection method based on the DTM model and text analytics. Then, we identified the topic probability distributions of the funded projects and constructed a new theme detection formula based on text features. Finally, we detected emerging trends in NSF-funded graphene research. [Results] The proposed method identified emerging trends in funded projects and provided information for technology innovation. [Limitations] We only examined the funded project documents from the perspectives of funding amount, document length, and theme. [Conclusions] The proposed method could effectively identify emerging trends in funded projects.
[Objective] This paper tries to construct a data analysis model for scientific research topics based on machine learning. [Methods] First, we clustered the data with the Latent Dirichlet Allocation model. Then, we investigated the correlations among year, institution, and research type with the help of Python modules. Finally, we revealed and visualized the key research areas of each year and institution. [Results] We analyzed 101,813 papers and patents on graphene industry research. The proposed method finished the topic identification, correlation analysis, and visualization in about two minutes. [Limitations] More research is needed to explore network analysis issues. [Conclusions] Machine learning shows enormous potential for intelligence studies, especially large-volume text analytics and visualization.
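The year/institution correlation step can be sketched with standard Python modules. The toy records below substitute raw keywords for the paper's LDA topics, purely for illustration; the grouping-and-counting pattern is the same either way.

```python
from collections import Counter, defaultdict

# (year, institution, keywords) -- illustrative toy records; the paper
# would use LDA topic assignments instead of raw keywords
records = [
    (2015, "Univ A", ["graphene", "battery"]),
    (2015, "Univ B", ["graphene", "sensor"]),
    (2016, "Univ A", ["battery", "electrode"]),
    (2016, "Univ A", ["graphene", "battery"]),
]

def top_terms_by(records, key_index, n=2):
    """Aggregate term counts per year (key_index=0) or institution
    (key_index=1) -- the kind of table built before visualization."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[key_index]].update(rec[2])
    return {k: c.most_common(n) for k, c in groups.items()}

by_year = top_terms_by(records, 0)
by_inst = top_terms_by(records, 1)
```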