[Objective] This paper reviews literature-based discovery (LBD) studies, aiming to explore the latest progress, development trends and challenges in this field. [Coverage] We searched “literature-based discovery” or “literature and knowledge discovery” in Chinese and English with the Web of Science, CNKI and Baidu Academic for research published from 2010 to 2020. A total of 72 representative publications were chosen for review. [Methods] Firstly, we summarized these studies in terms of research objects, methods and techniques, results, and typical applications. We then discussed future development trends and challenges facing LBD. [Results] The research objects of LBD are becoming more complicated, while the analysis methods and techniques are becoming more intelligent. The discovery results were further enriched, which led to more LBD applications. LBD faces several challenges, such as multi-source heterogeneous data fusion, interpretability of knowledge discovery, evaluation of results, and collaboration among multi-disciplinary experts. [Limitations] We did not extensively examine LBD tools/systems or industry applications. [Conclusions] As an interdisciplinary research field of information science, informatics and data science, LBD is of great significance for mining knowledge and providing high-quality subject knowledge services.
[Objective] This paper summarizes the research development trends of information retrieval, aiming to promote interdisciplinary studies and the application of related technologies. [Methods] First, we used an LDA model to identify topics of papers accepted by the SIGIR Annual Conference from 2008 to 2019. Second, we removed irrelevant papers based on the similarity between documents and topics, and grouped papers into multiple categories by calculating topic discrimination. Third, we constructed the evolution path of domain topics in time series, which showed increasing, decreasing and stable patterns. Finally, we created the fine-grained evolution path of a single topic through modular communities, which demonstrated the dynamic evolution process of knowledge units within the topics. [Results] The proposed method avoids the interference of irrelevant documents when identifying topics and evolution paths. The multi-topic classification of documents helps reveal the cross-fusion among topics. Current information retrieval research trends include user-centric design, continuously optimized models, filtering and recommendation, semantic web technology, deep learning methods, as well as medical and health information retrieval. [Limitations] Removing irrelevant documents and categorizing documents with multiple topics might be subjective. [Conclusions] Intelligent information services are becoming a new norm, and users’ needs for information retrieval become more prominent.
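The irrelevant-document removal step described above can be sketched as a threshold filter over document-topic distributions. This is an illustrative stand-in, not the paper’s actual procedure; the threshold value and the toy distributions are assumptions.

```python
# Hypothetical sketch of filtering irrelevant papers by document-topic
# similarity: given each paper's topic distribution (e.g. from an LDA model),
# papers whose strongest topic probability falls below a threshold are
# treated as noise. Threshold and data are illustrative.

def filter_by_topic_similarity(doc_topic_dists, threshold=0.4):
    """Keep documents whose maximum topic weight reaches the threshold."""
    kept = []
    for doc_id, dist in doc_topic_dists.items():
        if max(dist) >= threshold:
            kept.append(doc_id)
    return kept

papers = {
    "p1": [0.70, 0.20, 0.10],  # clearly about topic 0 -> kept
    "p2": [0.34, 0.33, 0.33],  # no dominant topic -> filtered out
    "p3": [0.05, 0.55, 0.40],  # dominant topic 1 -> kept
}
print(filter_by_topic_similarity(papers))  # ['p1', 'p3']
```

As the paper’s limitations note, the threshold choice is subjective: raising it discards more borderline documents.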
[Objective] This paper explores issues facing topic modeling, such as the lack of context, weak interpretability, and poor IPC integration. [Methods] First, we proposed the concept of context enhancement. Then, we built a Context-LDA model using both the IPC and the extracted vocabulary as the training corpus. Third, we implemented our topic model with Python, and compared its generalization and topic representation abilities with those of traditional LDA models. [Results] We examined the proposed model with 38,354 graphene patents. The new model had lower perplexity values (below 100) and a strong generalization ability in different scenarios. Its JS value was about 0.1 higher than that of the traditional LDA model. The combined IPC and topic words represented each other and enhanced topic readability. The average IPC position was 9.6/20 with little noise. [Limitations] The vocabulary representation under the new model needs to be expanded from uni-grams to n-grams. [Conclusions] Topic models play an important role in supporting the analysis of patent topics, and more effective and accurate models should be developed based on actual needs.
[Objective] This paper analyzes online public opinion events to determine their attributes and classification, so that when an online public opinion event occurs, we can predict in advance whether it will reverse. This study not only helps governments adjust the direction of public opinion in time but also protects the credibility of governments and the media. [Methods] First, we retrieved representative online public opinion events from the past five years. Then, we used an improved SMOTE algorithm to balance the distribution of the data set. Third, we built a prediction model for online public opinion reversal based on neural network ensemble learning. Finally, we evaluated the model’s performance and internal mechanism with online public opinion events from 2020. [Results] The accuracy of the proposed model reached 99%, and its F and AUC values were both 0.99. [Limitations] We only chose some characteristics from public opinion reversal events. Therefore, the model cannot comprehensively represent all reversal events occurring in the future. [Conclusions] The constructed model can accurately predict whether or not a public opinion event will reverse.
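The class-balancing step can be illustrated with a minimal standard SMOTE oversampler; this is the basic algorithm, not the paper’s improved variant, and all data points are toy values.

```python
import random

# Minimal SMOTE-style oversampling (illustrative, not the paper's improved
# algorithm): each synthetic minority sample is interpolated between a
# minority point and one of its k nearest minority neighbors.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote(minority, n_new, k=2, seed=42):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted((q for q in minority if q is not p),
                           key=lambda q: euclidean(p, q))[:k]
        q = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + lam * (y - x) for x, y in zip(p, q)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]  # toy minority-class points
new_points = smote(minority, n_new=3)
print(len(new_points))  # 3
```

Because each new point lies on a segment between two real minority samples, the oversampled set stays inside the minority region rather than duplicating points exactly.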
[Objective] This paper extracts users’ opinions from videos to analyze their sentiments with the help of multi-modal methods. [Methods] First, we introduced bimodal and trimodal context information to obtain interaction data among the text, visual and audio modalities. Then, we used an attention mechanism to filter redundant information. Finally, we conducted sentiment analysis with the processed data. [Results] We examined the proposed method with the MOSEI dataset. The accuracy and F1 value of sentiment classification reached 80.27% and 79.23%, which were 0.47% and 0.87% higher than the best results of the benchmark method. The mean absolute error of the regression analysis was reduced to 0.66. [Limitations] There was an overfitting issue in model training due to the small size of the MOSI dataset, which limited the effects of sentiment prediction. [Conclusions] The proposed model uses the interaction among different modalities and effectively improves the accuracy of sentiment prediction.
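The attention-based filtering of redundant cross-modal information can be sketched with plain scaled dot-product attention; this stands in for the paper’s mechanism, and the modality feature vectors below are invented toy values, not MOSEI features.

```python
import math

# Illustrative scaled dot-product attention over cross-modal context vectors,
# standing in for the paper's redundancy-filtering attention. All vectors
# are toy values.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Weight each context vector by its relevance to the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# One "context" vector per modality pair: text-audio, text-visual, audio-visual.
contexts = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
query = [1.0, 0.0]  # e.g. the text representation
fused = attend(query, contexts, contexts)
print(len(fused))  # 2
```

Contexts that align poorly with the query receive low attention weights, which is how redundant modality interactions get suppressed in the fused representation.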
[Objective] This study explores the key factors that influence the investment decision-making behaviors of rewarded crowdfunding users. [Methods] First, we applied psychological distance theory to define text emotional distance and its three dimensions. Then, we developed a measurement index for this distance based on text analysis. Third, we constructed an econometric model to investigate the influence of text emotional distance on users’ investment decisions. Finally, we conducted an empirical analysis with 161,279 project descriptions from Kickstarter. [Results] The positive emotional tendency, affinity and interactivity of the texts have significant positive impacts on users’ investment decisions. Negative emotional tendency poses significant negative effects on investors. The influence of text emotional distance changes with different project categories. [Limitations] Findings from our study may not apply to other crowdfunding businesses. Also, qualitative research on text language in psychology and sociology is limited due to technical issues. [Conclusions] Rewarded crowdfunding projects could improve their financing rates with positive emotional tendency and affinity in their descriptions.
[Objective] This paper analyzes online comments by professional critics and average viewers, aiming to improve the sentiment classification of reviews. [Methods] First, we introduced the professional backgrounds of contributors to examine the emotional polarity of reviews. Then, we used a generative adversarial network to decide whether a contributor was a professional critic or an average viewer. Finally, we identified their differences to further improve the accuracy of emotion classification. [Results] The accuracy rate of the proposed model reached 83.6%, which was 5.6% higher than the benchmark model LSTM and 4.4% higher than BiLSTM. [Limitations] We only studied movie reviews, and more research is needed to evaluate our model with data sets from other fields. [Conclusions] The proposed GJOINT model can effectively improve the results of sentiment classification of online reviews.
[Objective] This study introduces word semantics to the TextRank algorithm, aiming to improve the performance of keyword extraction methods. [Methods] First, we used the semantic information from HowNet to calculate the similarity of words. Then, we constructed a graph and a matrix for word pairs whose semantic similarity passed a threshold. Finally, the semantic matrix and the co-occurrence matrix were weighted to obtain a transition probability matrix. [Results] The improved algorithm is better than TextRank, TF-IDF and LDA on short texts, increasing the F-scores by 6.6%, 9.0% and 10.3% respectively. On long texts, the results were inferior to TF-IDF but close to TextRank. [Limitations] The segmentation program could not effectively identify compound words, new words and entities, which led to incomplete keywords and reduced F-scores. In addition, the semantic similarity algorithm could also be improved. [Conclusions] The proposed method effectively extracts keywords from short texts with the help of co-occurrence and semantic relations of words.
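The matrix-fusion and ranking steps can be sketched as follows. The fusion weight, damping factor and both toy matrices are assumptions for illustration; only the overall shape (weighted combination, normalization, power iteration) follows the described method.

```python
# Sketch of fusing a semantic similarity matrix and a co-occurrence matrix
# over the same candidate words into one transition probability matrix,
# then ranking words TextRank-style. Weights and data are illustrative.

def fuse_and_rank(semantic, cooccur, alpha=0.5, d=0.85, iters=50):
    n = len(semantic)
    # Weighted combination of the two word-relation matrices.
    fused = [[alpha * semantic[i][j] + (1 - alpha) * cooccur[i][j]
              for j in range(n)] for i in range(n)]
    # Column-normalize to obtain transition probabilities.
    col_sums = [sum(fused[i][j] for i in range(n)) or 1.0 for j in range(n)]
    trans = [[fused[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    # Standard PageRank/TextRank power iteration.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(trans[i][j] * scores[j]
                                        for j in range(n)) for i in range(n)]
    return scores

# Toy matrices over three candidate words (symmetric, zero diagonal).
semantic = [[0, 0.8, 0.1], [0.8, 0, 0.3], [0.1, 0.3, 0]]
cooccur  = [[0, 2.0, 1.0], [2.0, 0, 0], [1.0, 0, 0]]
scores = fuse_and_rank(semantic, cooccur)
```

Words with the highest converged scores are emitted as keywords; varying `alpha` trades off semantic similarity against raw co-occurrence evidence.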
[Objective] This paper aims to address the issues facing document management systems caused by Chinese authors sharing the same name. [Methods] We built author entities with “author name + institution name” based on bibliographic data. Then, we used the attributes of author entities to construct six similarity features from three aspects. Third, we merged these features by principal component analysis or direct weight assignment. Finally, we evaluated the performance of the proposed method. [Results] Our methods significantly reduced processing time. Their F1 values on the LIS dataset were 70.74% and 70.42%, while their F1 values on the economics dataset were 81.90% and 80.93%. [Limitations] The attributes used in this research were only retrieved from the metadata of the papers. [Conclusions] The proposed method could improve the weight setting of multiple features.
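The direct-weight-assignment fusion can be sketched as a weighted sum over per-pair similarity features. The feature names, weights and threshold below are hypothetical illustrations, not the paper’s actual six features.

```python
# Illustrative direct-weight fusion of similarity features for author-name
# disambiguation. Feature names, weights and the decision threshold are
# assumptions, not the paper's actual configuration.

def same_author_score(features, weights):
    """Weighted sum of similarity features, each assumed to lie in [0, 1]."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"coauthor": 0.4, "venue": 0.3, "keyword": 0.3}
pair = {"coauthor": 0.9, "venue": 0.5, "keyword": 0.7}  # toy feature values
score = same_author_score(pair, weights)
print(score >= 0.6)  # above the threshold, merge the two author entities
```

A PCA-based variant would instead derive the weights from the feature covariance structure rather than assigning them directly.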
[Objective] This paper proposes an adaptive recommendation model based on users’ behaviors, aiming to address the issue that a single model only works for one user type. [Methods] We standardized the recommendation process with a three-tier collaborative structure. The first layer classified users to create different recommendation channels. The second layer matched the improved recommendation sub-algorithms with the corresponding channels. The third layer introduced feature weighting to form a recommendation pool, from which items were selected and recommended to users. [Results] The accuracy, recall, coverage and popularity of the proposed model were 0.24, 0.17, 0.50 and 4.40, which were better than those of the mainstream models. [Limitations] Our recommendation algorithm cannot work on datasets without scores. [Conclusions] The proposed model can learn the preferences of users and make better recommendations.
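The three-tier structure can be sketched as a dispatcher: classify the user, route to a channel sub-algorithm, then select from its pool. The user types, channel rules and catalog below are invented for illustration only.

```python
# Toy sketch of a three-tier collaborative recommendation structure.
# User types, channel algorithms and the catalog are all hypothetical.

def classify_user(history):
    """Tier 1: route users into channels by activity level."""
    return "active" if len(history) >= 3 else "cold"

def recommend_active(history, catalog):
    """Tier 2 channel for active users: unseen items (placeholder logic)."""
    return [item for item in catalog if item not in history]

def recommend_cold(history, catalog):
    """Tier 2 fallback channel for cold-start users: whole catalog."""
    return list(catalog)

CHANNELS = {"active": recommend_active, "cold": recommend_cold}

def recommend(history, catalog, top_n=2):
    """Tier 3: draw the final items from the matched channel's pool."""
    pool = CHANNELS[classify_user(history)](history, catalog)
    return pool[:top_n]

catalog = ["a", "b", "c", "d"]
print(recommend(["a", "b", "c"], catalog))  # ['d']
```

In the paper’s model, tier 3 additionally applies feature weighting when selecting from the pool; the slice here is only a placeholder for that step.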
[Objective] This paper proposes a method to evaluate the consistency of scholarly journal article reviewers. [Methods] We developed a consistency index based on the knowledge from reviews and bibliometric data. Then, we conducted a hypothesis test to examine whether experts with higher consistency scores make more accurate evaluations of papers. [Results] We found that high-consistency experts could identify papers with academic community recognition, and this ability was maintained over time. [Limitations] The proposed consistency index cannot replace journal editors in selecting experts; however, it helps make reviewer selection more efficient and effective. [Conclusions] It is feasible to calculate the consistency index based on historical data to select reviewers of scholarly articles.
[Objective] This paper tries to extract and integrate domain knowledge from heterogeneous data based on knowledge elements, aiming to enrich the semantic information of knowledge representation. [Methods] We proposed a new method to extract and represent knowledge based on a semantic description model with knowledge elements. Then, we examined our model in the field of information retrieval. [Results] We extracted 4,200 knowledge elements and 3,020 entities on information retrieval from Wikipedia and two classic textbooks. We could query the relationships between knowledge elements and their entities. [Limitations] The semantic relations among knowledge elements were not adequately explored, and the process of knowledge extraction was not fully automated. [Conclusions] This paper improves the semantics of knowledge representation and provides new perspectives for domain knowledge services.
[Objective] This paper proposes a model for multiple-choice reading comprehension, and then explores the impacts of question type and answer length on machine reading comprehension. [Methods] First, we used a multi-perspective matching mechanism to obtain the correlations between the articles, questions and candidate answers. Then, we multiplied the correlations with the articles to create the vector representations of questions and candidate answers. Third, we extracted sentence-level and document-level features, which were used to select the correct answers. Fourth, we categorized the data based on question type and answer length. Finally, we analyzed their impacts on the machine’s choice of correct answers. [Results] The accuracy of our model on the RACE-M, RACE-H and RACE datasets reached 72.5%, 63.1% and 66.1% respectively. [Limitations] The multi-perspective matching mechanism has four matching strategies and multiple angles, which makes the model consume a lot of memory and require longer processing times at the interaction layer. [Conclusions] The proposed model can effectively match articles with questions and answers. The accuracy of the model is more affected by the type of question than by the length of answer.
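The core idea of multi-perspective matching can be sketched with a single strategy: each perspective is a learned weight vector that reweights both inputs before a cosine similarity, producing one match score per perspective. The model itself uses four strategies; the vectors and perspective weights here are toy values.

```python
import math

# Single-strategy sketch of multi-perspective matching: each perspective
# (a weight vector) reweights both inputs elementwise before cosine
# similarity, giving one score per perspective. All values are toy data.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def multi_perspective_match(v1, v2, perspectives):
    """One cosine score per perspective weight vector."""
    return [cosine([w * x for w, x in zip(p, v1)],
                   [w * y for w, y in zip(p, v2)]) for p in perspectives]

article_vec = [0.2, 0.9, 0.1]
answer_vec  = [0.3, 0.8, 0.0]
perspectives = [[1.0, 1.0, 1.0], [0.1, 1.0, 0.1]]  # two learned "views"
scores = multi_perspective_match(article_vec, answer_vec, perspectives)
print(len(scores))  # 2
```

Stacking many such perspectives (and several matching strategies) is what drives the memory and time costs noted in the limitations.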