[Objective] This paper explores new methods for deep subject knowledge discovery using multi-source heterogeneous data. [Methods] First, we constructed an SPO semantic network of the literature to create the core domain knowledge graph. Then, we implemented multi-source heterogeneous data fusion through “entity alignment, concept-level fusion and relationship fusion” to obtain the whole-domain knowledge graph. Finally, we discovered deep subject knowledge with the help of this knowledge graph. We examined our method with data on Hematopoietic Stem Cell for Cancer Treatment (HSCCT). [Results] This paper proposed a knowledge graph-based framework for subject knowledge discovery (KGSKD), which fuses multi-source heterogeneous data across multiple dimensions and at a fine granularity, enriches semantic relationships among data, and natively supports knowledge discovery techniques such as knowledge inference, path finding, and link prediction. [Limitations] KGSKD has some limitations, including data supersaturation, poor interpretability of knowledge discovery results, and difficulty in communicating with domain experts. [Conclusions] KGSKD has the advantages of “richer data types”, “more comprehensive knowledge linkage”, “more advanced mining methods” and “deeper discovery results”, which effectively support research and services of deep knowledge discovery in life sciences and medicine.
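The entity-alignment step of the fusion pipeline above can be sketched in miniature. This is a hypothetical illustration, not the paper's method: it merges two SPO triple sets by mapping entity surface forms from a second source onto near-identical names in the core graph using string similarity (real alignment models use embeddings and richer evidence). All entity names and the `fuse` function are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Surface-form similarity in [0, 1] via difflib's ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuse(triples_a, triples_b, threshold=0.85):
    # Entity names already present in the core knowledge graph.
    known = {s for s, _, _ in triples_a} | {o for _, _, o in triples_a}
    mapping = {}
    for s, p, o in triples_b:
        for name in (s, o):
            if name in mapping:
                continue
            best = max(known, key=lambda e: similarity(e, name), default=None)
            # Align to an existing entity only above the similarity threshold;
            # otherwise keep the new surface form as a new entity.
            mapping[name] = best if best and similarity(best, name) >= threshold else name
    return set(triples_a) | {(mapping[s], p, mapping[o]) for s, p, o in triples_b}

# Toy triples loosely themed on the HSCCT case study.
core = {("hematopoietic stem cell", "treats", "leukemia")}
extra = {("Hematopoietic Stem Cell", "derived_from", "bone marrow")}
fused = fuse(core, extra)
```

Because the two capitalizations of the stem-cell entity align, the fused graph contains one entity node with both relations rather than two duplicates.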
[Objective] This paper tries to automatically identify the hypernym-hyponym relations of domain concepts and establish their ontology. [Methods] First, we combined the traditional unsupervised pattern-based method and the advanced supervised projection learning method to automatically extract domain concepts. Then, we examined the new method with an empirical study. [Results] The proposed method could identify the hypernym sets of domain concepts. The identification accuracy in the medical and general fields, as well as on the benchmark dataset BLESS, was 0.88, 0.83, and 0.85, respectively. [Limitations] More research is needed to reduce the weight of high-frequency top-level words and to improve corpus quality. Some relationships were also misidentified. [Conclusions] The proposed model could find hypernyms with different meanings for the same concept, and could also extract low-frequency words and named entities.
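The unsupervised pattern-based side of the method above is typically built on Hearst-style lexico-syntactic patterns. A minimal sketch, assuming just two patterns (“X such as Y1, Y2 and Y3”, “X, including Y1 and Y2”) and a naive split of the enumerated hyponym list — real systems use many more patterns and proper parsing:

```python
import re

# Two illustrative Hearst-style patterns (assumed for this sketch).
PATTERNS = [
    re.compile(r"(\w+) such as ([\w ,]+)"),
    re.compile(r"(\w+), including ([\w ,]+)"),
]

def extract_hypernym_pairs(sentence):
    pairs = set()
    for pat in PATTERNS:
        for m in pat.finditer(sentence):
            hypernym = m.group(1)
            # Split the enumerated hyponym list on commas and "and".
            for hyponym in re.split(r",\s*|\s+and\s+", m.group(2)):
                if hyponym:
                    pairs.add((hyponym.strip(), hypernym))
    return pairs

pairs = extract_hypernym_pairs("diseases such as diabetes, anemia and leukemia.")
```

Each extracted pair reads (hyponym, hypernym); pattern-based extraction like this has high precision but low recall, which is why the paper pairs it with supervised projection learning.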
[Objective] This paper proposes a new method using a hierarchical attention network, aiming to effectively recognize the structure functions of scholarly articles. [Methods] First, we constructed a network model with different-grained hierarchical attention to automatically identify the functions of text structures. Then, we examined the performance of our method with four datasets from PLoS. The same tests were also applied to traditional machine learning models with text feature vectors, as well as to the BERT model. We also modified the proposed model in accordance with the test results. Third, we evaluated the performance of the new model with articles from Atmospheric Chemistry and Physics and assessed the model's transferability to other domains. [Results] At the sentence level, our model (using Bi-LSTM+Attention as the encoder) outperformed the others (Macro-F1: 0.8661). However, this model did not perform well in unrelated fields (minimum Macro-F1: 0.4554). [Limitations] The model cannot recognize the functions of mixed-structure texts, or the logical relationships within these structures. [Conclusions] The proposed model could effectively recognize structure functions at the sentence level, which expands research on the full text of scholarly literature.
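The attention-pooling step that sits on top of the Bi-LSTM encoder in models like the one above can be shown without any deep learning framework. This is a dependency-free sketch of dot-product attention with a context vector — the token vectors, context vector, and dimensions are toy values, and the real model learns all of them:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    # Score each token vector by dot product with the context vector,
    # then return the attention-weighted sum as the sentence vector.
    scores = [sum(v * c for v, c in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# Two toy 2-d token vectors; the context vector favors the first one.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Tokens aligned with the context vector receive larger weights, so the pooled sentence representation emphasizes them; stacking this at word and sentence level yields the hierarchical attention the paper describes.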
[Objective] This paper develops an automatic method for classification indexing, aiming to better manage massive information resources and conduct knowledge discovery. [Methods] First, we analyzed the relationship between keywords (e.g., subject terms/concepts) and classification numbers. Then, we designed a multi-factor weighted algorithm. Finally, we proposed a scheme for automatic classification indexing. [Results] We examined our method with annotated corpora of authoritative domains and standard data sets. For literature with a single subject classification number, the precision, recall, and F values were 84.1%, 79.8%, and 81.9%, respectively. For literature with two subject classification numbers, the precision, recall, and F values were 83.4%, 78.8%, and 81.0%. [Limitations] The accuracy and completeness of our method rely on high-quality corpora, and the indexing of interdisciplinary literature needs to be improved. [Conclusions] The proposed method could effectively complete classification indexing tasks.
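The core of a keyword-to-classification-number scheme like the one above can be sketched as weighted voting: each keyword votes for its candidate class numbers with a factor weight, and the top-scoring numbers are assigned. The mapping, the class numbers, and the single scalar weight per keyword are all invented for illustration — the paper's multi-factor weighting is richer:

```python
def rank_class_numbers(keywords, keyword_classes, weights):
    # Sum weighted votes from each keyword for its candidate
    # classification numbers, then rank numbers by total score.
    scores = {}
    for kw in keywords:
        for cls in keyword_classes.get(kw, ()):
            scores[cls] = scores.get(cls, 0.0) + weights.get(kw, 1.0)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical keyword-to-class mapping with toy class codes.
mapping = {
    "neural network": ["TP183"],
    "deep learning": ["TP183", "TP18"],
    "library science": ["G25"],
}
ranked = rank_class_numbers(["neural network", "deep learning"],
                            mapping, {"neural network": 2.0})
```

Here "TP183" collects votes from both keywords and ranks first, mimicking how literature with a single dominant subject gets one classification number, while near-ties would suggest a second number.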
[Objective] This paper proposes a model to predict online ratings with the help of network representation learning and XGBoost—N2V_XGB. [Methods] First, we retrieved metadata and existing online rating data. Then, we extracted and merged the similarity weights of the collected data to construct a homogeneous relationship network. Third, we used network representation learning to automatically extract user and item features. Finally, we fed these data to XGBoost and obtained the best model through iterative training. [Results] The MAE and RMSE of the proposed N2V_XGB model were 0.6867 and 0.8737, lower than those of four classic models. [Limitations] We did not make good use of time features, and the prediction results did not reflect time-series changes. [Conclusions] The proposed N2V_XGB model effectively addresses the data sparseness issue and improves the prediction accuracy of user ratings.
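The network-representation-learning stage of N2V_XGB starts from random walks over the relationship network. A minimal sketch of the walk generator, assuming the unbiased special case of node2vec (p = q = 1) and a toy user-item adjacency list — the walks would then be fed to a skip-gram model to learn embeddings, which this sketch omits:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=42):
    # Truncated random walks starting from every node; each step
    # moves uniformly to a neighbor (node2vec with p = q = 1).
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

# Toy homogeneous user-item network (symmetric adjacency list).
graph = {"u1": ["i1", "i2"], "u2": ["i2"], "i1": ["u1"], "i2": ["u1", "u2"]}
walks = random_walks(graph)
```

Every consecutive pair in a walk is an edge of the network, so the walk corpus preserves neighborhood structure; the embeddings trained on it become the user/item features passed to XGBoost.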
[Objective] This paper uses a deep learning method to predict possible readmissions of patients based on their electronic medical records, aiming to improve hospital management. [Methods] We proposed a model based on a character-level convolutional neural network to process the unstructured texts. Then, we combined it with structured data (demographics, clinical records, and administrative data) to predict hospital readmission cases. [Results] The deep learning model combining structured and unstructured data yielded better prediction results, with an F1-score of 0.735. Compared with the models using only structured or unstructured data, the F1-score improved by 12.9% and 2.1%, respectively. [Limitations] The experimental medical records were collected from a single hospital, which may affect the generality of the prediction results. [Conclusions] The proposed model provides references for researchers of hospital readmission prediction and for hospital administrators.
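A character-level CNN like the one above consumes text as a fixed-size matrix of one-hot character rows rather than word embeddings. A dependency-free sketch of that input encoding, with an assumed alphabet and sequence length (the convolution layers themselves are omitted):

```python
def char_encode(text, alphabet="abcdefghijklmnopqrstuvwxyz ", max_len=16):
    # One row per character position, one column per alphabet symbol.
    # Positions past the end of the text stay all-zero, as do rows for
    # characters outside the alphabet.
    index = {c: i for i, c in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        if ch in index:
            matrix[pos][index[ch]] = 1
    return matrix

# Toy clinical-note snippet; real notes are much longer and the
# alphabet usually includes digits and punctuation.
encoded = char_encode("chest pain")
```

Working at the character level sidesteps the heavy abbreviation and misspelling problems of clinical free text, which is presumably why the paper chose it for unstructured records.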
[Objective] This paper examines the influence of clue consistency on users’ booking decisions for shared accommodations. [Methods] First, based on Clue Consistency Theory, we constructed a research model from the perspectives of User-Generated Content (UGC) and Marketer-Generated Content (MGC). Then, we conducted an empirical study on data collected from Xiaozhu.com, a well-known short-term renting website in China. Finally, we examined the impacts of clue consistency on tenants’ purchase decisions. [Results] The purchase decisions of tenants were positively correlated with the text clues of UGC and the warm-color pictures of MGC. Also, the information consistency between UGC and MGC had significant positive impacts on purchase decisions. [Limitations] More image parameters need to be extracted in future research, which will help us identify home styles. [Conclusions] This study could help shared accommodation platforms and landlords improve their services.
[Objective] This paper tries to predict the daily number of theft activities. [Methods] We used an LSTM network to analyze theft data from a large city in north China. First, we retrieved data from January 1, 2005 to February 24, 2007 and from January 1, 2009 to January 7, 2011, respectively. Then, we set up three different cases to examine the time-series prediction of the daily number. Finally, we compared our results with those of ARIMA, Support Vector Regression, Random Forest, and XGBoost on the same data set. [Results] The percentage root mean square error (PRMSE) of our model was 18.4%, 11.7%, and 41.9% in the three cases, respectively, better than that of the ARIMA, Support Vector Regression, Random Forest, or XGBoost model. [Limitations] More research is needed to predict the periods when the number of theft crimes fluctuates dramatically. [Conclusions] The proposed model could improve decision making for community safety, police patrols, and other specific missions.
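The PRMSE metric reported above can be sketched concretely. One common definition — RMSE normalized by the mean of the observed series, expressed in percent — is assumed here; the paper's exact formula may differ:

```python
import math

def prmse(actual, predicted):
    # Root mean square error divided by the mean of the actual
    # series, reported as a percentage.
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return 100.0 * rmse / (sum(actual) / n)

# Toy daily counts: errors of ±1 around a mean of 10 give PRMSE = 10%.
score = prmse([10, 10, 10, 10], [9, 11, 9, 11])
```

Normalizing by the mean makes error rates comparable across the two collection periods even if the overall crime level shifted between them.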
[Objective] This paper constructs a topic drift index for trending network events, aiming to describe their changing topics. [Methods] We used the LDA model to extract topics of online trending events and analyzed their drifts with word weights. Then, we proposed procedures for constructing the topic drift index. Finally, we took the trending event “Gao Yixiang passed away” as the sample for an empirical analysis. [Results] In the early stage of our case, the number of topics increased from 11 to 18, and the topic drift index was 41%, which then fell to 22%. Finally, the number of topics was reduced to 5 and the topic drift index turned to -41%. [Limitations] The proposed method could not effectively generate early warnings for small numbers of topic changes or for multimedia contents. Nor can it detect changes in topic semantics. [Conclusions] The topic drift index for trending network events could predict the timing of online public opinion outbreaks and their recurrences.
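Drift analysis over word weights, as in the abstract above, can be illustrated with a simple stand-in measure. This sketch is hypothetical and is not the paper's index (which is signed and tied to topic counts): it scores drift between two adjacent time windows as one minus the cosine similarity of their topic-word weight vectors:

```python
import math

def drift_score(prev, curr):
    # prev/curr map words to LDA-style weights for adjacent windows.
    # 0 means identical word distributions, 1 means complete drift.
    words = set(prev) | set(curr)
    dot = sum(prev.get(w, 0.0) * curr.get(w, 0.0) for w in words)
    na = math.sqrt(sum(v * v for v in prev.values()))
    nb = math.sqrt(sum(v * v for v in curr.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

same = drift_score({"grief": 0.6, "actor": 0.4}, {"grief": 0.6, "actor": 0.4})
moved = drift_score({"grief": 1.0}, {"safety": 1.0})
```

Tracking such a score across consecutive windows gives a time series whose spikes mark moments when the public discussion shifts vocabulary, analogous to the outbreak timing the paper targets.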
[Objective] This paper tries to predict the usefulness of crowd testing reports with author attributes, text features, and image features. [Methods] First, we adopted deep learning techniques to extract text and image features from crowd testing reports. Then, we constructed a prediction model with a fully-connected neural network. Third, we trained the new model with 80% of the samples and different input combinations. Finally, we examined the model’s performance with the remaining samples. [Results] With the help of text or image features, the prediction accuracy of the model increased by 4.24% and 5.21%, respectively. Using both text and image features, the model’s prediction accuracy increased by 6.96%. [Limitations] The extracted text and image features were not readily interpretable; therefore, we cannot identify the specific features represented by each layer of the neural network. [Conclusions] The proposed model with text and image features can effectively predict the usefulness of crowd testing reports.
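The input-combination experiments above rest on early fusion: author, text, and image feature vectors are concatenated and passed through fully-connected layers. A one-unit sketch with toy values — the real model stacks several learned layers, and all numbers here are invented:

```python
import math

def predict_usefulness(features, weights, bias=0.0):
    # A single fully-connected unit with sigmoid activation:
    # weighted sum of the fused features squashed into (0, 1).
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Early fusion by concatenation: author + text + image features.
fused = [0.3, 0.8] + [0.6] + [0.9]
score = predict_usefulness(fused, [0.5, -0.2, 0.7, 0.4])
```

Dropping the text or image slice from `fused` (and the matching weights) reproduces the ablation style of the paper's "different input combinations" comparison.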
[Objective] This paper builds a dictionary for defective products, aiming to help users better understand the latest developments of specific domains. [Methods] First, we extracted domain-related phrases from the corpus using word frequency features. Then, we reduced manual labeling work with the help of the TF-IDF algorithm. Finally, we proposed a Convolutional Neural Network (CNN) model using semantic and position information to generate the domain dictionary. [Results] Compared with the statistical learning method, our model improved the accuracy, recall, and F1 values by 6%-9%. [Limitations] More research is needed to examine our method in other fields. [Conclusions] The proposed CNN-based method could effectively construct a dictionary for defective products.
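The TF-IDF pre-ranking step above — scoring candidate phrases so only the top scorers need manual labeling — can be sketched from scratch. The tokenized documents are toy values themed on defective products; the paper's corpus and phrase segmentation are of course larger:

```python
import math
from collections import Counter

def tfidf(docs):
    # Term frequency times inverse document frequency per document.
    # Terms occurring in every document score 0 and can be skipped
    # during manual labeling.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

scores = tfidf([["brake", "defect"], ["brake", "recall"]])
```

"brake" appears in both toy documents and scores 0, while "defect" and "recall" rank as candidate domain phrases, illustrating how the ranking concentrates annotator effort.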
[Objective] This paper explores the characteristics of a city portrait’s evolution based on visitors’ cognitive data with time attributes. [Methods] First, we chose the urban tourism industry as our research subject. Then, we developed a method using the LDA model and a multi-dimensional theme description framework for the city. Finally, we revealed the changing trends of the city portraits from three perspectives: the theme development process, and the theme evolution trends in the first and second feature dimensions. [Results] We examined our new model with Hong Kong and found that its urban tourism portraits showed no significant periodic changes. However, tourists’ perceptions of Hong Kong always had primary and secondary dimensions. Sightseeing, transportation, and entertainment were the main factors of tourists’ perceptions of Hong Kong. Specifically, sightseeing was the most important one throughout the entire process, entertainment mattered mainly in the early and late stages, and transportation was more prominent in the middle stage. We also found that each topic node in the evolutionary path had stable iconic features. [Limitations] We need to evaluate our method with other cities. [Conclusions] Our research will benefit urban planning and policy implementation.
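The stage-by-stage theme trends above (sightseeing throughout, entertainment early and late, transportation in the middle) come from aggregating topic-labelled visitor data per time window. A minimal sketch of that aggregation, assuming reviews have already been assigned an LDA topic label — the periods, labels, and counts are invented:

```python
from collections import Counter

def theme_trends(records):
    # records: (period, topic) pairs for topic-labelled reviews.
    # Returns each period's topic proportions, i.e. one slice of
    # the city portrait per time window.
    by_period = {}
    for period, topic in records:
        by_period.setdefault(period, Counter())[topic] += 1
    return {p: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for p, cnt in by_period.items()}

trends = theme_trends([("2018", "sightseeing"), ("2018", "transport"),
                       ("2019", "sightseeing")])
```

Comparing the proportion of each theme across successive periods yields the kind of rise-and-fall trajectory the paper reports for entertainment and transportation.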