
Highlights

  • Li Ying, Li Ming
    Data Analysis and Knowledge Discovery. 2024, 8(10): 89-99. https://doi.org/10.11925/infotech.2096-3467.2023.0683

    [Objective] This paper proposes a recommendation method for supplementary question-and-answer (Q&A) based on a multi-label, multi-document Q&A classification model enhanced by transfer learning. It aims to identify and recommend supplementary answers in online Q&A communities. [Methods] We introduced new features alongside existing ones to classify the supplementary relationships between questions and answers. Then, we established a transfer learning-enhanced multi-label, multi-document classification model to identify and recommend supplementary answers. [Results] We conducted three meta-tasks on real datasets from the Zhihu community. The proposed method improves precision, recall, and F1 score by 48.29%, 15.75%, and 32.53%, respectively, on average. [Limitations] The method was only applied to health-related Q&A topics in Zhihu and has yet to be validated across different platforms or topics. [Conclusions] The proposed recommendation method effectively recommends supplementary answers. It helps users in Q&A communities obtain more comprehensive answers and promotes knowledge utilization within the community.
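
The gains above are reported as precision, recall, and F1 over multi-label predictions. As a minimal sketch (the label names are illustrative, not the paper's), micro-averaged versions of these metrics can be computed over predicted label sets like so:

```python
def multilabel_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical supplementary-relation labels for three answers.
gold = [{"supplement"}, {"supplement", "detail"}, set()]
pred = [{"supplement"}, {"detail"}, {"supplement"}]
p, r, f = multilabel_prf(gold, pred)
```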

  • He Jun, Yu Jianjun, Rong Xiaohui
    Data Analysis and Knowledge Discovery. 2024, 8(10): 136-145. https://doi.org/10.11925/infotech.2096-3467.2023.0645

    [Objective] This paper aims to ensure the objectivity, timeliness, and accuracy of the overall budget performance evaluation of research institutions, and to improve the efficiency of performance evaluation work. [Methods] We proposed a method for predicting research institutions’ overall budget performance evaluation based on LightGBM. Our method integrates various data from scientific research management information systems. It uses machine learning algorithms to analyze and predict the overall budget performance evaluation results by correlating research inputs and outputs with performance. [Results] In the application of the overall budget performance evaluation of research institutions, the accuracy of the proposed method reached 94.12%. The human resources required for the budget performance evaluation process were reduced from 10 people to 5, and the time cost was shortened from 38 days to about 10 days. [Limitations] Some performance evaluation indicators are subjective and difficult to quantify using business data from scientific research management information systems. [Conclusions] The proposed method has excellent performance in predicting overall budget performance evaluation results. It reduces the fairness issues due to subjective evaluation, and saves human resources and time costs in budget performance evaluation, thus improving its overall efficiency.
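
LightGBM is a gradient-boosting framework. As a rough illustration of the underlying idea only (not the paper's actual pipeline, features, or library), here is a toy gradient-boosting loop with one-feature decision stumps on a hypothetical "research output count → pass/fail evaluation" dataset:

```python
def fit_stump(x, residual):
    """Best single threshold split on one feature (squared error)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(x, y, rounds=20, lr=0.3):
    """Gradient boosting on squared loss with decision stumps."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, resid)
        stumps.append((t, lm, rm))
        pred = [p + lr * (lm if xi <= t else rm)
                for p, xi in zip(pred, x)]
    return stumps

def predict(stumps, xi, lr=0.3):
    return sum(lr * (lm if xi <= t else rm) for t, lm, rm in stumps)

# Toy data: hypothetical output counts vs. pass(1)/fail(0) outcome.
x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
stumps = boost(x, y)
preds = [round(predict(stumps, xi)) for xi in x]
```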

  • Shi Bin, Wang Hao, Liu Maolin, Deng Sanhong
    Data Analysis and Knowledge Discovery. 2024, 8(10): 146-158. https://doi.org/10.11925/infotech.2096-3467.2023.0688

    [Objective] This study aims to construct a Chinese Ceramic Image Description Model (CCI-ClipCap) to provide technical support for ceramic culture research and digital preservation. [Methods] Based on ClipCap, the prompt paradigm is introduced to improve the model’s understanding of cross-modal data, enabling automatic description of ceramic images. Additionally, we proposed a text similarity evaluation method tailored for structured textual representation. [Results] The CCI-ClipCap model improved the multi-modal fusion process with the prompt paradigm, effectively extracting information from ceramic images and generating accurate textual descriptions. Compared to baseline models, the Bleu and Rouge values increased by 0.04 and 0.14, respectively. [Limitations] The data used originated from the British Museum collections, not native Chinese datasets. This single-source data may affect the model’s performance. [Conclusions] The CCI-ClipCap model generates text with rich levels of expression, demonstrating a solid understanding of ceramic knowledge and a high degree of professionalism.
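
The paper does not specify its structured-text similarity formula, but the general idea of comparing structured descriptions field by field can be sketched as follows (the field names and `difflib` string matcher are stand-ins, not the paper's method):

```python
from difflib import SequenceMatcher

def structured_similarity(ref, hyp, weights=None):
    """Weighted average of per-field string similarity for structured
    descriptions; fields missing from either record score 0."""
    fields = set(ref) | set(hyp)
    weights = weights or {f: 1.0 for f in fields}
    total = sum(weights.get(f, 1.0) for f in fields)
    score = 0.0
    for f in fields:
        if f in ref and f in hyp:
            ratio = SequenceMatcher(None, ref[f], hyp[f]).ratio()
            score += weights.get(f, 1.0) * ratio
    return score / total if total else 0.0

# Hypothetical structured ceramic descriptions.
ref = {"shape": "vase", "glaze": "celadon", "dynasty": "Song"}
hyp = {"shape": "vase", "glaze": "celadon", "dynasty": "Ming"}
sim = structured_similarity(ref, hyp)
```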

  • Hu Wei, Li Shuying, Zhang Xin, Yang Ning
    Data Analysis and Knowledge Discovery. 2024, 8(10): 28-43. https://doi.org/10.11925/infotech.2096-3467.2024.0737

    [Objective] This study optimizes a link prediction model in the patent citation network to enhance the analysis and prediction of technological evolution. It also further improves theories and methods related to technology diffusion. [Methods] We constructed a new framework for link prediction modeling (Graph-PatentBERT-RF) based on the characteristics of patent literature. First, we used the GraphSAGE model to obtain the vectorized representation of the training set’s patent citation network, while the PatentBERT model provided semantic representation vectors of patent texts in four thematic dimensions. Then, these vectors were combined with other features to train a random forest model. Finally, we obtained the optimized link prediction probabilities in the patent citation network. [Results] An empirical study in quantum sensing demonstrated that the Graph-PatentBERT-RF model achieves optimal comprehensive prediction performance, with an F1-score over 2.2% higher than the baseline models. Our model also illustrated the nonlinear relationships and complex interactions across more than four levels among citation relationships, multidimensional technical text, and time lag features. [Limitations] The data preprocessing steps need further optimization to improve the model's performance. [Conclusions] The constructed model enhances the overall predictive performance of patent citation networks, providing an optimized solution to the current issue of incomplete citation data, and contributes to the development of various applications in technology evolution analysis based on citation networks.
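
GraphSAGE's core operation aggregates neighbor features into a node representation, which here is then combined with text vectors as input features. A toy sketch of one (unlearned) mean-aggregation step and the feature concatenation, with illustrative values rather than the paper's data:

```python
def sage_mean_layer(features, neighbors, node):
    """One GraphSAGE-style layer: concatenate a node's own features
    with the element-wise mean of its neighbors' features (the learned
    weight matrices of real GraphSAGE are omitted)."""
    self_vec = features[node]
    nbrs = neighbors.get(node, [])
    if nbrs:
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(self_vec))]
    else:
        mean = [0.0] * len(self_vec)
    return self_vec + mean

# Toy citation graph: patent A cites B and C.
features = {"A": [1.0, 0.0], "B": [0.0, 2.0], "C": [2.0, 2.0]}
neighbors = {"A": ["B", "C"]}

graph_vec = sage_mean_layer(features, neighbors, "A")
text_vec = [0.3, 0.7]                 # stand-in for a PatentBERT vector
link_features = graph_vec + text_vec  # one input row for the forest
```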

  • Hu Tianyi, Liu Jianhua, E Haihong, Ding Junpeng, Qiao Xiaodong
    Data Analysis and Knowledge Discovery. 2024, 8(10): 125-135. https://doi.org/10.11925/infotech.2096-3467.2023.0937

    [Objective] This study explores feature models for the automated detection of articles by “paper mills” across multiple dimensions. It aims to provide critical support for the governance of research integrity and quality control of academic publishing in China. [Methods] We retrieved retraction records and associated data resources of “paper mills” articles from websites like Retraction Watch to construct the first open dataset for training and evaluating the automated detection model for paper mills. We developed a classification model for “paper mill” papers (RWTA-Model) using a text random walk strategy and text attention mechanism. We also modeled 33 grammatical features of “paper mill” articles. Finally, we used the SHAP method to identify significant features automatically. [Results] The F1 scores based on title structure features, abstract structure features, and main text structure features reached 0.7669, 0.8423, and 0.8480, respectively. For the three types of article structure data, the proposed method achieved the best results when compared to various baseline methods and identified 12 significant grammatical features. [Limitations] The supporting feature construction dataset primarily focuses on the biomedical field, presenting a potential risk of domain bias. [Conclusions] The constructed classification model based on title, abstract, and main text structures, and the 33-dimensional automatic detection feature model, can effectively identify “paper mill” papers and uncover multidimensional features, supporting the automated detection of papers from paper mills.
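
SHAP is based on Shapley values, which attribute a model's output across its input features. For a tiny model the exact values can be computed by enumerating feature coalitions; the features and the additive "paper-mill score" below are toy stand-ins, not the paper's model:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for value(coalition) -> model output."""
    n = len(features)
    phi = {}
    for f in features:
        others = [x for x in features if x != f]
        total = 0.0
        for k in range(n):
            for coal in combinations(others, k):
                # Shapley coalition weight: k!(n-k-1)!/n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(coal) | {f}) - value(set(coal)))
        phi[f] = total
    return phi

# Toy additive score: each present feature adds a fixed weight.
weights = {"template_title": 0.5, "tortured_phrase": 0.3, "odd_email": 0.2}
value = lambda coal: sum(weights[f] for f in coal)

phi = shapley_values(list(weights), value)
```

For an additive model like this one, each feature's Shapley value equals its own weight, which makes the sketch easy to check.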

  • He Guoxiu, Ren Jiayu, Li Zongyao, Lin Chenxi, Yu Haiyan
    Data Analysis and Knowledge Discovery. 2024, 8(4): 1-13. https://doi.org/10.11925/infotech.2096-3467.2023.0684

    [Objective] This study explores whether content-based deep detection models can identify the semantics of rumors. [Methods] First, we used the BERT model to identify the key features of rumors from benchmark datasets in Chinese and English. Then, we utilized two interpretability tools, LIME, based on local surrogate models, and SHAP, based on cooperative game theory, to analyze whether these features can reflect the nature of rumors. [Results] The key features calculated by the interpretability tools on different models and datasets showed significant differences, and it is challenging to determine the semantic relationship between the features and rumors. [Limitations] The datasets and models examined in this study need to be expanded. [Conclusions] Deep learning-based rumor detection models only work with the features of the training set and lack sufficient generalization and interpretability for diverse real-world scenarios.
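
LIME probes a black-box classifier locally by perturbing the input and weighting perturbed samples by proximity. A much-simplified sketch of that idea (real LIME fits a weighted linear surrogate; the tokens, kernel, and "rumor scorer" here are toys):

```python
import random

def local_importance(tokens, score, n_samples=200, seed=0):
    """LIME-style probe: sample token subsets, weight each sample by
    its size (a toy proximity kernel), and estimate each token's
    importance as the weighted score difference between samples that
    contain it and samples that lack it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        subset = frozenset(t for t in tokens if rng.random() < 0.5)
        weight = len(subset) / len(tokens)
        samples.append((subset, weight, score(subset)))
    importance = {}
    for t in tokens:
        w_in = [(w, s) for sub, w, s in samples if t in sub]
        w_out = [(w, s) for sub, w, s in samples if t not in sub]
        mean = lambda ws: (sum(w * s for w, s in ws)
                           / sum(w for w, _ in ws) if ws else 0.0)
        importance[t] = mean(w_in) - mean(w_out)
    return importance

# Toy "rumor score" that fires on one sensational token.
score = lambda sub: 1.0 if "shocking" in sub else 0.0
imp = local_importance(["shocking", "news", "today"], score)
```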

  • Qi Xiaoying, Li Hanyu, Yang Haiping
    Data Analysis and Knowledge Discovery. 2024, 8(4): 76-87. https://doi.org/10.11925/infotech.2096-3467.2023.0081

    [Objective] This paper aims to achieve multi-semantic classification of maps and meet the needs for precise map retrieval and intelligence analysis. [Methods] We designed a map category system and proposed a multi-label map classification strategy. It realized the automatic classification of South China Sea maps based on the AlexNet convolutional neural network classification model. [Results] The F1 value of the proposed model is 0.979. This model can effectively realize the multi-label automatic classification of the South China Sea maps. [Limitations] The deep categories of multi-label annotated datasets need to be supplemented. [Conclusions] This paper provides a reference for the semantic-based scientific classification of maps, precise retrieval, and cross-category association.

  • Zhu Yujing, Chen Fang, Wang Xuezhao
    Data Analysis and Knowledge Discovery. 2024, 8(10): 1-13. https://doi.org/10.11925/infotech.2096-3467.2023.0699

    [Objective] In response to Western technology export controls on China, this study proposes a method for identifying critical core technologies by mapping the U.S. Commerce Control List (CCL) to a patent-based dual-layer network. The goal is to provide a reference for selecting and prioritizing technology breakthrough directions. [Methods] The study integrates the CCL and patent data to build a dual-layer network consisting of a CCL-related network and a weighted patent citation network. We used a community detection algorithm to identify technology clusters in both layers and calculated the semantic similarity of inter-layer clusters to achieve automatic mapping. Using Word2Vec and the n-gram method, we extracted keywords from each cluster to represent technical topics. Finally, we identified the patent clusters with the highest similarity to the CCL clusters as critical core technologies. [Results] Empirical results in industrial software demonstrate that this method identifies 12 distinct patent clusters with the highest similarity to the CCL clusters, all of which have a similarity of over 0.85. They involve integrated circuit IP cores, precision measurement, process control, motion control, and turbine detection. Literature research has verified them as key core technologies in industrial software. [Limitations] The study only focused on industrial software for empirical research. The technical approach can be improved, and the identification results require further interpretation and analysis. [Conclusions] The proposed method efficiently and accurately identifies key core technologies at the micro-level, features a high degree of automation, and is highly readable, providing significant practical application value.
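
The inter-layer mapping step (matching each CCL cluster to its most similar patent cluster) can be sketched as cosine similarity between cluster centroids; the two-dimensional vectors and cluster names below are illustrative stand-ins for the paper's Word2Vec keyword embeddings:

```python
from math import sqrt

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy keyword embeddings per cluster.
ccl_cluster = [[1.0, 0.0], [0.9, 0.1]]
patent_clusters = {
    "motion_control": [[0.8, 0.2], [1.0, 0.0]],
    "turbine_detection": [[0.0, 1.0], [0.1, 0.9]],
}

c = centroid(ccl_cluster)
scores = {name: cosine(c, centroid(vecs))
          for name, vecs in patent_clusters.items()}
best = max(scores, key=scores.get)  # patent cluster mapped to this CCL cluster
```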

  • Zhang Jinzhu, Sun Wenwen, Qiu Mengmeng
    Data Analysis and Knowledge Discovery. 2024, 8(10): 14-27. https://doi.org/10.11925/infotech.2096-3467.2023.0724

    [Objective] This study aims to expand the heterogeneous network in citation recommendations by including more nodes and relationships. It seeks to provide deep semantic representations and reveal how different relationships impact citation recommendations, ultimately improving the effectiveness of such recommendations. [Methods] By introducing semantic links, we constructed a heterogeneous network representation learning model incorporating an attention mechanism. This model generates deep semantic and structural representations, as well as similarity metrics for citation recommendations. We also conducted ablation experiments to explore the impact of different factors on citation recommendation. [Results] After introducing semantic links, the citation recommendation model’s AUC improved by 0.012. With the addition of a dual-layer attention mechanism, there was a further improvement of 0.079 in AUC. Compared to the baseline model CR-HBNE, the AUC and AP improved by 0.185 and 0.204, respectively. [Limitations] Manual selection of relationship paths is inefficient, and evaluating the recommendation results based on only two metrics is relatively simplistic. [Conclusions] The proposed method fully utilizes the complex associations and deep semantic information among citations, effectively improving citation recommendation performance.
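
The AUC figures above measure how well the model ranks true citation links above non-links. A minimal pairwise computation of AUC (labels and scores are illustrative):

```python
def auc(labels, scores):
    """AUC as the probability that a positive example outranks a
    negative one, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy link labels (1 = true citation) and predicted link scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.3]
a = auc(labels, scores)
```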

  • Xu Haoshuai, Hong Liang, Hou Wenjun
    Data Analysis and Knowledge Discovery. 2024, 8(10): 66-76. https://doi.org/10.11925/infotech.2096-3467.2023.0973

    [Objective] This paper addresses the challenge of constructing label mapping in prompt learning-based relation extraction methods when labeled data is scarce. [Methods] The proposed approach enhances prompt effectiveness by injecting relational semantics into the prompt template. Data augmentation is performed through prompt ensemble, and an instance-level attention mechanism is used to extract important features during the prototype construction process. [Results] On the public FewRel dataset, the accuracy of the proposed method surpasses the baseline model by 2.13%, 0.55%, 1.40%, and 2.91% in four few-shot test scenarios, respectively. [Limitations] The method does not utilize learnable virtual prompt templates in constructing prompt templates, and there is still room for improvement in the representation of answer words. [Conclusions] The proposed method effectively mitigates the problem of limited information and insufficient accuracy in prototype construction under few-shot scenarios, improving the model’s accuracy in few-shot relation extraction tasks.
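
The prototype-construction step with instance-level attention can be sketched as an attention-weighted mean of support-instance embeddings, with queries classified by the nearest prototype. The embeddings are toys, and using a softmax over query-support dot products as the attention is an assumption, not necessarily the paper's exact mechanism:

```python
from math import exp, dist

def attention_prototype(support, query):
    """Prototype = softmax(dot(support_i, query))-weighted mean of the
    support embeddings, so instances similar to the query count more."""
    sims = [sum(s * q for s, q in zip(vec, query)) for vec in support]
    m = max(sims)
    ws = [exp(s - m) for s in sims]
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(support[0])
    return [sum(w * vec[i] for w, vec in zip(ws, support))
            for i in range(dim)]

def classify(query, support_by_relation):
    protos = {rel: attention_prototype(vecs, query)
              for rel, vecs in support_by_relation.items()}
    return min(protos, key=lambda rel: dist(query, protos[rel]))

# Toy few-shot support sets for two relations.
support = {
    "founder_of": [[1.0, 0.1], [0.9, 0.0]],
    "located_in": [[0.0, 1.0], [0.1, 0.9]],
}
pred = classify([0.95, 0.05], support)
```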

  • Duan Yufeng, Zhang Meicong, Liu Yanzuo, He Guoxiu
    Data Analysis and Knowledge Discovery. 2024, 8(10): 100-111. https://doi.org/10.11925/infotech.2096-3467.2023.0665

    [Objective] This study aims to investigate the effectiveness of using phonetics and orthography features to enhance the representation of Chinese characters. [Methods] Based on the Named Entity Recognition (NER) task, we used a general embedding module, a bidirectional LSTM module, and a fully connected network with Softmax activation as the benchmark embedding layer, context encoding layer, and decoding layer, respectively. Then, we compared the changes in Micro-F1 scores and entity-specific F1 scores after enhancing character embeddings with Chinese pinyin, images, Wubi input codes, Four-Corner codes, Cangjie codes, and radicals, using datasets such as MSRA, PeopleDaily, CCKS2017, Resume, and E-Commerce. [Results] Using phonetic and orthographic enhanced embeddings led to a performance decrease of nearly 0.01 on the MSRA and PeopleDaily datasets, while there was no statistically significant change in performance on the CCKS2017, Resume, and E-Commerce datasets. [Limitations] Using only 32×32-pixel images of simplified Chinese characters may affect the extraction of orthographic features. [Conclusions] While phonetic and orthographic features can enhance the representation of Chinese characters, they also introduce noise. They lead to varying impacts on model performance across different corpora and entities.
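
The enhancement step amounts to concatenating extra feature vectors onto each character's base embedding. A minimal sketch, with toy values and hypothetical feature dimensions:

```python
def enhanced_embedding(char, base, extras):
    """Concatenate a character's base embedding with optional phonetic
    or orthographic feature vectors, padding with zeros when a feature
    table has no entry for the character."""
    vec = list(base[char])
    for table, dim in extras:
        vec.extend(table.get(char, [0.0] * dim))
    return vec

# Toy feature tables (values illustrative).
base = {"马": [0.2, 0.8]}
pinyin = ({"马": [0.5]}, 1)        # stand-in pronunciation feature
radical = ({"马": [1.0, 0.0]}, 2)  # stand-in radical feature
vec = enhanced_embedding("马", base, [pinyin, radical])
```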

  • Wu Shufang, Wang Hongbin, Zhu Jie, Chen Ting
    Data Analysis and Knowledge Discovery. 2024, 8(10): 77-88. https://doi.org/10.11925/infotech.2096-3467.2023.0703

    [Objective] This paper aims to expand social short texts by leveraging heterogeneous relationships in social networks. It addresses the issues of fragmentation and the use of internet slang in social short texts. [Methods] First, we measured the unevenness of hotspot words in social information based on dispersion, which improved the TF-IDF method to obtain initial features. Then, we constructed a two-layer heterogeneous social network consisting of three sub-networks based on the heterogeneous relationships in social networks. Finally, the importance of users, text similarity, and user recognition of social texts were quantified to obtain multiple extended sources and expand social short texts. [Results] Compared with the existing short text feature expansion methods, the proposed model’s precision, recall, and F1 value improved by about 13%, 19%, and 18%, respectively. [Limitations] We did not consider the influence of indirect relationships on the construction of heterogeneous social networks. [Conclusions] Using the heterogeneous relationships in social networks can obtain more reasonable expansion sources and effectively expand social short texts.
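
The first step weights TF-IDF by a dispersion measure of hotspot words. The paper's exact formula is not given here; the sketch below assumes dispersion is the coefficient of variation of a term's count across documents and scales TF-IDF by (1 + dispersion):

```python
from math import log

def tfidf_with_dispersion(docs):
    """TF-IDF scaled by (1 + dispersion), where dispersion is the
    coefficient of variation of a term's count across documents, so
    unevenly distributed hotspot words score higher. The exact
    weighting in the paper may differ; this is a sketch."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    df = {t: sum(t in d for d in docs) for t in vocab}
    counts = {t: [d.count(t) for d in docs] for t in vocab}

    def dispersion(t):
        c = counts[t]
        mean = sum(c) / n
        if mean == 0:
            return 0.0
        var = sum((x - mean) ** 2 for x in c) / n
        return (var ** 0.5) / mean

    scores = []
    for d in docs:
        row = {}
        for t in set(d):
            tf = d.count(t) / len(d)
            idf = log(n / df[t])
            row[t] = tf * idf * (1.0 + dispersion(t))
        scores.append(row)
    return scores

# Toy tokenized posts.
docs = [["ai", "model", "ai"], ["model", "data"], ["slang", "ai"]]
scores = tfidf_with_dispersion(docs)
```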

  • Cheng Quan, Jiang Shihui, Li Zhuozhuo
    Data Analysis and Knowledge Discovery. 2024, 8(10): 112-124. https://doi.org/10.11925/infotech.2096-3467.2023.0638

    [Objective] This paper aims to achieve semantic discovery and relation extraction from a large amount of complex user-generated information from an online healthcare platform. [Methods] First, we constructed a semantic discovery model for online health information based on an improved CasRel model. Then, we introduced the ERNIE-Health pre-trained model, which is more suitable for the healthcare domain, into the text encoding layer of the CasRel-based model. Finally, we used a multi-level pointer network in the entity and relation decoding layer, annotating subjects and fusing subject features into the relation and object decoding networks. [Results] Compared to the original model, the improved CasRel entity-relation extraction model increased the F1-scores of entity recognition and entity-relation extraction tasks for online health information semantic discovery by 7.62% and 4.87%, respectively. [Limitations] The overall effectiveness of the model still needs to be validated with larger datasets and empirical studies on health information from different disease types. [Conclusions] Three sets of comparative experiments validated the effectiveness of the improved CasRel entity-relation extraction model for online diabetes health information semantic discovery tasks.
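
A pointer network tags entity spans with per-token start and end probabilities; decoding pairs each start above a threshold with the nearest following end. A minimal sketch of that decode step, with toy scores (CasRel additionally conditions the object tagger on the extracted subject, which is omitted here):

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pointer-network span decoding: each start position above the
    threshold is paired with the nearest end position at or after it
    that is also above the threshold."""
    spans = []
    for i, sp in enumerate(start_probs):
        if sp < threshold:
            continue
        for j in range(i, len(end_probs)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans

# Toy token-level scores for one relation's object tagger.
start = [0.1, 0.9, 0.2, 0.7, 0.1]
end   = [0.0, 0.3, 0.8, 0.1, 0.9]
spans = decode_spans(start, end)
```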

  • Huang Taifeng, Ma Jing
    Data Analysis and Knowledge Discovery. 2024, 8(3): 77-84. https://doi.org/10.11925/infotech.2096-3467.2023.0004

    [Objective] This paper aims to address the low accuracy of sentiment classification with pre-trained models when samples are insufficient. [Methods] We proposed a prompt-learning-enhanced sentiment classification algorithm, Pe (Prompt Ensemble)-RoBERTa. It modified the RoBERTa model with integrated prompts, unlike traditional fine-tuning methods, enabling the model to understand the downstream tasks and extract the text’s sentiment features. [Results] We examined the model on several publicly accessible Chinese and English datasets. The average sentiment classification accuracy of the model reached 93.2% with fewer samples. Compared with fine-tuned and discrete prompts, our new model’s accuracy improved by 13.8% and 8.1%, respectively. [Limitations] The proposed model only processes texts for the sentiment dichotomization tasks. It did not involve the more fine-grained sentiment classification tasks. [Conclusions] The Pe-RoBERTa model can extract text sentiment features and achieve high accuracy in sentiment classification tasks.
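
Prompt ensembling reframes classification as filling a [MASK] slot in several templates and averaging each label word's score across them. A toy sketch of the voting logic, where `score` is a stand-in for a masked language model's probability (templates and label words are illustrative, not the paper's):

```python
def ensemble_predict(text, templates, label_words, score):
    """Prompt-ensemble voting: sum each label word's mask score over
    all templates and pick the best-scoring label."""
    totals = {label: 0.0 for label in label_words}
    for tpl in templates:
        prompt = tpl.format(text=text)
        for label, word in label_words.items():
            totals[label] += score(prompt, word)
    return max(totals, key=totals.get)

templates = ["{text} Overall it was [MASK].",
             "{text} I felt [MASK] about it."]
label_words = {"positive": "great", "negative": "terrible"}

# Toy stand-in scorer: favors "great" when the review contains "love".
score = lambda prompt, word: (0.9 if ("love" in prompt) == (word == "great")
                              else 0.1)
pred = ensemble_predict("I love this phone.", templates, label_words, score)
```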

  • Zhao Jiayi, Xu Yuemei, Gu Hanwen
    Data Analysis and Knowledge Discovery. 2024, 8(10): 44-53. https://doi.org/10.11925/infotech.2096-3467.2023.0714

    [Objective] This study addresses the performance degradation due to catastrophic forgetting when multilingual models handle tasks in new languages. [Methods] We proposed a multilingual sentiment analysis model, mLMs-EWC, based on continual learning. The model incorporates continual learning into multilingual models, enabling it to learn new language features while retaining the linguistic characteristics of previously learned languages. [Results] In continual sentiment analysis experiments involving three languages, the mLMs-EWC model outperformed the Multi-BERT model by approximately 5.0% in French and 4.5% in English tasks. Additionally, the mLMs-EWC model was evaluated on a lightweight distilled model, showing an improvement of up to 24.7% in English tasks. [Limitations] This study focuses on three widely used languages, and further validation is needed to assess the model’s generalization capability to other languages. [Conclusions] The proposed model can alleviate catastrophic forgetting in multilingual sentiment analysis tasks and achieve continual learning on multilingual datasets.
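
EWC mitigates catastrophic forgetting by adding a quadratic penalty that anchors parameters important to previously learned languages, loss = task_loss + (λ/2)·Σᵢ Fᵢ(θᵢ − θᵢ*)². A minimal sketch of the penalty term with toy values:

```python
def ewc_penalty(theta, theta_old, fisher, lam):
    """Elastic Weight Consolidation penalty:
    (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2,
    where F_i estimates how important parameter i was to old tasks."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_old, fisher))

theta_old = [0.5, -1.0, 2.0]   # parameters after the previous language
fisher    = [4.0,  0.1, 1.0]   # toy Fisher-information importances
theta     = [1.0, -1.0, 1.0]   # current parameters on the new language

penalty = ewc_penalty(theta, theta_old, fisher, lam=2.0)
```

The important first parameter (F = 4.0) is penalized far more for moving than the unimportant second one, which is the mechanism that preserves earlier languages.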

  • Yu Bengong, Cao Chengwei
    Data Analysis and Knowledge Discovery. 2024, 8(10): 54-65. https://doi.org/10.11925/infotech.2096-3467.2023.0722

    [Objective] This paper aims to address the problem in current aspect-based sentiment analysis research, where the use of sentiment knowledge to enhance syntactic dependency graphs overlooks syntactic reachability and positional relationships between words and does not adequately extract semantic information. [Methods] We proposed an aspect-based sentiment analysis model based on a position-weighted reachability matrix and multi-space semantic information extraction. First, we used a reachability matrix to incorporate syntactic reachability relationships between words into the syntactic dependency graph, and we employed position-weighting to adjust the matrix to enhance contextual feature extraction. Then, we integrated the sentiment features with the enhanced dependency graph to extract aspect word features. Third, we used the multi-head self-attention mechanism combined with a graph convolutional network (GCN) to learn contextual semantic information from multiple feature spaces. Finally, we fused feature vectors containing positional information, syntactic information, affective knowledge, and semantic information for sentiment polarity classification. [Results] Compared to the best-performing models, the proposed model improved accuracy on the Lap14, Rest14, and Rest15 datasets by 1.00%, 1.25%, and 0.76%. When using BERT, the PRM-GCN-BERT model’s accuracy on the Lap14, Rest14, Rest15, and Rest16 datasets increased by 0.50%, 0.22%, 1.98%, and 0.31%. [Limitations] The proposed model was not applied to Chinese or other language datasets. [Conclusions] The proposed model enhances feature aggregation in graph convolutional networks, improves contextual feature extraction, and boosts semantic learning effectiveness, thereby significantly improving the accuracy of aspect-based sentiment analysis.
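
The reachability matrix can be built as the transitive closure of the dependency edges (Floyd-Warshall style), after which each reachable pair is down-weighted by token distance. The exponential decay used below is an illustrative choice, not necessarily the paper's weighting:

```python
def position_weighted_reachability(n, edges, decay=0.5):
    """Transitive closure of an undirected dependency graph, then each
    reachable pair (i, j) is weighted by decay**|i - j| so that
    syntactically reachable but distant words contribute less."""
    reach = [[float(i == j) for j in range(n)] for i in range(n)]
    for i, j in edges:
        reach[i][j] = reach[j][i] = 1.0
    for k in range(n):          # Floyd-Warshall-style closure
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1.0
    return [[reach[i][j] * decay ** abs(i - j) for j in range(n)]
            for i in range(n)]

# Toy 3-token sentence with dependency edges 0-1 and 1-2.
m = position_weighted_reachability(3, [(0, 1), (1, 2)])
```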

  • Li Xuesi, Zhang Zhixiong, Wang Yufei, Liu Yi
    Data Analysis and Knowledge Discovery. 2024, 8(1): 1-15. https://doi.org/10.11925/infotech.2096-3467.2023.1280

    [Objective] Domain knowledge evolution analysis has been a long-standing research topic in the field of Library and Information Science. This paper provides a comprehensive review of the research methods related to domain knowledge evolution analysis, both nationally and internationally, aiming to offer valuable references for future studies in this area. [Coverage] We conducted searches in CNKI and Web of Science using keywords related to domain knowledge evolution. The search results were manually evaluated and analyzed, and a total of 84 key publications closely related to the methods of domain knowledge evolution analysis were selected for review. [Methods] By reviewing the research literature, we clarified the relevant concepts of domain knowledge evolution. Based on this, we classified the existing domain knowledge evolution analysis methods into three categories: citation-based, structure-based, and content-based. For each category, we first elucidated the theoretical basis, then explained their basic analytical frameworks and highlighted the relevant advances. Finally, we summarized the existing methods of domain knowledge evolution analysis and provided perspectives. [Results] The three categories of existing methods for domain knowledge evolution analysis rely on their respective scientific theories. With the advancement of technology and the improvement of data resources, these methods are continuously deepening and improving the analytical framework for the study of evolution. Although significant research achievements have been made, there has been no breakthrough in the research perspective of knowledge evolution analysis, and the limitations within the current research paradigm remain unresolved. [Limitations] The review analysis was based on selected literature, which may not have comprehensively covered all relevant research. [Conclusions] Based on the summary and analysis of the current research, we believe that the following two directions are worth focusing on in future research on domain knowledge evolution analysis: first, exploring new entry points for domain knowledge evolution analysis, and second, attempting to integrate existing research methods to improve the limitations of current analytical approaches.

  • Fu Yun, Zhu Liya, Li Dan, Sun Mengge, Zhang Jianfeng, Liu Xiwen
    Data Analysis and Knowledge Discovery. 2024, 8(1): 30-39. https://doi.org/10.11925/infotech.2096-3467.2023.0867

    [Objective] This study addresses the unified representation issue of experimental operation verbs in synthetic experiment protocols, which provides high-quality experimental protocol data for science intelligence and robotics. [Methods] We utilized a collaborative approach driven by data and expert knowledge to identify and standardize experimental operation verbs from literature and patent texts related to synthesis. First, we used advanced open-source large models like ChatGLM2-6B to identify experimental operation verbs. Then, we combined Wu-Palmer and cosine similarity to standardize these verbs. Finally, we assessed their classification accuracy with expert knowledge. [Results] The study identified 149 operation verbs for inorganic synthetic experiments and 141 operation verbs for organic synthetic experiments. Expert judgment revealed that many of the 124 operation terms appearing in both groups do not possess distinct category characteristics. Therefore, we merged the two categories, yielding 166 experimental operation verbs representing the operations in organic, inorganic, and hybrid synthesis experiments. [Limitations] The study only employed basic prompt engineering techniques to direct the large model to recognize experimental operation verbs from publicly accessible datasets. This study focused on operation terms involved in synthesis, engineering, and basic steps without considering operation terms in dynamic, analytical, and name reactions. [Conclusions] This study establishes a unified language for representing experimental operations in synthesis, applicable to organic, inorganic, and hybrid synthesis reactions. It could inform the future development of scientific robotics experiments.
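
Wu-Palmer similarity scores two concepts by the depth of their lowest common subsumer in a taxonomy: wup(a, b) = 2·depth(LCS) / (depth(a) + depth(b)). A minimal sketch over a toy parent-pointer taxonomy of operation verbs (the taxonomy is illustrative, not the paper's):

```python
def wu_palmer(parent, a, b):
    """Wu-Palmer similarity over a parent-pointer taxonomy:
    2 * depth(LCS) / (depth(a) + depth(b)), with root depth = 1."""
    def path(n):
        out = [n]
        while n in parent:
            n = parent[n]
            out.append(n)
        return out[::-1]          # root ... node
    pa, pb = path(a), path(b)
    lcs_depth = sum(x == y for x, y in zip(pa, pb))
    return 2.0 * lcs_depth / (len(pa) + len(pb))

# Toy taxonomy of operation verbs.
parent = {"heat": "operate", "reflux": "heat", "stir": "operate",
          "operate": "action"}
sim = wu_palmer(parent, "reflux", "stir")
```

Here "reflux" and "stir" share the subsumer "operate" at depth 2, with node depths 4 and 3, giving 2·2/(4+3) = 4/7.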