Data Analysis and Knowledge Discovery

Current Issue, Volume 7 Issue 1

Cross-Lingual Sentiment Analysis: A Survey

Xu Yuemei, Cao Han, Wang Wenqing, Du Wanze, Xu Chengyang

2023, 7 (1): 1-21. DOI: 10.11925/infotech.2096-3467.2022.0472

[Objective] This paper reviews the research landscape of cross-lingual sentiment analysis (CLSA). [Coverage] We searched “TS=cross lingual sentiment OR cross lingual word embedding” in the Web of Science database and chose 90 representative papers for this review. [Methods] We examined the following CLSA methods in detail: (1) early mainstream CLSA methods, including those based on machine translation and its improved variants, parallel corpora, or bilingual sentiment lexicons; (2) CLSA based on cross-lingual word embedding; (3) CLSA based on Multi-BERT and other pre-trained models. [Results] We analyzed their main ideas, methodologies, and shortcomings, and attempted to summarize their language coverage, datasets, and performance. Although pre-trained models such as Multi-BERT have achieved good performance in zero-shot cross-lingual sentiment analysis, challenges such as language sensitivity remain. Early CLSA methods still offer insights for current research. [Limitations] Some CLSA models are hybrid models, and we classified them according to their main method. [Conclusions] We discuss the future development of CLSA and the challenges facing the research area. As research on the multilingual semantics of pre-trained models deepens, CLSA models that cover more and more diverse languages will be the future direction.

Responsible Research and Innovation: Knowledge Base and Research Hotspots

Yang Defang, Tang Li

2023, 7 (1): 22-34. DOI: 10.11925/infotech.2096-3467.2022.0428

[Objective] This paper analyzes the status quo and evolution of the knowledge base and research hotspots of responsible research and innovation (RRI) based on international literature. [Coverage] We retrieved and analyzed a total of 657 international articles on RRI from the Web of Science. [Methods] We used bibliometric and visualization techniques to explore these articles. [Results] Researchers from the Netherlands and the United Kingdom played a leading role in the field of responsible research and innovation, while China has not yet established its position in this domain. The three knowledge bases of RRI are technology assessment and anticipatory governance, conceptual development in the EU context, and a refined framework in the global context. The popular topics are science, society and governance; conceptual framework and practice; ethics and value of technology development; and sustainability and impacts. [Limitations] The data coverage needs to be expanded. [Conclusions] This study calls for more diverse research methods and for expanding the application of RRI research to various contexts.

Automatically Extracting Technical Indicators from U.S. Commerce Control List

Yuan Yue, Pang Na, Li Guangjian

2023, 7 (1): 35-48. DOI: 10.11925/infotech.2096-3467.2022.0571

[Objective] This paper proposes a method to automatically extract technical indicators from the “U.S. Commerce Control List”, aiming to better understand the technical details of the listed products and U.S. export control policies. [Methods] We represented each technical indicator by its object, name, relationship, and value. Then, we proposed an automated model to extract technical indicators and stored them as structured four-element records. [Results] The proposed method effectively extracts the technical indicators from the “Commerce Control List” in an unsupervised manner. The precision and F1 values of our method reached 87.34% and 86.52%, respectively. [Limitations] The proposed extraction method is designed mainly for the text of the “Commerce Control List”, and more research is needed to examine it on other corpora. [Conclusions] The proposed method could effectively extract technical indicators from the “Commerce Control List” of the United States.

The Ideal and Reality of Metaverse: User Perception of VR Products Based on Review Mining

Cao Zhe, Guo Huilan, Wu Jiang, Hu Zhongyi

2023, 7 (1): 49-62. DOI: 10.11925/infotech.2096-3467.2022.0371

[Objective] This paper investigates the gap between users’ perception of VR products and the ideal technical requirements of the metaverse, aiming to support the latter’s optimization. [Methods] First, we retrieved 36,720 user reviews of 64 VR products sold by JD.com. Then, we used the LDA topic model and the BERT language model to construct indicators of attention and affection. Third, we quantitatively analyzed the users’ perception of VR products (technology). Finally, we compared these objective attributes of VR products with the technical requirements of the metaverse. [Results] We extracted five perceived attributes (function, quality control, use feeling, marketing and audio-visual experience) from the reviews. The audio-visual experience has the highest attention and affection, while marketing has the lowest. The function, use feeling and audio-visual experience have eight progressive or regressive manifestations in the four dimensions of technical requirements in the metaverse (immersion experience, accessibility, interoperability and scalability). The eight manifestations are high immersion, sensory imbalance, multiple connections, time and space constraints, multiplayer interaction, mobile obstacles, multi-functional design and equipment problems. [Limitations] The diversity and balance of samples need to be improved, and more research should be conducted on other types of metaverse equipment. [Conclusions] The existing VR products can meet the technical requirements of the metaverse in immersion experience, but there is still a long way to go to achieve accessibility, interoperability and scalability.

Mining Differentiated Demands with Aspect Word Extraction: Case Study of Smartphone Reviews

Xiao Yuhan, Lin Huiping

2023, 7 (1): 63-75. DOI: 10.11925/infotech.2096-3467.2022.0207

[Objective] This paper proposes a new deep learning algorithm to extract aspect words, aiming to achieve differentiated and refined user demand analysis. [Methods] We designed a Context Window Self-Attention (CWSA) model to extract aspect words. This model focuses on the semantics of the context window and adjacent text while drawing on the overall information of the full text. Then, we extracted the fine-grained product features from their reviews. Finally, we conducted aspect-level sentiment analysis to further examine user demands. [Results] The paper constructed a Chinese dataset for aspect word extraction and aspect-level sentiment analysis with nearly 900,000 reviews of smartphones sold by JD.com. The proposed CWSA model’s F1 score reached 89.65% on this dataset, which was better than those of the baseline models. [Limitations] Few Chinese datasets for aspect word extraction and aspect-level sentiment analysis are publicly accessible. More Chinese and English datasets covering multiple products need to be constructed to improve our model’s cross-language adaptability. [Conclusions] The proposed model improves differentiated and refined data mining.

Comparing Assessments of Computer-related Courses in Chinese and American Universities Based on Syllabus

Liu Siyuan, Feng Leilin, Zhu Zhangqian, Jia Tao

2023, 7 (1): 76-88. DOI: 10.11925/infotech.2096-3467.2022.0490

[Objective] This paper analyzes the differences in course assessments between Chinese and American universities, aiming to provide empirical support for engineering education reforms. [Methods] First, we retrieved more than 47,000 valid computer science related Chinese and English syllabi through automated data crawling. Then, we extracted the assessment sections of these syllabi by keywords. Finally, we compared and analyzed the contents and methods of these assessments quantitatively. [Results] Compared with Chinese universities, American universities assessed their students with more categories and more diversified contents. American universities attach great importance to students’ progress, while Chinese universities focus too much on final exams. [Limitations] There is a limited number of Chinese syllabi available online. Many syllabi contain much unstructured textual data, so errors might occur when using keywords to examine their contents. [Conclusions] This study provides empirical support for course assessment optimization in China.

Detecting Research Frontiers Based on Twitter

Wuxihong Jiangbulati, Wang Xiaomei, Chen Ting

2023, 7 (1): 89-101. DOI: 10.11925/infotech.2096-3467.2022.0111

[Objective] This paper designs a Twitter-based method to identify emerging research topics, aiming to track the latest developments of a specific discipline. [Methods] First, we analyzed the principles and practices of using Twitter to identify research topics. Then, we proposed a monitoring index system based on the influence of scholars and contents. Third, we conducted an empirical analysis in the field of natural language processing (NLP). [Results] The detection model is able to identify emerging research topics in NLP in a timely manner. Compared with reports on the NLP status quo, 8 of the 13 research frontiers were successfully identified. [Limitations] Due to the open nature of social media, it is difficult to completely avoid subject-independent noise during dataset construction. [Conclusions] The proposed method is based on scholarly user-generated content (UGC) on Twitter and is a feasible and effective way to detect the research frontiers of a discipline in a timely and forward-looking way.

Identifying Interdisciplinary Sci-Tech Literature Based on Multi-Label Classification

Wang Weijun, Ning Zhiyuan, Du Yi, Zhou Yuanchun

2023, 7 (1): 102-112. DOI: 10.11925/infotech.2096-3467.2022.0358

[Objective] This paper tries to identify interdisciplinary sci-tech literature, aiming to find emerging interdisciplinary issues. [Methods] We combined the discipline labels of sci-tech literature provided by specialists with labels predicted by text classification algorithms to find interdisciplinary studies. [Results] The F1 value of the proposed method reached 0.45, which was 0.22 higher than that of the model-based predictions. [Limitations] The model had low recall values for identifying interdisciplinary sci-tech research. [Conclusions] The paper effectively addresses the classification of interdisciplinary sci-tech literature, which merits more study in the future.

A Modified Hybrid Method to Identify Cited Spans

Nie Weimin, Ou Shiyan

2023, 7 (1): 113-127. DOI: 10.11925/infotech.2096-3467.2022.0402

[Objective] This paper proposes a new algorithm to identify cited contents, aiming to address the issues facing existing unsupervised models and to extend the granularity from a single sentence to several adjacent ones. [Methods] First, we established a modified hybrid method with supervised ranking to select candidates from all sentences of the cited literature. Then, we used a regression technique to determine the sentences containing the cited segments. Third, we used grouped adjacent sentences of the cited literature, namely n-sent, as inputs to the modified hybrid method. Finally, we conducted intraclass normalization to identify the cited contents. [Results] The modified hybrid method yielded a sentence overlapping F₁ value of 0.167 on the test set of CL-SciSumm 2019 and 2020. With 3-sent as input, the modified hybrid method improved the sentence overlapping F₁ value from 0.083 to 0.158 after intraclass Z-score normalization. [Limitations] The modified hybrid method did not utilize the sentence positions of the cited literature. In addition, the prospect of applying the proposed method to downstream tasks remains vague. [Conclusions] The proposed method could effectively identify cited segments whose granularity ranges from a single sentence to multiple adjacent sentences.
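
The intraclass Z-score normalization mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say what a "class" is, so the grouping key below (for example, the n of an n-sent candidate) is an assumption, and the function name `intraclass_zscore` is hypothetical rather than the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def intraclass_zscore(scores, classes):
    """Standardize each score against the mean and std of its own class.

    `classes` is an assumed grouping key (e.g. the n of an n-sent
    candidate); the abstract does not define it.
    """
    # Collect the raw scores of every class.
    groups = defaultdict(list)
    for score, cls in zip(scores, classes):
        groups[cls].append(score)

    # Population mean and standard deviation per class.
    stats = {cls: (mean(vals), pstdev(vals)) for cls, vals in groups.items()}

    normalized = []
    for score, cls in zip(scores, classes):
        mu, sigma = stats[cls]
        # A class with identical scores has sigma == 0; map it to 0.0.
        normalized.append(0.0 if sigma == 0 else (score - mu) / sigma)
    return normalized


# Hypothetical scores for two candidate classes on different scales.
example = intraclass_zscore([1.0, 3.0, 10.0, 20.0], ["1-sent", "1-sent", "3-sent", "3-sent"])
```

After normalization, scores from classes with different raw scales become comparable, which is presumably why the step helps when multi-sentence candidates are ranked.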

Classifying Customer Complaints Based on Multi-head Co-attention Mechanism

Wang Jinzheng, Yang Ying, Yu Bengong

2023, 7 (1): 128-137. DOI: 10.11925/infotech.2096-3467.2022.0258

[Objective] This paper tries to address the insufficient learning of relationships between features in traditional text classification models. [Methods] We developed a text classification model for customer complaints based on a multi-head co-attention mechanism. First, we used the BERT pre-trained model to create text vectors. Then, we constructed Text-CNN and BiLSTM multi-channel feature networks to extract the local and global features of the complaints. Finally, we used the co-attention mechanism to learn the relationship between the local and global features and classify the complaints. [Results] We examined our model on a public dataset (THUCNews), where its accuracy reached 97.25%, while the accuracy on the telecom customer complaint dataset reached 86.20%. Compared with the best-performing single-channel baseline model and the multi-channel model without feature interaction, the accuracy of the proposed model on the telecom customer complaint dataset improved by 0.54% and 0.35%, respectively. [Limitations] We only examined the interaction between the two features. Because the telecom customer complaint dataset is small, the classification of some complaints is not satisfactory. [Conclusions] The multi-channel feature extraction network can enrich text information and fully extract text features. The co-attention mechanism can effectively learn the relationship between text features and improve the model’s classification performance.

Reasoning Model for Temporal Knowledge Graph Based on Entity Multiple Unit Coding

Peng Cheng, Zhang Chunxia, Zhang Xin, Guo Jingtao, Niu Zhendong

2023, 7 (1): 138-149. DOI: 10.11925/infotech.2096-3467.2022.0225

[Objective] This paper tries to address two issues in temporal knowledge graph reasoning: incomplete extraction of entity information, and measuring the importance of different timestamps for the events to be reasoned. [Methods] We proposed a new model based on entity multiple unit coding (EMUC). EMUC introduces entity slice feature encodings for the current timestamps, entity dynamic feature encodings that fuse timestamp embeddings and entity static features, and entity segment feature encodings over historical steps. We also utilized a temporal attention mechanism to learn the importance weights of local structural information at different timestamps to the inference target. [Results] The experimental results of the proposed model on the ICEWS14 test set were MRR: 0.4704, Hits@1: 40.31%, Hits@3: 50.02%, Hits@10: 59.98%; on the ICEWS18 test set they were MRR: 0.4385, Hits@1: 37.55%, Hits@3: 46.92%, Hits@10: 56.85%; and on the YAGO test set they were MRR: 0.6564, Hits@1: 63.07%, Hits@3: 65.87%, Hits@10: 68.37%. Our model outperforms the existing methods on these evaluation metrics. [Limitations] EMUC has slow inference speed on large-scale datasets. [Conclusions] The new temporal attention mechanism measures the importance of historical local structure information for reasoning, which effectively improves the reasoning performance of the temporal knowledge graph.
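
For readers unfamiliar with the metrics reported above, the following minimal sketch shows the standard definitions of MRR and Hits@k, computed from the rank of the correct entity for each test query. The function names and the sample ranks are illustrative assumptions; this is not the authors' evaluation code, and it ignores details such as filtered versus raw ranking.

```python
def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank over all test queries (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Hits@k: fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test queries.
sample_ranks = [1, 2, 4, 10]
sample_mrr = mean_reciprocal_rank(sample_ranks)
sample_hits = {k: hits_at_k(sample_ranks, k) for k in (1, 3, 10)}
```

Both metrics lie between 0 and 1, and higher is better; the abstract reports Hits@k as percentages.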
