Data Analysis and Knowledge Discovery

Entity Alignment Method for Different Knowledge Repositories with Joint Semantic Representation

Li Wenna, Zhang Zhixiong

Data Analysis and Knowledge Discovery. 2021, 5(7): 1-9. https://doi.org/10.11925/infotech.2096-3467.2021.0143

[Objective] This paper combines the structural and semantic information of knowledge, aiming to create a better entity alignment method for different knowledge repositories. [Methods] First, we used the TransE model to represent the structure of entities and the BERT model to represent their semantic information. Then, we designed an entity alignment method based on the BTJE model (BERT and TransE Joint model for Entity alignment). Finally, we used a Siamese network model to complete the entity alignment tasks. [Results] We examined the new method with the DBP-WD and DBP-YG datasets. The optimal MRR values reached 0.521 and 0.413, while the Hits@1 values reached 0.542 and 0.478. These results were better than those of the traditional models. [Limitations] The experimental dataset needs to be expanded to further evaluate the performance of the proposed method. [Conclusions] Our new method could effectively complete entity alignment tasks for different knowledge bases.
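
The two metrics named in the results can be illustrated with a minimal sketch. It assumes that each test entity yields the 1-based rank of its true counterpart among the candidate entities. The function names and the sample ranks are illustrative and do not come from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct counterpart for four source entities.
example_ranks = [1, 2, 1, 4]
mrr = mean_reciprocal_rank(example_ranks)
hits1 = hits_at_k(example_ranks, 1)
```

Both functions take only the list of ranks, so they can be applied to the output of any alignment model that produces a ranked candidate list.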

Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation

Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei

Data Analysis and Knowledge Discovery. 2021, 5(7): 10-25. https://doi.org/10.11925/infotech.2096-3467.2020.1230

[Objective] This paper investigates the performance of entity recognition models for legal judgments, aiming to construct a better legal knowledge base in the future. [Methods] First, we extracted the court trial process and court opinions from criminal judgment texts to build an experimental dataset. Then, we compared the entity recognition results of the CRFs model (with manually constructed features), the IDCNN-CRFs model (with automatically generated features), and the BiLSTM-CRFs model. Both the IDCNN-CRFs and BiLSTM-CRFs models used pre-trained word vectors for their character embeddings. We also compared the models’ transfer ability on other types of legal judgment texts. [Results] The ALBERT-BiLSTM-CRFs model had the best recognition performance, with a micro-averaged F1 value of 95.28%. However, the training time of the IDCNN-CRFs model was about 1/6 of that of the ALBERT-BiLSTM-CRFs model. Both models had good transfer ability. [Limitations] Most of the recognized entities were general ones. More domain-related entities are needed in future studies to enhance the model’s practical value. [Conclusions] The ALBERT-BiLSTM-CRFs and IDCNN-CRFs models could recognize entities from legal judgments more effectively and showed better transfer ability than the CRFs model.

Extracting Events from Ancient Books Based on RoBERTa-CRF

Yu Xuehan, He Lin, Xu Jian

Data Analysis and Knowledge Discovery. 2021, 5(7): 26-35. https://doi.org/10.11925/infotech.2096-3467.2021.0094

[Objective] This paper constructs a framework to extract events from ancient books, which uses the RoBERTa-CRF model to identify event types, argument roles and arguments. [Methods] We collected the war sentences from Zuozhuan as the experimental data, which helped us establish the classification schema for event types and argument roles. Based on the RoBERTa-CRF model, we used the multi-layer Transformer to extract the corpus features, which were combined with the sequence tags to learn the correlation constraints. Finally, we identified and extracted the arguments from the tag sequence. [Results] The precision, recall and F1 values of the proposed model were 87.6%, 77.2% and 82.1%, which were higher than the results of the GuwenBERT-LSTM, Bert-LSTM, RoBERTa-LSTM, Bert-CRF and RoBERTa-CRF on the same dataset. [Limitations] The experimental dataset needs to be expanded, which could make the topic categories more balanced. [Conclusions] The RoBERTa-CRF model constructed in this paper could effectively extract events from ancient Chinese books.
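
The three reported figures are consistent with F1 being the harmonic mean of the first two, if the first value is read as precision (an assumption made here, not a statement from the paper). A short check, with an illustrative function name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract, expressed as fractions.
reported_f1 = f1_score(0.876, 0.772)
```

Rounded to three decimals, `reported_f1` matches the 82.1% given in the results.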

Extracting Financial Events with ELECTRA and Part-of-Speech

Chen Xingyue, Ni Liping, Ni Zhiwei

Data Analysis and Knowledge Discovery. 2021, 5(7): 36-47. https://doi.org/10.11925/infotech.2096-3467.2020.1296

[Objective] This paper proposes a method to extract financial events based on the ELECTRA model and part-of-speech information, aiming to address the issues of blurred entity boundaries and inaccurate extraction. [Methods] First, we fed the corpus into two ELECTRA pre-trained models, which identified key entities, the original semantic information, and part-of-speech. Then, we used the BiGRU model to extract contextual semantic dependencies and generate the original sequence tags. Finally, we addressed the issue of label deviation with the CRF model and extracted the financial events. [Results] We examined the new model with a financial event dataset and found that its F-value reached 70.96%, which was 20.74 percentage points higher than that of the BiLSTM-CRF model. [Limitations] The number of events in the dataset needs to be increased. The pre-trained model is large, so its use might be limited by GPU/TPU memory. [Conclusions] The model based on ELECTRA and part-of-speech could effectively identify the relationships among financial events to extract them.

Sentence Alignment Method Based on BERT and Multi-similarity Fusion

Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng

Data Analysis and Knowledge Discovery. 2021, 5(7): 48-58. https://doi.org/10.11925/infotech.2096-3467.2021.0033

[Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval and other natural language processing tasks. [Methods] First, we added BERT pre-training to the sentence alignment method, and extracted features with a bidirectional Transformer. Then, we represented the words’ semantics with Position embeddings, Token embeddings, and Segment embeddings. Third, we bi-directionally measured the source language sentence and its translation, as well as the target language sentence and its translation. Finally, we combined the BLEU score, cosine similarity and Manhattan distance to generate the final sentence alignment. [Results] We conducted two rounds of tests to evaluate the effectiveness of the new method. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the accuracy reached 99.47%, 98.31%, and 95.00% when the noise ratio was 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved by adding more semantic information. [Conclusions] The proposed method, which is better than the baseline systems in parallel corpus filtering and comparable corpus filtering tasks, could generate large-scale and high-quality parallel corpora.
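
The abstract does not say how the three measures are combined, so the following is only a sketch of one possible fusion. It assumes that the BLEU score is computed elsewhere and passed in, that the Manhattan distance is mapped to a similarity with 1 / (1 + d), and that the three terms are weighted equally. All function names, the mapping and the weights are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def manhattan_similarity(u, v):
    """Map the Manhattan (L1) distance into (0, 1]; identical vectors give 1."""
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + distance)


def fused_score(bleu, u, v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of BLEU, cosine similarity and Manhattan similarity."""
    w_bleu, w_cos, w_man = weights
    return (w_bleu * bleu
            + w_cos * cosine_similarity(u, v)
            + w_man * manhattan_similarity(u, v))


# Hypothetical sentence vectors and BLEU value for one candidate pair.
score = fused_score(0.5, [1.0, 0.0], [1.0, 0.0])
```

In a filtering setting, a candidate sentence pair would be kept when its fused score exceeds a threshold tuned on held-out data. That threshold is likewise an assumption, not a detail given in the abstract.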

RLCPAR: A Rewriting Model for Chinese Patent Abstracts Based on Reinforcement Learning

Zhang Le, Leng Jidong, Lv Xueqiang, Cui Zhuo, Wang Lei, You Xindong

Data Analysis and Knowledge Discovery. 2021, 5(7): 59-69. https://doi.org/10.11925/infotech.2096-3467.2021.0089

[Objective] This paper proposes a rewriting model for Chinese patent abstracts based on reinforcement learning (RLCPAR), aiming to address the issues of sentence redundancy and low accuracy in rewriting multi-sentence abstracts. [Methods] First, we used the RLCPAR to extract key sentences from patent descriptions with the help of a patent term dictionary and reinforcement learning. Then, we generated the candidate abstracts using the Transformer deep neural network. Finally, we merged the candidate abstracts with the original patent abstracts to obtain the rewritten abstracts after semantic de-duplication and sorting. [Results] The proposed model effectively completed the end-to-end rewriting of patent abstracts. The scores of RLCPAR were 56.95%, 37.21% and 51.24% on the ROUGE-1, ROUGE-2 and ROUGE-L criteria. [Limitations] The experimental data, which were mainly on Chinese medicine materials, need to be expanded to other fields. [Conclusions] The RLCPAR model is much better than other sequence generation methods and improves the rewriting quality of Chinese patent abstracts.

Automatically Extracting Structural Elements of Sci-Tech Literature Abstracts Based on Deep Learning

Zhao Danning, Mu Dongmei, Bai Sen

Data Analysis and Knowledge Discovery. 2021, 5(7): 70-80. https://doi.org/10.11925/infotech.2096-3467.2020.1139

[Objective] This paper proposes a deep learning-based method to automatically extract key elements from unstructured abstracts of sci-tech literature. [Methods] We used structured abstracts as the training corpus, and utilized deep learning methods (e.g., LSTM and the attention mechanism) to extract “objective”, “method” and “results” from the sci-tech literature, and then generated new structured abstracts. [Results] The method’s F-scores were 0.951, 0.916, and 0.960 for the three structural elements of “objective”, “method”, and “results”, respectively. [Limitations] The deep learning technique used in this paper offers limited interpretability. [Conclusions] The proposed method could effectively extract elements from unstructured abstracts.

Constructing Knowledge Graph with Public Resumes

Shen Kejie, Huang Huanting, Hua Bolin

Data Analysis and Knowledge Discovery. 2021, 5(7): 81-90. https://doi.org/10.11925/infotech.2096-3467.2021.0145

[Objective] This paper constructs a knowledge graph from public resume data with natural language processing technology, which provides a new tool for traditional data analysis. [Context] The proposed method could automatically extract professional backgrounds and job information from resumes, and then obtain the relationships of working experience and colleagues in the organizations. The visualized knowledge graph could provide decision support for talent selection and personnel appointment and removal tasks of enterprises and institutions. [Methods] First, we used a crawler to obtain the resume data and used the BERT-BiLSTM-CRF model to recognize entities. Then, we established the relationships between entities by defining rules and integrating external domain knowledge. Finally, we used the Neo4j graph database to store and visualize the data. [Results] The accuracy of the BERT-BiLSTM-CRF model on the entity recognition task was 84.85%. The constructed knowledge graph, which included resumes of 561 people, 8,174 entities in 3 categories, and 20,162 relationships in 5 categories, could support multi-angle queries and data mining. [Conclusions] The proposed model explores the internal relationships among resumes and provides a novel way to analyze resumes. However, it performs little precise entity alignment and establishes few relationships among institution entities.

A Multi-Label Classification Model with Two-Stage Transfer Learning

Lu Quan, He Chao, Chen Jing, Tian Min, Liu Ting

Data Analysis and Knowledge Discovery. 2021, 5(7): 91-100. https://doi.org/10.11925/infotech.2096-3467.2020.1173

[Objective] This paper proposes a multi-label classification model, aiming to improve data sampling and add common characteristics of the existing models. [Methods] We constructed a two-stage transfer learning model of “common domain - single-label data in the target domain - multi-label data”. Then, we trained this model in the general and the target domains, and fine-tuned it with the single-label data balanced by over-sampling. Finally, we transferred the model to multi-label data and generated multi-label classifications. [Results] We examined the new model with image annotations from medical literature. On multi-label classification tasks for images and texts, the F₁ score was improved by more than 50% compared to the one-stage transfer learning model. [Limitations] More research is needed to choose a better base model and sampling method for different tasks. [Conclusions] The proposed method could be used in the annotation, retrieval and utilization of big data sets with constraints.
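
The abstract mentions balancing the single-label data by over-sampling but does not name a specific technique. The sketch below shows the simplest variant, random over-sampling, which duplicates minority-class samples until every class matches the largest one. The function name, the seed parameter and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: class "x" has three samples, class "y" has one.
xs, ys = random_oversample(["a", "b", "c", "d"], ["x", "x", "x", "y"])
```

After the call, both classes contain three samples. Duplicating samples this way adds no new information, so it is usually applied to the training split only.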

Detecting Rumors with Uncertain Loss and Task-level Attention Mechanism

Yang Hanxun, Zhou Dequn, Ma Jing, Luo Yongcong

Data Analysis and Knowledge Discovery. 2021, 5(7): 101-110. https://doi.org/10.11925/infotech.2096-3467.2020.1216

[Objective] This paper proposes a new model with the help of an uncertainty loss function and a task-level attention mechanism, aiming to address the issue of setting main and auxiliary tasks in rumor detection. [Methods] First, we integrated the domain knowledge of rumor exploration, stance classification, and rumor detection. Then, we constructed a modified model with a task-level attention mechanism. Third, we used the uncertainty loss function to explore the weight relationship of each task and obtain better detection results. Finally, we examined our model’s performance with the Pheme4 and Pheme5 datasets. [Results] Compared to the existing models, the Macro-F of our model increased by 4.2 and 7.6 percentage points with Pheme4 and Pheme5. [Limitations] We only examined our model with the Pheme dataset. [Conclusions] The proposed method could effectively detect rumors without dividing the main and auxiliary tasks.
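
The abstract does not give the form of the uncertainty loss. One widely used simplified formulation of uncertainty-based task weighting scales each task loss L_i by exp(-s_i) and adds s_i as a regularizer, where s_i is a learnable log-variance for task i. The sketch below assumes that form, which may differ from the paper's. The function name and the numeric values are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Sum of exp(-s_i) * L_i + s_i over all tasks.

    A larger s_i lowers the weight of task i, while the added s_i term
    keeps the model from driving every weight to zero.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += math.exp(-s) * loss + s
    return total


# Hypothetical losses for three tasks; with s_i = 0 every weight equals 1.
combined = uncertainty_weighted_loss([0.8, 0.5, 1.2], [0.0, 0.0, 0.0])
```

In training, the s_i values would be parameters updated by the optimizer together with the network weights, so no task has to be designated as main or auxiliary by hand.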

Quantifying and Examining Privacy Paradox of Social Media Users

Zhu Hou, Fang Qingyan

Data Analysis and Knowledge Discovery. 2021, 5(7): 111-125. https://doi.org/10.11925/infotech.2096-3467.2021.0140

[Objective] This paper proposes a new model based on the traditional privacy calculus, aiming to effectively quantify the privacy paradox behaviors of social media users. [Methods] First, we quantified users’ information with the IRT model and grey relational analysis. Then, we built a model from the perspective of balancing benefits and risks. Third, we calculated and analyzed the equilibrium solution on social platforms with this new model. Finally, we evaluated our model’s performance with real-world user information. [Results] The perceived benefits of most social media users were higher than the perceived risks, which indicated the existence of the privacy paradox and was in line with the real-world situation. [Limitations] We did not fully examine the perceived benefit framework due to the lack of data. There is no proven standard for merging the proposed model’s two sections. [Conclusions] The proposed model supports the privacy paradox with objective data and lays a foundation for studying users’ privacy behaviors on social media.

Predicting Stock Trends with CNN-BiLSTM Based Multi-Feature Integration Model

Xu Yuemei, Wang Zihou, Wu Zixin

Data Analysis and Knowledge Discovery. 2021, 5(7): 126-138. https://doi.org/10.11925/infotech.2096-3467.2020.0907

[Objective] Based on traditional financial data analysis, this paper explores the impacts of online news on the stock market, aiming to improve the accuracy of predicting stock trends. [Methods] First, we used the Convolutional Neural Network (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM) to extract news events and their sentiment orientations. Then, we proposed a prediction model for stock trends, which combines the stock numerical data and the news event sentiments. Finally, we examined the feasibility of this model with two individual stocks (GREE Electric Appliance in the household appliance industry and ZTE in the electronic appliance industry). [Results] The prediction accuracy of our model was 11.6% and 25.6% higher than that of the existing algorithms. [Limitations] We did not evaluate the impacts of the prediction period on the performance of the proposed model. [Conclusions] The news events and their sentiment orientations could lead to the fluctuation of stock prices.

25 July 2021, Volume 5 Issue 7