Data Analysis and Knowledge Discovery

Quantification Constraint System for Pragmatic Disambiguation: From Linguistic Design to Computational Implementation

Yang Chunlei

Data Analysis and Knowledge Discovery. 2017, 1(11): 1-11. https://doi.org/10.11925/infotech.2096-3467.2017.0877

[Objective] This article tries to bridge ontological linguistic design and its computational implementation, taking the quantification constraint system (QCS) for pragmatic disambiguation as an example. [Methods] First, this study explained the rationale of QCS and introduced the new research methodology. Second, the author proposed criteria for identifying effective constraints and integrating them into a system, as well as an algorithm for optimizing vote assignment. Third, it described the lexical and grammatical regularities pertinent to quantification. Fourth, this paper formulated these regularities in type description language (TDL) and implemented them on two Chinese computational grammars, namely ManGO and Zhong[|]. [Results] The new method effectively processed complex linguistic phenomena (e.g., quantification, binding and anaphora). [Conclusions] The proposed method could accelerate the development of linguistics and provide technical support for artificial intelligence and deep linguistic processing.

Evaluating PU Learning Based on Associative Classification Algorithm

Yang Jianlin,Liu Yang

Data Analysis and Knowledge Discovery. 2017, 1(11): 12-18. https://doi.org/10.11925/infotech.2096-3467.2017.0544

[Objective] We examine PU (positive and unlabeled) learning with the associative classification algorithm CBA. [Methods] First, we treated α% of the positive examples as unidentified positive examples and combined them with the negative samples to construct the corpus. Then, we classified examples based on all positive class association rules. Finally, we evaluated the reliability of the class association rules with relative confidence. [Results] We set α to 0%, 30%, 60%, and 90%. Compared to CBA, the AUC of the proposed PU learning algorithm increased by 6.21%, 11.15%, 13.50%, and 16.56%, respectively. Compared to POSC4.5, the AUC increased by 11.27%, 15.03%, 12.22%, and 7.37%, respectively. [Limitations] We did not modify the confidence of the class association rules based on the estimated proportion of positive examples. The classification accuracy of the proposed PU learning algorithm gradually decreased as the value of α increased. We did not investigate the redundant rules of the CBA algorithm. [Conclusions] The proposed PU learning algorithm outperformed the CBA and POSC4.5 algorithms.
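
The data construction step in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name make_pu_dataset, the seed parameter, and the use of plain Python lists are assumptions made for the example.

```python
import random


def make_pu_dataset(positives, negatives, alpha, seed=0):
    """Hide a fraction `alpha` of the positives among the unlabeled examples.

    Hypothetical helper: returns (labeled_positives, unlabeled), where
    `unlabeled` mixes the hidden positives with all negative examples.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a fraction between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    n_hidden = round(len(shuffled) * alpha)
    hidden, labeled = shuffled[:n_hidden], shuffled[n_hidden:]
    unlabeled = hidden + list(negatives)
    rng.shuffle(unlabeled)  # the learner must not see which items were hidden
    return labeled, unlabeled


# Example with alpha = 30%, one of the settings reported in the abstract.
labeled, unlabeled = make_pu_dataset(list(range(10)), list(range(100, 120)), 0.3)
```

With 10 positives and 20 negatives, this leaves 7 labeled positives and 23 unlabeled examples, three of which are in fact positive.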

An Improved Method of Semantic Similarity Calculation of Chinese Trademarks

Zhai Dongsheng,Cai Wenhao,Zhang Jie,Li Zhenfei

Data Analysis and Knowledge Discovery. 2017, 1(11): 19-28. https://doi.org/10.11925/infotech.2096-3467.2017.0766

[Objective] This paper proposes a new method to determine the semantic similarity of Chinese trademarks, aiming to support judgments on trademark infringement. [Methods] First, we modified the HowNet-based algorithm with new parameters to calculate semantic similarity. Second, we retrieved a large amount of trademark data to expand the coverage of HowNet. Third, we compared the performance of the traditional and improved methods on the sample data. [Results] The modified algorithm yielded better results than the traditional method. [Limitations] The supporting data for similarity detection, i.e., the trademark database, needs to be expanded. [Conclusions] The proposed method could effectively detect the semantic similarity of Chinese trademarks.

Evaluating Online Healthcare Consultation Feedbacks Based on Signal Transmission Algorithm

Liu Tong,Yang Jingcheng

Data Analysis and Knowledge Discovery. 2017, 1(11): 29-36. https://doi.org/10.11925/infotech.2096-3467.2017.0566

[Objective] This paper designs and implements an unsupervised algorithm to evaluate the accuracy of physicians’ feedback in online consulting services. [Methods] First, we identified word co-occurrence relationships from a large number of online service records. Then, we built a statistical model to predict standard feedback for given questions. Finally, we determined the accuracy of physicians’ answers by calculating the content similarity between the real feedback and the standard feedback. [Results] We examined the proposed algorithm with records from Haodf.com and with manually labeled results. The accuracy rates were 41.0% and 82.4% for rigorous and relaxed matching, respectively. [Limitations] We did not include word sequence information in the algorithm. [Conclusions] The proposed algorithm could help patients judge the accuracy of online medical information and improve their healthcare decision making.

Recognizing and Analyzing Cited Spans in Literature

Xu Jian,Li Gang,Mao Jin,Ye Guanghui

Data Analysis and Knowledge Discovery. 2017, 1(11): 37-45. https://doi.org/10.11925/infotech.2096-3467.2017.0606

[Objective] This paper analyzes the features of cited document spans and compares the effectiveness of several recognition techniques. [Methods] First, we analyzed the annotated cited-span data from CL-SciSumm 2016 for length and position features as well as correlations with citation contexts. Then, we compared the bag-of-words, topic model, and semantic dictionary (WordNet) methods by their performance in recognizing cited spans. [Results] We found that 96% of the annotated cited spans were shorter than three sentences, and most of the cited spans occurred in the front part of the whole paper or of each chapter. The average TextRank weight of these cited spans was significantly higher than that of regular spans. The length of the cited spans was correlated with the length of their corresponding sections; however, there was no obvious tie with the position features. The bag-of-words method was the most effective, followed by the methods based on semantic similarity and the topic model. [Limitations] Our discussion of the concept and characteristics of cited spans is theoretical. All data analysis was done with the annotation dataset of CL-SciSumm 2016. [Conclusions] Word choice in scientific literature is formal and rigorous, which makes lexical features important in recognizing cited spans.
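
As a rough illustration of the bag-of-words approach mentioned above, the sketch below ranks the sentences of a cited paper by cosine similarity to a citation context. It is an assumption-laden example rather than the authors' implementation: the tokenizer, the cosine measure, and the names bag_of_words, cosine, rank_candidate_spans, and top_k are all chosen here for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens (a simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidate_spans(citation_context, sentences, top_k=2):
    """Return indices of the `top_k` sentences most similar to the citation context."""
    ctx = bag_of_words(citation_context)
    scored = [(cosine(ctx, bag_of_words(s)), i) for i, s in enumerate(sentences)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then by position
    return [i for _, i in scored[:top_k]]


# Toy sentences invented for the example.
sentences = [
    "We train a parser on treebank data.",
    "The weather was pleasant.",
    "Results are reported in Table 2.",
]
best = rank_candidate_spans("The parser was trained on treebank data", sentences, top_k=1)
```

A purely lexical score like this one needs no external resources, which fits the abstract's conclusion that lexical features matter for this task.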

Automatic Recognition of Legal Language Entities Based on Conditional Random Fields

Zhang Lin,Qin Ce,Ye Wenhao

Data Analysis and Knowledge Discovery. 2017, 1(11): 46-52. https://doi.org/10.11925/infotech.2096-3467.2017.0442

[Objective] This paper aims to automatically identify Legal Language Entities, which lays the foundation for text mining of judgements. [Methods] First, we used a crawler to retrieve the needed data and manually annotated the corpus. Then, we loaded the legal field dictionary into NLPIR to segment the corpus. Finally, we constructed the feature template based on the conditional random field and automatically recognized the Legal Language Entities. [Results] The conditional random field model with internal and external features of legal language could automatically identify the legal words, and its harmonic mean was over 90%. [Limitations] The proposed model has some limitations in field expansion. [Conclusions] It is feasible to automatically extract Legal Language Entities with the help of conditional random fields.

Topic Representation Model Based on “Feature Dimensionality Reduction”

Liu Bingyao,Ma Jing,Li Xiaofeng

Data Analysis and Knowledge Discovery. 2017, 1(11): 53-61. https://doi.org/10.11925/infotech.2096-3467.2017.0707

[Objective] This study aims to solve the high-dimensionality and sparsity issues facing traditional large-scale corpus analysis methods. [Methods] First, we used the probability of co-occurrence to represent the mutual information between words, and extracted word combinations with values higher than the threshold. Then, we constructed the initial network with the third-level entries based on syntactic structure. Finally, we developed the text complex network with the correction algorithm to express topic semantics. [Results] We retrieved 6,936 micro-blog posts from the trending topic “global outbreak of network ransomware” as the experiment corpus, and built a network model with 217 nodes and 2,019 edges. We also explored micro-blogging topics with the new model. [Limitations] More research is needed on node weight assignment in text complex networks. [Conclusions] The proposed model could effectively reduce the redundancy of network nodes and improve the semantic expression of the topic complex network.
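
The first step above, scoring word pairs by co-occurrence and keeping those above a threshold, might look like the following sketch. It assumes pointwise mutual information (PMI) computed over document-level co-occurrence; the abstract does not state the exact formula, and the function name pmi_pairs and the toy corpus are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_pairs(documents, threshold):
    """Return {(word_a, word_b): pmi} for pairs whose PMI exceeds `threshold`.

    `documents` is a list of token lists. Probabilities are estimated as
    document frequencies divided by the number of documents (an assumption).
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in documents:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    selected = {}
    for (a, b), n_ab in pair_df.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        pmi = math.log2(n_ab * n_docs / (word_df[a] * word_df[b]))
        if pmi > threshold:
            selected[(a, b)] = pmi
    return selected


# Toy corpus invented for the example.
docs = [
    ["ransomware", "outbreak"],
    ["ransomware", "outbreak"],
    ["network", "update"],
    ["network", "outbreak"],
]
edges = pmi_pairs(docs, threshold=0.0)
```

The retained pairs could then serve as candidate edges of a word network, with the threshold controlling how many nodes and edges survive.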

Identifying Lead Players of User Innovation Communities Based on Feature Extraction and Random Forest Classification

Yuan Xinwei,Yang Shaohua,Wang Chaochao,Du Zhanhe

Data Analysis and Knowledge Discovery. 2017, 1(11): 62-74. https://doi.org/10.11925/infotech.2096-3467.2017.0694

[Objective] This paper aims to identify the lead players of user innovation communities to promote open innovation for enterprises. [Methods] First, we extracted user features from the content and behavior data of the innovation community. Then, we proposed a method to identify the lead users based on the Random Forest classification model. Finally, we examined the new method with real data from the MIUI forum of the Xiaomi community. [Results] The proposed method could distinguish lead users from non-lead users. [Limitations] We only examined our method with the MIUI forum; therefore, adjustments are needed before using it for other user innovation communities. [Conclusions] The proposed method could identify lead users from various online communities more efficiently and effectively.

Linking Knowledge Elements from Online Community

Chen Guo,Xiao Lu

Data Analysis and Knowledge Discovery. 2017, 1(11): 75-83. https://doi.org/10.11925/infotech.2096-3467.2017.0752

[Objective] This paper proposes a system to link the fragmented knowledge elements of an online community, aiming to help users explore knowledge more effectively. [Methods] First, we built a domain knowledge base for the online community. Then, we combined units of the domain knowledge base with the semantically similar elements of the user-generated content (UGC). Finally, we identified the knowledge units of the UGC and linked them with relevant Web pages. [Results] We examined the proposed method with a Chinese cardiovascular BBS site. A total of 2,211 cardiovascular concepts and 5,741 fine-grained relations were extracted to create the domain knowledge base. We automatically identified the knowledge elements in 5,020 posts and linked them with relevant Web pages. [Limitations] We only investigated the linking of knowledge elements at the micro level. [Conclusions] The proposed system can effectively establish connections between knowledge units and UGC documents based on the existing resource organization schemes. The new method could be used in other fields.

Studying Dietary Preferences of Chinese Residents

Yue Zijing,Zhang Chengzhi,Zhou Qingqing

Data Analysis and Knowledge Discovery. 2017, 1(11): 84-93. https://doi.org/10.11925/infotech.2096-3467.2017.0782

[Objective] This study investigates the dietary preferences of Chinese users from different regions to reveal the differences in dietary culture among them, and then provides suggestions to the catering industry. [Context] In the past, researchers needed a long period of time to collect even a small amount of data on dietary preferences. With the development of social media, large-scale dietary information can be retrieved more effectively. [Methods] We collected user-generated content (UGC) from Dianping.com to explore users’ dietary preferences. [Results] Users’ dietary preferences were very different in the developed regions. Meanwhile, there was a significant negative correlation between geographic distance and the similarity of users’ dietary preferences. Finally, users paid more attention to the taste, service and environment of the restaurants. [Conclusions] Research based on user-generated content can reflect users’ dietary preferences and reveal the differences among dietary cultures.

Evaluating Academic Credits of Scientific Research Project Leaders

Huai Mengjiao,Pan Yuntao,Yuan Junpeng

Data Analysis and Knowledge Discovery. 2017, 1(11): 94-102. https://doi.org/10.11925/infotech.2096-3467.2017.0646

[Objective] This study proposes and examines a system to evaluate the academic credit of scientific project leaders. [Methods] First, we established the scientific credit evaluation system based on suggestions from seven experts. Then, we examined this system with 100 leaders of important scientific research projects using the fuzzy comprehensive evaluation method. [Results] The proposed method could assess the academic credit of scientific research project leaders effectively. [Limitations] Our new system was relatively simple, and the sample was not comprehensive. [Conclusions] The academic credit evaluation system is practical and could help administrators appraise the performance of scientific project leaders.
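
For readers unfamiliar with fuzzy comprehensive evaluation, the sketch below shows the common weighted-average form of the technique: a weight vector over criteria is combined with a membership matrix (criteria by grades), and the grade with the largest combined membership is selected. The criteria weights, membership values, and grade labels are invented for illustration and are not taken from the paper, which may use a different composition operator.

```python
def fuzzy_comprehensive_evaluation(weights, membership, grades):
    """Combine criterion weights with a membership matrix (weighted-average operator).

    weights:    one weight per criterion, summing to 1
    membership: membership[i][j] = degree to which criterion i belongs to grade j
    grades:     grade labels, one per column of `membership`
    Returns (combined_scores, best_grade).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if len(weights) != len(membership):
        raise ValueError("need one membership row per criterion")
    if any(len(row) != len(grades) for row in membership):
        raise ValueError("each membership row needs one value per grade")
    scores = [
        sum(w * row[j] for w, row in zip(weights, membership))
        for j in range(len(grades))
    ]
    best = max(range(len(grades)), key=scores.__getitem__)
    return scores, grades[best]


# Hypothetical numbers: three criteria, three grades.
weights = [0.5, 0.3, 0.2]
membership = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]
scores, grade = fuzzy_comprehensive_evaluation(weights, membership, ["good", "fair", "poor"])
```

In this toy case the combined memberships are 0.43, 0.33 and 0.24, so the maximum-membership rule assigns the grade "good".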

25 November 2017, Volume 1 Issue 11