[Objective] This paper reviews leading research methodologies and related studies on comparative opinion mining, and then provides useful guidance for future research. [Coverage] We retrieved 55 scholarly papers from Web of Science, Google Scholar and CNKI using the keywords "comparative opinion", "comparative sentence" or "comparative relation". [Methods] Based on the retrieved literature, we discussed the latest developments in classification schemes, recognizing comparative sentences, extracting comparative relations and analyzing sentiments of comparative opinions. [Results] Because sequence rules are finite, it was difficult to further improve the performance of comparative opinion recognition techniques. Meanwhile, few studies focused on latent comparative opinions, and current technology could not extract the comparative elements effectively. More research is needed to conduct fine-grained sentiment analysis of comparative opinions. [Limitations] We did not empirically examine the different methods of comparative opinion mining. [Conclusions] This paper presents a framework for future studies. New research should focus on identifying and tracking potential competitors, analyzing the competitive edges of products, and providing comparative reports for different products.
[Objective] This paper reviews semantic text mining techniques for intelligence analysis. [Coverage] We surveyed the leading semantic text mining research on intelligence analysis from the last ten years, along with a few earlier studies. [Methods] We first discussed semantic text mining methodologies and algorithms at the word, sentence and paragraph levels. Then, we analyzed these techniques from the perspectives of topic evolution and applications of mining technologies. [Results] Compared with traditional intelligence analysis methods, semantic text mining approaches could process unstructured data and deal with multi-layer structured data. [Limitations] We only reviewed the leading studies and their applications in the scientific field. [Conclusions] Semantic text mining improves the performance of traditional intelligence analysis systems and is becoming the future direction of research methodology. More research is needed to enrich the underlying semantic resources.
[Objective] This study aims to build an intelligent evaluation analysis system consisting of evaluation sentence recognition, polarity identification and evaluation object extraction. [Methods] We first studied a Chinese evaluation ontology. Then, we established an evaluation analysis rule base based on the results of the ontology research. Finally, we implemented these rules in the intelligent evaluation analysis system CUCsas. [Results] Taking 50,000 Weibo messages (a total of 133,201 sentences) released by the 7th Chinese Opinion Analysis Evaluation Conference (COAE2015) as the experimental data, the precision, recall and F-score of evaluation sentence recognition and polarity identification of CUCsas were 0.83, 0.70 and 0.76 respectively, but the experimental results of evaluation object extraction were poor. [Limitations] The system lacked modules for discovering new evaluation factors and automatically constructing domain lexicons. [Conclusions] A practical intelligent evaluation analysis system was basically built.
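The reported F-score follows directly from the stated precision and recall as their harmonic mean; a one-line check using the figures from the abstract:

```python
# F-score as the harmonic mean of the reported precision and recall.
p, r = 0.83, 0.70
f = 2 * p * r / (p + r)
print(round(f, 2))  # 0.76, matching the reported F-score
```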
[Objective] This paper proposes an algorithm to build a "Feature Items Ontology". [Context] Trending topics online change constantly and span extensive fields. Existing research on automatic ontology creation is limited to specific areas and cannot effectively process dynamic trending topics. [Methods] First, we analyzed the contents of major events from the trending topics. Second, we designed an algorithm to automatically generate the ontology. Third, guided by the initial ontology, we proposed an evolutionary algorithm to track the changing topics. [Results] Using the case of "Wei Zexi and Baidu" as an example, we collected 11,174 Sina Weibo posts to conduct two rounds of experiments. We initially extracted 7,421 feature items, 39 key nodes, and 781 key relationships. After evolution, we obtained 24,564 feature items, 67 key nodes, and 1,818 key relationships. The missing rate, false positive rate, and loss cost were 0.1261, 0.0964 and 0.5985 respectively, all better than those of the TF-IDF algorithm. [Conclusions] The "Feature Items Ontology" describes topics more accurately than a single-word ontology and makes it easier to calculate semantic similarity. It is an appropriate method for retrieving semantic information from dynamic trending topics.
[Objective] This paper aims to predict co-authorship more effectively and reduce information loss. [Methods] First, we constructed a paper-author bipartite network and its co-authorship counterpart in the field of library and information science. Second, we described the relationships among authors with paths of length two and three in the bipartite network. Third, we used logistic regression to learn the influence of the different factors. Finally, we predicted co-authorship in the paper-author bipartite network with various indicators. [Results] We found significant information loss in the transformation from the paper-author bipartite network to the co-authorship network. Logistic regression was an appropriate way to learn the contributions of the paths. The new indicators were more accurate, and the predicted co-authorships could be interpreted more easily. [Limitations] We did not include multiple-path methods in the present study, and more research is needed to examine the proposed method in other areas. [Conclusions] Co-authorship prediction should be conducted in the paper-author bipartite network to reduce information loss. The path-combination indicator in the paper-author bipartite network might be the most effective method to predict co-authorship, and it could also be applied to patent-inventor bipartite networks.
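The path-based features described above can be illustrated with a toy paper-author bipartite network. The papers and authors below are invented for demonstration; in the study, such path counts (together with longer paths) would feed a logistic regression model:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical paper-author bipartite network: paper -> set of authors.
papers = {
    "p1": {"alice", "bob"},
    "p2": {"alice", "carol"},
    "p3": {"bob", "carol"},
    "p4": {"alice", "bob"},
}

# Count length-2 paths between author pairs (author - paper - author),
# i.e. the number of papers each pair has co-authored.
path2 = defaultdict(int)
for authors in papers.values():
    for a, b in combinations(sorted(authors), 2):
        path2[(a, b)] += 1

print(path2[("alice", "bob")])  # 2 shared papers (p1 and p4)
```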
[Objective] This study investigates the co-occurrence of blog comment contributors, aiming to explore their roles in blog post clustering. [Methods] We developed a two-step clustering method. First, we constructed the co-occurrence matrix of the contributors across blog posts, transformed it into a correlation matrix, and then completed the first-step clustering with the Affinity Propagation (AP) algorithm. Second, we calculated the terms' position weights based on the AP cluster centers, and then completed the second-step clustering of blog post content with the K-means algorithm. [Results] The average precision and recall of the proposed method were 0.66 and 0.57, significantly higher than those of the traditional methods. [Limitations] Contributor co-occurrence improved clustering quality, but it had limited value for blog posts with few comments. [Conclusions] The proposed method improves the quality of blog post clustering by combining term and contributor co-occurrence. The two-step clustering method is a better way to select the initial cluster centers for the K-means algorithm.
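A minimal sketch of the two-step idea, using scikit-learn and invented 2-D feature vectors in place of the real correlation matrix: Affinity Propagation first finds exemplars, which then seed K-means.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

# Invented feature vectors standing in for rows of the correlation matrix:
# two obvious groups of three blog posts each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Step 1: Affinity Propagation chooses exemplars without a preset cluster count.
ap = AffinityPropagation(random_state=0).fit(X)
centers = ap.cluster_centers_

# Step 2: the AP exemplars initialize K-means, which refines the final clusters.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1, random_state=0).fit(X)
labels = km.labels_
```

Seeding K-means with the AP exemplars is exactly the design choice the abstract's conclusion highlights: it removes the need to guess k or the initial centers.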
[Objective] This study aims to examine the creation and development of online news topics, and then to gauge public opinion. [Methods] First, we introduced manifold learning technology to analyze the news topics. Second, we explored the relations among the high-dimensional topics from each time window, which were identified by the LDA model. Third, we clustered these topics and visualized the relations among them in a low-dimensional space. Finally, we analyzed topic evolution with the help of social network theory. [Results] The proposed method effectively identified the topic evolution trends of CNN's news reports on China in 2015. [Limitations] We did not fully explore the impact of the time window settings. [Conclusions] This study provides a new method to visualize the evolution of news report topics over time, which avoids inaccurate descriptions caused by changes across adjacent time windows.
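The pipeline of LDA topics followed by a low-dimensional embedding can be sketched as follows (scikit-learn, with random document-term counts in place of real news data; MDS stands in for whichever manifold learning method the study used):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import MDS

# Random document-term counts standing in for news articles in the time windows.
rng = np.random.default_rng(42)
counts = rng.integers(1, 6, size=(10, 8))  # 10 documents, 8 terms

# Step 1: LDA identifies high-dimensional topics (here 3 topics over 8 terms).
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Step 2: embed the topic-term distributions into 2-D for visualization;
# nearby points indicate related topics across time windows.
coords = MDS(n_components=2, random_state=0).fit_transform(topic_term)
```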
[Objective] This study aims to identify microblog post topics and automatically extract high-quality ones with the help of text clustering techniques. [Methods] We collected food-related microblog posts from Sina Weibo as raw data, then applied text clustering and deep learning techniques to detect the target topics. First, we categorized the microblog posts by the four seasons according to their publishing dates. Second, we created a vector space model and used a text clustering method to retrieve candidate topics. Finally, we automatically identified the high-quality topics with deep learning technology. [Results] The automatically identified high-quality topics matched those found manually by researchers, and their topic coverage values were all higher than 0.5. [Limitations] We judged topic quality based on qualitative data. [Conclusions] The proposed method could extract high-quality topics effectively. The retrieved topics reflect the distribution of food-related microblog posts across the four seasons.
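The vector-space-plus-clustering step could look like the following sketch (scikit-learn; the four posts are invented stand-ins for the seasonal Weibo data, and the deep learning quality filter is omitted):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for food-related microblog posts from two seasons.
posts = [
    "spring vegetables fresh salad",
    "fresh spring salad recipe",
    "winter hotpot soup warm",
    "warm winter soup hotpot recipe",
]

# Vector space model over the posts, then K-means to form candidate topics.
X = TfidfVectorizer().fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```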
[Objective] This paper proposes a framework to effectively identify technology opportunities with anomaly detection techniques. [Methods] First, we constructed a similarity matrix and conducted multidimensional scaling analysis. Second, we identified potential technology opportunities from patents with a variety of anomaly detection algorithms. Finally, we extracted the possible breakthroughs with the help of TRIZ's laws of technology system evolution. [Results] We analyzed patent data from the DII database and identified technology opportunities in different phases of the laser lithography field. We found that the technology opportunities identified by the proposed framework later became mainstream technologies. [Limitations] The objectivity and accuracy of the new method need to be improved. [Conclusions] The proposed framework based on anomaly detection could effectively identify technology opportunities.
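A minimal sketch of the MDS-plus-anomaly-detection idea, using an invented 5x5 patent similarity matrix and Isolation Forest as one possible anomaly detector (the study used several):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.manifold import MDS

# Invented similarity matrix: patents 0-3 are similar, patent 4 is an outlier.
sim = np.array([
    [1.0, 0.9, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 1.0],
])

# Multidimensional scaling on the dissimilarity (1 - similarity) matrix.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1 - sim)

# Flag anomalous patents; -1 marks a potential technology opportunity.
flags = IsolationForest(contamination=0.2, random_state=0).fit_predict(coords)
```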
[Objective] This study aims to build a CRF model with multiple features that can automatically extract chemical and disease named entities from biomedical documents. [Methods] We compared the performance of popular named entity recognition features, including lexical features, domain knowledge features, dictionary matching features and unsupervised learning features, and then optimized the new model. [Results] We built the final CRF model with lexical features, dictionary matching features, unsupervised learning features and part of the domain knowledge features. The precision, recall, and F-score for the chemical entity identification task were 97.33%, 80.76%, and 88.27%, respectively. For disease entities, they were 84.20%, 81.96%, and 83.07%, respectively. [Limitations] Chemical and disease entities may interfere with each other when identified simultaneously. The discarded domain knowledge features may still contain valuable information. [Conclusions] This study proposes a new method to identify biomedical named entities, which could be further improved.
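The kinds of lexical and dictionary-matching features described above can be sketched as a simple per-token feature extractor (pure Python; the tokens and the tiny chemical dictionary are hypothetical, and a real pipeline would feed such dicts to a CRF toolkit):

```python
# Hypothetical dictionary used for the dictionary-matching feature.
CHEMICAL_DICT = {"aspirin", "ibuprofen"}

def token_features(tokens, i):
    """Lexical and dictionary features for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),                          # lexical: lowercased form
        "is_title": tok.istitle(),                     # lexical: capitalization
        "suffix3": tok[-3:].lower(),                   # lexical: 3-char suffix
        "in_chem_dict": tok.lower() in CHEMICAL_DICT,  # dictionary matching
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sent = ["Aspirin", "reduces", "fever"]
f0 = token_features(sent, 0)
```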
[Objective] This paper proposes a Subject-Based Early Warning System for Patents, which provides a solution for long-term project tracking, early warning analysis, and data reuse. [Methods] The system integrates open-source systems and tools (e.g. DSpace, OpenRefine, ECharts, and VOSviewer) and implements functions for data storing, tracking, classifying, cleansing, analyzing and managing. [Results] First, we built the new system for the subject of extreme ultraviolet lithography. Second, we examined the feasibility and effectiveness of the new system. [Limitations] Data processing automation, data analysis indicators, and content mining need to be optimized. [Conclusions] The proposed system could track, manage and utilize patent information effectively.
[Objective] This paper proposes a new video watermarking algorithm to protect the copyright of online video resources from Libraries, Archives and Museums (LAM). [Context] The proposed algorithm maintains the original visual quality of the videos and meets the real-time demands of online copyright protection. [Methods] First, we defined the pixel values of an 8-bit watermark image as an index. Second, we embedded the index and the actual watermark information into the images alternately, and encrypted the watermark with the Arnold transform. Finally, the watermark was embedded, with the help of quantization modulation, into video segments randomly selected by the keys. [Results] The proposed algorithm identified and verified the copyright message of the protected videos effectively; the NC (normalized correlation) coefficient was above 0.8, and the watermark could be extracted in about three seconds. [Conclusions] The new method could protect video copyrights and promote the information sharing and service integration of LAM resources.
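The Arnold transform used to encrypt the watermark is the standard cat-map scramble of a square image; a minimal NumPy sketch (the 4x4 watermark is invented, and a real system would derive the iteration count from the key):

```python
import numpy as np

def arnold(img, iterations=1):
    """Scramble a square image with the Arnold cat map (x,y) -> (x+y, x+2y) mod n."""
    n = img.shape[0]
    out = img
    for _ in range(iterations):
        nxt = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = nxt
    return out

def arnold_inverse(img, iterations=1):
    """Undo the scramble with the inverse map (x,y) -> (2x-y, y-x) mod n."""
    n = img.shape[0]
    out = img
    for _ in range(iterations):
        nxt = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                nxt[(2 * x - y) % n, (y - x) % n] = out[x, y]
        out = nxt
    return out

watermark = np.arange(16).reshape(4, 4)  # hypothetical 4x4 watermark
scrambled = arnold(watermark, iterations=2)
recovered = arnold_inverse(scrambled, iterations=2)
```

Because the map is a bijection on the pixel grid, the scrambled watermark is fully recoverable by whoever knows the iteration count, which is what lets the extraction side verify the copyright message.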