Big data analytics is often prohibitively costly. It is typically conducted by parallel processing on a cluster of machines, and is considered a privilege of big companies that can afford the resources. This position paper argues that big data analytics is accessible to small companies with constrained resources. As evidence, we present BEAS, a framework for querying big relations with constrained resources, based on bounded evaluation and data-driven approximation.
[Objective] This study explores word embedding representation features for entity relation extraction, aiming to add semantic information to existing methods. [Methods] First, we used features at the word embedding representation, vocabulary, and grammar levels to extract relations with Naive Bayes, Decision Tree, and Random Forest models. Then, we obtained the optimal subset of the full feature set. [Results] With full features, the Decision Tree algorithm achieved the best accuracy, 0.48. The F1 score of Member-Collection (E2, E1) was 0.70, and dependency information helped extract the relations. [Limitations] The relation extraction results need to be improved for small sample sizes and complex situations. The word vector training method could be further optimized. [Conclusions] This study demonstrates the effectiveness of the three types of features. The word embedding representation level features play an important role in relation extraction.
[Objective] This article examines which online reviews attract more positive votes from consumers, aiming to identify high-quality reviews based on the information adoption and negativity bias theories. [Methods] First, we retrieved 12,393 reviews on cellphones from Amazon.cn. Then, we investigated the impact of review characteristics on the number of positive votes using zero-inflated negative binomial regression and text analysis methods. The characteristics we studied include the reviewer’s credibility and the review’s quality and extremity. [Results] The usefulness of the reviewer’s previous posts, the information quality of the review, the number of comments, extreme ratings, and the negativity of the review helped reviews receive more positive votes. However, whether the reviewer had bought the product and the number of reviews the reviewer had previously posted had a negative influence on the number of votes. [Limitations] This study investigated only cellphones. [Conclusions] This paper helps e-commerce websites improve their review ranking algorithms.
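The zero-inflated negative binomial model named in the Methods can be illustrated with its probability mass function. The following is a minimal sketch, assuming the common NB2 parameterization (mean `mu`, dispersion `alpha`) and a constant zero-inflation probability `pi`; these parameter names and values are illustrative assumptions and are not taken from the study, which does not report its model specification here.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial (NB2) probability of count k, given mean mu and dispersion alpha."""
    r = 1.0 / alpha        # shape parameter
    p = r / (r + mu)       # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def zinb_pmf(k, mu, alpha, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero
    (for example, a review that never receives votes); otherwise it follows NB2."""
    base = (1.0 - pi) * nb_pmf(k, mu, alpha)
    return pi + base if k == 0 else base


# Hypothetical parameters: mean 2 votes, dispersion 0.5, 30% structural zeros.
probabilities = [zinb_pmf(k, mu=2.0, alpha=0.5, pi=0.3) for k in range(5)]
```

The extra mass at zero is what makes this family a reasonable choice for vote counts, where many reviews receive no votes at all. In a regression setting, `mu` and `pi` would each depend on the review characteristics rather than being constants.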
[Objective] This paper proposes a personalized product recommendation model based on tags in the social e-commerce environment. [Methods] First, we calculated users’ interests and preferences from tagging frequency and time. Then, we constructed a product ontology of the commercial community based on the tag features and search conditions of the e-commerce website. Third, we used the ontology to standardize tag semantics and to classify goods. Fourth, we found clusters containing user preferences, and calculated the similarity between the tags of goods and the user preferences in each cluster. Finally, we identified the goods that a specific user had not tagged but would prefer. [Results] We examined the model with information on 200 randomly selected active users of popular items from the FanDongXi website. [Limitations] We used only the frequency and time factors of users’ tags to calculate their interests and preferences. [Conclusions] The proposed method performs better than recommendation methods based on collaborative filtering.
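The first and fourth steps of the Methods (interest from tagging frequency and time, then tag similarity) can be sketched as follows. The abstract does not give the weighting formula, so the exponential time decay, the `half_life_days` parameter, and the cosine similarity below are assumptions chosen for illustration, not the authors' method.

```python
import math
from collections import defaultdict


def tag_interest(events, now, half_life_days=30.0):
    """Compute normalized interest weights from (tag, day) tagging events.

    Each use of a tag adds to its weight (frequency), discounted by how long
    ago it happened (time). The decay form is a hypothetical choice.
    """
    decay = math.log(2) / half_life_days
    weights = defaultdict(float)
    for tag, day in events:
        weights[tag] += math.exp(-decay * (now - day))
    total = sum(weights.values())
    return {tag: w / total for tag, w in weights.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse tag-weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical user: tagged "dress" twice today and "bag" once 30 days ago.
prefs = tag_interest([("dress", 100), ("dress", 100), ("bag", 70)], now=100)
score = cosine(prefs, {"dress": 1.0, "skirt": 1.0})  # similarity to an untagged item
```

Goods the user has not tagged can then be ranked by `score`, which corresponds to the final step of the Methods.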
[Objective] This article tries to objectively evaluate the influence of China’s webcast platforms with the help of link analysis. [Methods] First, we used the Google search engine and Alexa.com to collect link data for 20 popular webcast platforms in China. Then, we examined their influence with a modified grey correlation analysis method. [Results] We obtained a ranking of the 20 webcast platforms and analyzed their characteristics. [Limitations] We could not obtain comprehensive data from the webcast platforms, and the sample size was limited. [Conclusions] The overall level of current webcast platforms is not high. This article proposes strategies to increase the influence of webcast platforms.
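For readers unfamiliar with grey correlation (grey relational) analysis, the sketch below shows the standard, unmodified procedure; the modification used in the study is not described in the abstract and is not reproduced here. It assumes every indicator is benefit-type (larger is better), uses the conventional distinguishing coefficient `rho = 0.5`, and runs on made-up numbers.

```python
def grey_relational_grades(rows, rho=0.5):
    """Return one grey relational grade per row (higher means closer to the ideal).

    rows: one list of benefit-type indicator values per platform.
    """
    cols = list(zip(*rows))
    # Min-max normalize each indicator to [0, 1]; a constant column maps to 1.
    norm_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        norm_cols.append([1.0 if hi == lo else (x - lo) / (hi - lo) for x in col])
    norm_rows = list(zip(*norm_cols))
    # The reference sequence is an ideal platform scoring 1.0 on every indicator.
    deltas = [[1.0 - x for x in row] for row in norm_rows]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        return [1.0] * len(rows)
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Hypothetical link indicators (e.g. inbound links, indexed pages) for three platforms.
grades = grey_relational_grades([[10, 100], [5, 50], [0, 0]])
ranking = sorted(range(len(grades)), key=lambda i: grades[i], reverse=True)
```

Cost-type indicators, such as a traffic rank where a smaller number is better, would need to be inverted before normalization.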
[Objective] This study proposes and examines a new method to identify communities in the collaboration networks of scientific researchers. [Methods] First, we retrieved the needed data from information science journal articles published from 2012 to 2016. Then, we used Automatic Relevance Determination with the Bayesian Symmetric Non-negative Matrix Factorization method to find the target communities. Finally, we compared the performance of our method with that of existing ones. [Results] The proposed method achieved better results than the existing methods. [Limitations] We did not optimize our data with researcher identifications. [Conclusions] The proposed method can effectively find communities in the scientific collaboration network.
[Objective] This paper aims to remove unrelated information from official Weibo (micro-blog) profiles and then retrieve the posts on official events. [Methods] First, we trained a word2vec model on the official Weibo datasets. Then, we proposed a burst word detection method for official micro-blogs based on the influence of Weibo posts, the base weight, and the related official profiles. Third, we calculated the similarity of blog posts to the burst words, and used a hierarchical clustering algorithm to select burst words for the target events. [Results] The proposed algorithm had better precision (63.5%), recall (85.5%), and F value (0.73) than the traditional TF-IDF and TextRank algorithms. [Limitations] The official profiles did not have enough historical data on the events. [Conclusions] Burst words help detect official events effectively from official Weibo profiles.
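The reported F value follows from the reported precision and recall if it is taken to be the balanced F1 score, which is an assumption, since the abstract does not state the weighting. A minimal check:

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision 63.5% and recall 85.5%, as reported in the Results.
f1 = f_measure(0.635, 0.855)
print(round(f1, 2))  # prints 0.73
```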
[Objective] This paper tries to identify the opinion leaders on Weibo and examines their roles in information dissemination. [Methods] We adopted a two-step clustering method to identify opinion leaders in the “illegal vaccine” event. Then, we created a network matrix for these opinion leaders based on their relationships. Finally, we analyzed the sentiments of Weibo users to evaluate the role of the opinion leaders’ network. [Results] Overall, users’ sentiment was negative. The opinion leaders’ network had a significant impact on the sentiments of average users. [Limitations] We examined our method with only one event. [Conclusions] Celebrities and opinion leaders play an important role in swaying public opinion online.
[Objective] This paper analyzes online reviews to identify patterns in their topics and sentiments. [Methods] First, we obtained the sentiment of the reviews with the SSTM model. Then, we proposed a DSTM model based on the document, the document sentiment distribution, and words. Finally, we estimated the sentiment-topic distribution and the keywords. [Results] We modeled the review datasets by time slice and found the changing trends of contents and sentiments over time. [Limitations] The proposed model did not include the relationships among different subjects, which might introduce errors. [Conclusions] The DSTM model, which integrates external time features, can effectively analyze the evolution of online review topics.
[Objective] This paper aims to reduce noise when extracting product features from customer comments. [Methods] First, we used the TF-IDF and variance selection methods to extract the needed data. Then, we set thresholds to filter the extracted words and obtain the product feature set. Third, we generated frequent item sets with the Apriori algorithm. Finally, we defined various thresholds to obtain the optimal sets, which automatically extracted product features from user comments. [Results] We examined the effectiveness of the proposed method with comment texts on mobile phone products. Comparing the automatically extracted features with the manually identified ones, we found that the precision (P) was 72.44%, the recall (R) was 77.59%, and the combined F value reached 74.93%. [Limitations] The precision needs to be improved, and the manually identified terms might contain some human errors. [Conclusions] The Apriori algorithm can help extract product features effectively.
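The frequent item set step in the Methods can be sketched with a small, textbook version of the Apriori algorithm. This is an illustrative implementation, not the authors' code: the sample comments, the candidate words, and the `min_support` threshold are hypothetical, and the TF-IDF and variance filtering steps are assumed to have already produced the word sets.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical candidate feature words, one set per comment.
comments = [
    ["screen", "battery"],
    ["screen", "battery", "camera"],
    ["screen", "camera"],
    ["battery"],
]
frequent_sets = apriori(comments, min_support=2)
```

Raising `min_support` trades recall for precision, which is one way the thresholds mentioned in the Methods could be tuned.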
[Objective] This paper aims to expand the CSpace Institutional Repository’s support for audio and video. [Context] Ever-growing audio and video resources require us to expand the Institutional Repository’s supporting ability, which helps us retrieve knowledge and increase the academic value of these resources more effectively. [Methods] First, we analyzed the needs of users and the development of audio and video supporting services in Institutional Repositories at home and abroad. Then, we constructed an extension framework for the supporting functions. Finally, we chose the key technologies and methods to build the experimental platform, and explored its feasibility in CSpace. [Results] The proposed method helped us convert the formats of audio and video clips, analyze video scenes, and develop a video player with scene navigation functions. [Conclusions] The transcoding technology for audio and video works effectively. However, other supporting functions could be further improved. The format conversion technology for audio and video in CSpace could expand its supporting services.