[Objective] This study explores the structure of the user interest hierarchy and its evolution patterns, aiming to improve the quality of personalized information services. [Methods] First, we used the LDA topic model to retrieve the topics of users’ tags. Then, we calculated each tag’s degree of interest and combined it with the tag’s topics to identify user interests. Finally, we built a “core-edge” structure for user interests based on the interest network to analyze how their hierarchy evolves. [Results] The “core-edge” structure of user interests gradually converged and stabilized as the interest domains became determined. The evolution of the user interest hierarchy over time followed three main patterns: interests remaining in the core layer, interests fading from the core layer to the edge layer, and interests promoted from the edge layer to the core layer. [Limitations] More research is needed to predict user interests at future time nodes. [Conclusions] The proposed method accurately captures users’ dynamic interests and the evolution patterns of their hierarchy, which helps optimize personalized information services.
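The abstract does not specify how a tag’s degree of interest is computed; a minimal sketch, assuming a hypothetical frequency-plus-recency weighting (the `half_life` decay is an illustrative assumption, not the authors’ formula):

```python
import math
from collections import Counter

def interest_degree(tag_events, current_time, half_life=30.0):
    """Illustrative interest degree per tag: usage frequency
    weighted by exponential time decay, so tags used often and
    recently score higher.

    tag_events: list of (tag, timestamp) pairs, timestamps in days.
    """
    scores = Counter()
    for tag, t in tag_events:
        # older usage contributes less; half_life controls the decay
        scores[tag] += math.exp(-math.log(2) * (current_time - t) / half_life)
    return dict(scores)

events = [("python", 100), ("python", 95), ("jazz", 10)]
scores = interest_degree(events, current_time=100)
# a tag used twice and recently outranks one used once, long ago
assert scores["python"] > scores["jazz"]
```

Ranking tags by such a score is one way to separate a core layer (high-score tags) from an edge layer (low-score tags).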
[Objective] This paper builds a spatial-textual sentiment analysis model based on multi-dimensional WaveCluster, aiming to analyze text sentiment and spatial position effectively. [Methods] First, we integrated several datasets from Yelp to build a spatial-textual database. Then, we used lexicon-based sentiment analysis to generate feature vectors. Third, we proposed a new method using a Hybrid model, a Textual-Spatial model, and a multi-dimensional clustering model to analyze the data. [Results] We found that multi-dimensional clustering based on db2 or bior2.2 wavelets recognized clusters more accurately than DBSCAN and K-means in spatial-textual feature mining. It also achieved the highest speed on datasets of 100 thousand to 10 million records. [Limitations] We used a unigram model for sentiment analysis, which cannot analyze sentiment at the sentence level. [Conclusions] The proposed Textual-Spatial model could effectively uncover the distribution of sentiment tendencies in spatial-textual data. The Hybrid model provides a new approach for spatial-textual recommender systems to calculate sentiment similarity and spatial proximity simultaneously.
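The lexicon-based unigram step can be sketched as follows; the tiny `POSITIVE`/`NEGATIVE` word sets are hypothetical stand-ins for a real sentiment lexicon:

```python
# illustrative mini-lexicons (a real system would load full lexicons)
POSITIVE = {"great", "delicious", "friendly"}
NEGATIVE = {"slow", "rude", "bland"}

def sentiment_feature(review_text):
    """Unigram lexicon scoring: counts positive and negative hits
    and derives a polarity in [-1, 1]. Because it looks at single
    words only, it cannot handle negation or sentence structure --
    the limitation the abstract notes."""
    tokens = review_text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    polarity = (pos - neg) / total if total else 0.0
    return pos, neg, polarity

assert sentiment_feature("great food but rude staff") == (1, 1, 0.0)
```

The resulting polarity can be appended to each record’s coordinates to form the spatial-textual feature vector that the clustering step consumes.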
[Objective] This paper aims to compare the impacts of Chinese word segmenters on the degree of matching between a corpus and sentiment lexicons. [Methods] We used six Chinese segmenters to process a self-built corpus of book reviews, and filtered the results with four sentiment lexicons. Then, we calculated the coverage and matching rates of the corpus against each sentiment lexicon, the negative word list, and the degree word list. Finally, we computed the ratios of the neutral corpus and of low-frequency words relative to the lexicons. [Results] For different sentiment lexicons, the segmenters yielded varying results in corpus-lexicon matching, the proportion of low-frequency words in the lexicons, and the proportion of the neutral portion of the corpus. [Limitations] The corpus size needs to be expanded, and sentence-level and rule-based testing need to be added. [Conclusions] The word segmenter has significant impacts on the matching between the corpus and sentiment lexicons.
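The coverage and matching calculations can be illustrated with a small sketch (the exact definitions used in the paper are not given in the abstract; the two ratios below are one plausible reading):

```python
def lexicon_coverage(corpus_tokens, lexicon):
    """Two illustrative ratios for a segmented corpus vs. a lexicon:
    - coverage: share of lexicon entries that appear in the corpus;
    - match_rate: share of corpus tokens matched by the lexicon.
    Different segmenters produce different token sequences, which
    shifts both ratios."""
    corpus_vocab = set(corpus_tokens)
    matched = corpus_vocab & set(lexicon)
    coverage = len(matched) / len(lexicon)
    match_rate = sum(t in lexicon for t in corpus_tokens) / len(corpus_tokens)
    return coverage, match_rate

tokens = ["好", "书", "喜欢", "好"]        # one segmenter's output
lex = {"好", "喜欢", "讨厌", "差"}          # a tiny sentiment lexicon
cov, rate = lexicon_coverage(tokens, lex)
assert cov == 0.5    # 2 of 4 lexicon entries appear in the corpus
assert rate == 0.75  # 3 of 4 corpus tokens hit the lexicon
```

Running the same computation over the outputs of each segmenter makes the segmenter-to-segmenter differences directly comparable.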
[Objective] This paper tries to automatically identify commodity names from product descriptions, aiming to classify items sold on Taobao. [Methods] First, we retrieved a large number of transaction records from Taobao. Then, we built an e-commerce commodity description dataset and labeled it manually. Third, we created a supervised machine learning algorithm based on the XGBoost model to extract names from product descriptions. [Results] The precision and recall of the algorithm were 85% and 87%, respectively, for 816 different items from 20,059 records. [Limitations] The categories of commodities in the test corpus need to be expanded. [Conclusions] Machine learning is an effective way to identify product names.
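The reported precision and recall correspond to the standard extraction metrics; a minimal sketch of how they are computed (the record IDs and item names below are made up for illustration):

```python
def precision_recall(predicted, gold):
    """Micro precision/recall for extracted commodity names.

    predicted, gold: sets of (record_id, name) pairs.
    precision = correct extractions / all extractions,
    recall    = correct extractions / all gold names.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# hypothetical gold labels and model outputs
gold = {(1, "手机壳"), (2, "数据线"), (3, "耳机")}
pred = {(1, "手机壳"), (2, "充电器"), (3, "耳机")}
p, r = precision_recall(pred, gold)
assert (p, r) == (2 / 3, 2 / 3)
```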
[Objective] This paper proposes a method to extract product characteristics from user comments, aiming to address the issues facing hedonic price research. [Methods] First, we extracted keywords from user comments. Then, we retrieved the product characteristics favored by consumers through keyword clustering, and established the hedonic price model. Finally, we examined the proposed model with the sales of new properties in Guangzhou. [Results] We found seven real estate characteristics with significant consumer preferences in the user comments. The goodness of fit of the model reached 0.760, the DW statistic was 2.013, and the correlation coefficient between user preferences and real estate prices was 0.989. [Limitations] The experimental data was collected from a real estate website only. [Conclusions] The new model based on user comments could accurately evaluate product prices. It also helps us effectively avoid multicollinearity among independent variables and further explore business and consumer behaviors.
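The fit and DW statistics the abstract reports come from a regression of price on the extracted characteristics; a minimal one-variable sketch with entirely hypothetical data (the real model uses the seven clustered characteristics):

```python
def ols_fit(xs, ys):
    """Simple one-variable OLS; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def durbin_watson(residuals):
    """DW statistic over ordered residuals; values near 2 indicate
    no first-order autocorrelation (the paper reports 2.013)."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    return num / sum(e ** 2 for e in residuals)

area = [70, 90, 110, 130]      # hypothetical characteristic (m^2)
price = [2.1, 2.8, 3.3, 4.0]   # hypothetical price (million yuan)
a, b = ols_fit(area, price)
resid = [y - (a + b * x) for x, y in zip(area, price)]
dw = durbin_watson(resid)
assert b > 0        # larger area -> higher price in this toy data
assert 0 < dw < 4   # DW always lies in (0, 4)
```

Using clustered comment keywords rather than raw keywords as regressors is what lets the model sidestep multicollinearity among overlapping features.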
[Objective] This paper proposes a new model to extract topic keywords, aiming to detect low-frequency words of high relevance. [Methods] First, we designed a topic keyword extraction method that integrates topic embedding and network structure analysis techniques. Then, we extracted a preliminary set of topic keywords with the LDA model, and trained word vectors with the Word2Vec model. Third, we built a network based on word vector similarity and identified the final topic keywords through network structure analysis. [Results] The new method improved the average similarity between topic keywords by 14.75%. It extracted low-frequency keywords of high topic relevance more effectively than the LDA model. [Limitations] The sample size needs to be expanded, and the segmentation process requires more manual adjustments. More research is needed to quantitatively analyze the topic keywords. [Conclusions] Our method improves automatic summarization and public opinion analysis.
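The word-vector similarity network can be sketched as below; the two-dimensional vectors and the 0.9 threshold are illustrative assumptions, and degree is used here as a simple stand-in for whatever network-structure measure the paper applies:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def keyword_network(vectors, threshold=0.8):
    """Connect words whose embedding similarity exceeds threshold;
    return each word's degree. Well-connected words are strong
    keyword candidates even if their raw frequency is low."""
    words = list(vectors)
    degree = {w: 0 for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cosine(vectors[w1], vectors[w2]) >= threshold:
                degree[w1] += 1
                degree[w2] += 1
    return degree

vecs = {"battery": [1.0, 0.1], "cell": [0.9, 0.2], "banana": [0.0, 1.0]}
deg = keyword_network(vecs, threshold=0.9)
assert deg["battery"] == 1 and deg["cell"] == 1 and deg["banana"] == 0
```

This is how a rare but on-topic word can be promoted: its embedding sits close to the topic’s core words, so it gains edges that raw frequency counts would never give it.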
[Objective] This paper tries to identify the trends of topic semantic evolution at different development stages. [Methods] We combined the LDA model and life cycle theory to propose an analysis method. It addressed three technical issues: filtering topics, calculating topic semantic similarity, and identifying topic semantic evolution patterns, using lithium-ion battery techniques as the case. [Results] We found that topic inheritance ran through the whole process of discipline development. Topic splitting started at the growth stage and reached 6 instances at the fast development stage. Topic merging began at the development stage and reached 5 instances at the fast development stage. [Limitations] More research is needed to determine whether the overall topics can cover all phases of the developments. The knowledge map of topic semantic evolution also needs to be created automatically. [Conclusions] The proposed method could identify key semantic evolution patterns such as inheritance, splitting, and merging across the development stages. It provides valuable decision-making information for knowledge innovation.
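Linking topics across adjacent stages by semantic similarity can be sketched as follows; cosine similarity over topic-word distributions and the 0.7 threshold are assumptions, since the abstract does not state the paper’s exact measure:

```python
import math

def topic_similarity(p, q):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p))
                  * math.sqrt(sum(b * b for b in q)))

def link_topics(prev_topics, next_topics, threshold=0.7):
    """Map each earlier topic to the later topics it evolves into.
    One outgoing link reads as inheritance, several as splitting;
    several earlier topics linking to one later topic reads as merging."""
    links = {}
    for i, p in enumerate(prev_topics):
        links[i] = [j for j, q in enumerate(next_topics)
                    if topic_similarity(p, q) >= threshold]
    return links

prev = [[0.6, 0.3, 0.1]]                       # one topic at stage t
nxt = [[0.58, 0.32, 0.10], [0.1, 0.1, 0.8]]    # two topics at stage t+1
assert link_topics(prev, nxt)[0] == [0]        # inheritance, not splitting
```

Counting these link patterns per stage yields the splitting and merging counts the study reports.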
[Objective] This paper tries to build domain ontologies for intelligent applications, aiming to enhance the capabilities of domain knowledge representation and application development. [Methods] We proposed an application-driven circulation method to model cross-domain knowledge based on the demands of intelligent applications. It has a “requirement + construction + evaluation” structure, so that requirements play the leading role in ontology construction. We took the field of anti-telephone-fraud as an example, and constructed an anti-fraud ontology meeting the requirements of intelligent applications. [Results] Our anti-fraud domain ontology represented a wide range of cross-domain knowledge and effectively supported intelligent anti-fraud applications based on the semantics of fraudulent calls. [Limitations] More research is needed to examine the requirements of intelligent applications. [Conclusions] The proposed method promotes more research in domain ontology construction and anti-fraud methods.
[Objective] This paper analyzes the differences in the importance of database items, aiming to address the redundant and worthless rules produced by traditional association mining algorithms. [Methods] On sequences with temporal constraints, we extended non-weighted association rules with the effective frequency length and weighting methods. Then, we used the sliding window technique to study rare weighted association rules on time series. [Results] The accuracy of predictions made by the proposed method increased from 62% to 69%. [Limitations] The mining algorithm took a long time to extract the needed rules, due to the sliding windows and the large number of rules generated. [Conclusions] Weighted time series association rules improve the accuracy of recommendation and provide new directions for research on association rule methods.
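One way to read the combination of item weights and sliding windows is a weighted support measure; this sketch is an assumption about the mechanism, not the paper’s exact definition:

```python
def weighted_support(windows, itemset, weights):
    """Illustrative weighted support over sliding windows of a
    time series: each window containing the whole itemset
    contributes the itemset's average item weight, so rules over
    important items can surface even when they are rare.

    windows: list of transactions (sets of items);
    weights: item -> importance."""
    w = sum(weights[i] for i in itemset) / len(itemset)
    hits = sum(itemset <= window for window in windows)
    return w * hits / len(windows)

windows = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b"}]
weights = {"a": 1.0, "b": 0.5, "c": 0.2}
s = weighted_support(windows, {"a", "b"}, weights)
assert s == 0.75 * 2 / 4   # avg weight 0.75, itemset in 2 of 4 windows
```

The cost the limitation mentions is visible here: every new window position re-scans candidate itemsets, which multiplies runtime as the rule set grows.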
[Objective] This paper proposes a new algorithm for influence maximization based on overlapping communities, called the IM-BOC algorithm, aiming to address the low efficiency of the greedy algorithm. [Methods] The method first selects a candidate seed set by combining propagation degree and k-core, then uses the CELF algorithm to determine the optimal seed set, improving both efficiency and accuracy. [Results] The experiments show that the running time of our algorithm improved by about 89% on the Amazon dataset. [Limitations] Our IM-BOC algorithm allocates the number of candidate seeds only according to the number of community nodes, which has insufficient theoretical support. [Conclusions] The IM-BOC algorithm is applicable to large-scale networks while preserving the influence spread.
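The CELF step relies on lazy evaluation of marginal gains; a minimal sketch with a toy coverage-based spread function standing in for real influence simulation:

```python
import heapq

def celf(nodes, spread, k):
    """CELF lazy-evaluation greedy: pick k seeds maximizing
    spread(seed_set). spread must be monotone and submodular;
    submodularity guarantees a cached gain only shrinks, so a node
    whose re-evaluated gain still tops the heap can be accepted
    without re-evaluating everyone else."""
    # initial marginal gains (negated: heapq is a min-heap)
    heap = [(-spread({v}), v) for v in nodes]
    heapq.heapify(heap)
    seeds, current = set(), 0.0
    while len(seeds) < k and heap:
        gain, v = heapq.heappop(heap)
        fresh = spread(seeds | {v}) - current   # lazy re-evaluation
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, v))   # stale: push back
        else:
            seeds.add(v)
            current += fresh
    return seeds

# toy spread: size of the seeds' closed neighborhood in a tiny graph
neighbors = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
cover = lambda s: len(s | set().union(*(neighbors[v] for v in s))) if s else 0
assert 1 in celf(list(neighbors), cover, k=1)   # a top-coverage node wins
```

In IM-BOC the heap would start from the pre-filtered candidate set (high propagation degree and k-core), which is what cuts the running time relative to plain greedy over all nodes.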
[Objective] This paper aims to automatically grade the reading difficulty of textual documents. [Methods] We used a machine learning method based on multiple textual features to decide difficulty levels automatically. The features, which include word frequency, structures, topics, and depth, describe the textual contents from different perspectives. [Results] We evaluated our method on reading comprehension texts from high-school English exams, and achieved an accuracy of 0.88. Our result is better than those of traditional difficulty classification methods. [Limitations] Due to the high cost of manual annotation, the existing datasets cannot be used to improve our method. [Conclusions] The proposed method increases the effectiveness of machine learning based data analysis.
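Surface-level difficulty features of the kind the abstract describes can be sketched as below; these three particular measures are common illustrative choices, not necessarily the paper’s exact feature set:

```python
import re

def difficulty_features(text):
    """Illustrative surface features for difficulty grading:
    average word length, average sentence length (in words),
    and type-token ratio (lexical diversity)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_len = sum(map(len, words)) / len(words)
    avg_sent_len = len(words) / len(sentences)
    ttr = len({w.lower() for w in words}) / len(words)
    return avg_word_len, avg_sent_len, ttr

easy = difficulty_features("The cat sat. The dog ran.")
hard = difficulty_features(
    "Notwithstanding considerable methodological heterogeneity, "
    "contemporary psycholinguistics emphasizes incremental interpretation.")
assert easy[0] < hard[0]   # the harder text uses longer words
```

A classifier trained on such feature vectors against annotated difficulty labels is the kind of model the accuracy of 0.88 refers to.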
[Objective] This paper tries to address the issues facing sci-tech big data, such as dispersed sources, low quality, and poor content. [Methods] We used value-added computing methods, such as data cleansing, entity alignment, entity field fusion, and conflict detection, to develop tools for the enrichment of sci-tech big data. [Results] The developed tools achieved entity data alignment at the levels of personnel, organizations, conferences, journals, and the relationships among them. The contents of the entity fields were increased by 5 to 10 times, and the entity analysis dimensions were increased by 2 to 3 times. [Limitations] The timeliness and standardization of the value-added data need to be optimized and improved based on service needs. [Conclusions] The proposed methods and tools enhance knowledge discovery over sci-tech big data and intelligent information analysis systems.
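A minimal sketch of the entity alignment and field fusion idea, assuming a simple normalized-name key (real alignment across sources would need richer matching than this):

```python
def normalize(name):
    """Canonical key for alignment: lowercase, drop punctuation
    and whitespace, keep only alphanumeric characters."""
    return "".join(c for c in name.lower() if c.isalnum())

def align_entities(source_a, source_b):
    """Align records from two sources whose normalized names match,
    merging their fields so each aligned entity is enriched with
    attributes from both sources."""
    index = {normalize(r["name"]): r for r in source_a}
    merged = []
    for r in source_b:
        key = normalize(r["name"])
        if key in index:
            merged.append({**index[key], **r})   # field fusion
    return merged

# hypothetical records for the same conference in two sources
a = [{"name": "ACM SIGIR", "since": 1978}]
b = [{"name": "acm-sigir", "field": "IR"}]
out = align_entities(a, b)
assert out == [{"name": "acm-sigir", "since": 1978, "field": "IR"}]
```

Conflict detection would sit on top of this: where the two sources disagree on a shared field, the merge must flag rather than silently overwrite, which the `{**a, **b}` shortcut here does not yet do.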
[Objective] This paper studies an annotation method for Chinese electronic medical records, aiming to improve the processing of massive clinical texts and clinical knowledge discovery. [Methods] First, we proposed an annotation method for Chinese e-medical records, and constructed a visual interactive platform. Then, based on the word and phrase features of these records, we identified medical named entities with natural language processing and machine learning approaches. [Results] A total of 700 annotated records were obtained, and the overall F value of the Pipeline-based annotation method reached 0.8772, which was 32.9% higher than that based on the original medical records. [Limitations] Since electronic medical records contain sensitive private information, this study was conducted on an open dataset, and the corpus size was limited. [Conclusions] The Chinese electronic medical record annotation method and platform constructed in this study could effectively process clinical texts and support the association of medical knowledge.
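A dictionary-based pre-annotation pass is one common way to bootstrap such a platform; this forward-maximum-matching sketch (with a made-up three-term dictionary) illustrates the idea, not the paper’s actual pipeline model:

```python
def dict_ner(text, term_dict):
    """Forward maximum matching against a medical term dictionary:
    scan left to right, always preferring the longest dictionary
    hit starting at the current position."""
    max_len = max(map(len, term_dict))
    entities, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            span = text[i:i + length]
            if span in term_dict:
                entities.append((i, span))
                i += length
                break
        else:
            i += 1   # no term starts here; advance one character
    return entities

terms = {"高血压", "糖尿病", "血压"}
ents = dict_ner("患者有高血压和糖尿病史", terms)
assert [e[1] for e in ents] == ["高血压", "糖尿病"]
```

Pre-annotations like these can be loaded into the visual platform for human correction, and the corrected records then train the statistical named-entity model.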