Title: Study of Text Sentiment Analysis Based on IDSSL
Author(s): Wang Gang; Li Ningning; Yang Shanlin
Abstract: With the growing popularity of social media, large volumes of user-generated text are posted on the Internet. These texts carry users' viewpoints, opinions, and attitudes, which are valuable to other Internet users and have attracted increasing research attention, and many supervised text sentiment analysis methods have been proposed to exploit such data. In practice, however, sentiment analysis involves abundant unlabeled samples, and how to learn from a large amount of unlabeled data together with a small amount of labeled data has become one of the pressing problems in the field. To address this, the paper proposes an improved disagreement-based semi-supervised learning method for text sentiment analysis, IDSSL (Improved Disagreement-based Semi-Supervised Learning). We first analyze disagreement-based semi-supervised learning theoretically and find that a multiple-classifier variant outperforms the original disagreement-based approach, that diversity among the classifiers is the key to its success, and that the Random Subspace method yields more diverse classifiers in the sentiment analysis setting. IDSSL therefore combines a multiple-classifier scheme with the Random Subspace method and consists of three steps: (1) multiple initial classifiers are built on random feature subspaces; (2) the classifiers are refined on unlabeled instances following a "majority helps minority" rule; (3) the base classifiers are combined by majority vote. Experiments were carried out on classic sentiment analysis datasets, using average accuracy, the established measure in this area, as the performance criterion. IDSSL was compared with several semi-supervised learning methods, namely Self-training, Co-training, and Tri-training, all built on SVM, the base learner most commonly used in sentiment analysis. To minimize the influence of variability in the training set, 10-fold cross-validation was repeated five times on each dataset. The experimental results demonstrate the effectiveness of the proposed method, which outperforms the other semi-supervised learning methods. In addition, we discuss the results of the different semi-supervised learning methods, the influence of the label rate on these methods, and the influence of the add-number parameter on IDSSL.
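For concreteness, the three steps of IDSSL can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes dense feature matrices (e.g., a bag-of-words representation) and binary sentiment labels coded as 0/1, uses scikit-learn's LinearSVC as the SVM base learner, and the class name IDSSL, the parameters n_classifiers, subspace_ratio, add_number, and max_rounds, and the agreement-based selection rule are all hypothetical stand-ins for details the abstract does not specify.

```python
import numpy as np
from sklearn.svm import LinearSVC


class IDSSL:
    """Minimal sketch of the three IDSSL steps summarised in the abstract."""

    def __init__(self, n_classifiers=5, subspace_ratio=0.5, add_number=10,
                 max_rounds=20, random_state=0):
        self.n_classifiers = n_classifiers    # number of Random Subspace members
        self.subspace_ratio = subspace_ratio  # fraction of features per subspace
        self.add_number = add_number          # pseudo-labelled examples added per member per round
        self.max_rounds = max_rounds
        self.rng = np.random.RandomState(random_state)

    def fit(self, X_labeled, y_labeled, X_unlabeled):
        n_features = X_labeled.shape[1]
        k = max(1, int(self.subspace_ratio * n_features))

        # Step 1: build multiple initial SVM classifiers on random feature subspaces.
        self.subspaces_ = [self.rng.choice(n_features, size=k, replace=False)
                           for _ in range(self.n_classifiers)]
        train_sets = [(X_labeled.copy(), y_labeled.copy())
                      for _ in range(self.n_classifiers)]
        self.clfs_ = [LinearSVC().fit(X[:, s], y)
                      for (X, y), s in zip(train_sets, self.subspaces_)]

        # Step 2: "majority helps minority" -- for each member, the remaining
        # classifiers jointly pseudo-label the unlabeled pool, and the examples
        # they agree on most strongly are added to that member's training set.
        pool = np.asarray(X_unlabeled).copy()
        for _ in range(self.max_rounds):
            if len(pool) == 0:
                break
            used = set()
            for i in range(self.n_classifiers):
                others = [j for j in range(self.n_classifiers) if j != i]
                votes = np.array([self.clfs_[j].predict(pool[:, self.subspaces_[j]])
                                  for j in others])
                vote_mean = votes.mean(axis=0)
                majority = (vote_mean > 0.5).astype(int)   # assumes labels in {0, 1}
                agreement = np.abs(vote_mean - 0.5) * 2.0  # 1.0 = unanimous vote
                chosen = np.argsort(-agreement)[:self.add_number]
                used.update(int(c) for c in chosen)
                X_i, y_i = train_sets[i]
                train_sets[i] = (np.vstack([X_i, pool[chosen]]),
                                 np.concatenate([y_i, majority[chosen]]))
                X_i, y_i = train_sets[i]
                self.clfs_[i] = LinearSVC().fit(X_i[:, self.subspaces_[i]], y_i)
            keep = np.ones(len(pool), dtype=bool)
            keep[list(used)] = False   # remove pseudo-labelled examples from the pool
            pool = pool[keep]
        return self

    def predict(self, X):
        # Step 3: combine the members by majority vote.
        X = np.asarray(X)
        votes = np.array([clf.predict(X[:, s])
                          for clf, s in zip(self.clfs_, self.subspaces_)])
        return (votes.mean(axis=0) > 0.5).astype(int)
```

The evaluation protocol summarised above (10-fold cross-validation repeated five times, scored by average accuracy) might look like the following for a plain SVM baseline; the synthetic data generated here is only a stand-in for the paper's sentiment datasets, and the label-rate splits used for the semi-supervised comparison are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for a bag-of-words sentiment dataset.
X, y = make_classification(n_samples=500, n_features=200, random_state=0)

# 10-fold cross-validation repeated five times, scored by accuracy.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="accuracy")
print("average accuracy:", scores.mean())
```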
Keywords: Text Sentiment Analysis; Semi-supervised Learning; Multiple Classifiers; Random Subspace
Funding: National Natural Science Foundation of China; National Natural Science Foundation of China; Specialized Research Fund for the Doctoral Program of Higher Education; China Postdoctoral Science Foundation; China Postdoctoral Science Foundation
Issue: No. 3, 2018