To improve the accuracy of text classification, Schapire and Singer tried combining, via Boosting, simple one-split decision trees (stumps) whose split is determined by whether a particular term appears in the document to be classified. Such base learners are clearly too weak: the resulting Boosting classifier falls short in accuracy and requires a very large number of iterations, making it inefficient. To address this problem, a method is proposed in which the base learner's split is determined by all terms in the document, thereby strengthening the base learner's classification ability. It uses, as the splitting criterion, whether the similarity between the VSM-represented document and a class representative vector exceeds a given threshold. Meanwhile, to speed up convergence, the weights that Boosting assigns to the training samples are dynamically incorporated into the computation of the class representative vectors. Experimental results show that this method improves the performance (both accuracy and efficiency) of text classification with Boosting-combined stump classifiers, and the larger the problem, the more pronounced the improvement.
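The following is a minimal sketch of the similarity-threshold stump described above, under several assumptions not spelled out in the abstract: documents are TF-IDF (VSM) vectors in a NumPy matrix `X`, labels `y` are in {-1, +1}, `w` holds the current Boosting sample weights, cosine similarity is used, and the threshold is chosen to minimize weighted error. All names (`SimilarityStump`, `fit`, `predict`, `theta`) are illustrative, not taken from the paper.

```python
# Sketch of a similarity-threshold stump whose class representative vector is the
# Boosting-weight-weighted centroid of positive-class documents (assumed details).
import numpy as np

class SimilarityStump:
    """Weak learner: predict +1 if cos(x, c) >= theta, else -1."""

    def fit(self, X, y, w):
        pos = y > 0
        # Class representative vector: weighted centroid of positive documents,
        # using the weights Boosting currently assigns to the training samples.
        self.c = (w[pos, None] * X[pos]).sum(axis=0)
        self.c /= np.linalg.norm(self.c) + 1e-12

        sims = self._cosine(X)
        # Choose the threshold that minimizes the weighted training error
        # over the observed similarity values.
        candidates = np.unique(sims)
        errors = [np.sum(w * (np.where(sims >= t, 1, -1) != y)) for t in candidates]
        self.theta = candidates[int(np.argmin(errors))]
        return self

    def predict(self, X):
        return np.where(self._cosine(X) >= self.theta, 1, -1)

    def _cosine(self, X):
        norms = np.linalg.norm(X, axis=1) + 1e-12
        return (X @ self.c) / norms
```

Such a stump can then serve as the weak learner inside a standard AdaBoost loop; because `fit` takes the current sample weights, the class representative vector is recomputed each round as those weights change, which is the dynamic weighting step the abstract credits with faster convergence.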