An HMM-Based Text Classifier Less Sensitive to Document Management Problems

Adrián S. Vieira; Eva L. Iglesias; Lourdes B. Diz

doi:10.2174/1574893611666160617094720

Abstract

Background: The performance of text classification techniques is commonly affected by the characteristics and representation of the document corpus itself. Of the problems arising from the corpus, classifiers must deal with three major difficulties: feature selection, class imbalance, and the size of the training set.

Objective: This paper presents a novel content-based text classifier, called T-LHMM, that is less sensitive to the text representation and the size of the corpus, and more efficient in terms of running time, than other classification techniques.

Method: To demonstrate this, we present a set of experiments performed on well-known biomedical text corpora. We also compare our classifier with k-Nearest Neighbours (k-NN) and Support Vector Machine (SVM) models.

Results and Conclusion: The experimental and statistical results show that the proposed HMM-based text classifier is indeed less sensitive to class imbalance, corpus size and vocabulary than the other classifiers. In addition, it is more efficient in terms of running time than the k-NN and SVM techniques.

Keywords: Content-based text classification, class imbalance, feature selection, Hidden Markov Model.

Journal: Current Bioinformatics
Publisher Name: Bentham Science Publisher
Print ISSN: 1574-8936
Online ISSN: 2212-392X
DOI: https://dx.doi.org/10.2174/1574893611666160617094720
