Mahmood: Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

***

Abstract:

Text mining refers generally to the process of extracting interesting and
non-trivial knowledge from unstructured text data. It is an interdisciplinary
field that draws on information retrieval, data mining, machine learning,
statistics, and computational linguistics. Standard text mining and
information retrieval techniques for text documents usually rely on word
matching. An alternative approach to information retrieval is clustering, in
which document pre-processing is a critical step that has a large impact on
how successfully knowledge is extracted.

Document clustering is a technique used to group similar documents. In this
project we implement tf-idf and singular value decomposition as
dimensionality reduction techniques. We propose effective pre-processing and
dimensionality reduction techniques that help document clustering. Finally,
we select the one dimensionality reduction technique that performed best in
terms of both clustering quality and computational efficiency.
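
As a rough illustration of the two techniques named above, the following minimal sketch computes tf-idf weights for a toy corpus and then projects the documents onto the leading right singular vectors of the tf-idf matrix, approximated by power iteration. It is an assumption-laden sketch, not the implementation used in the project: the function names, the whitespace tokenizer, the unsmoothed `log(N / df)` idf variant, and the four example sentences are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the tf-idf weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({term for tokens in tokenized for term in tokens})
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / doc_freq[term]) for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        rows.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, rows


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_right_singular_vectors(rows, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of `rows` by power iteration."""
    rng = random.Random(seed)
    n_terms = len(rows[0])
    found = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n_terms)]
        usable = True
        for _ in range(iters):
            # One power-iteration step on A^T A: v <- A^T (A v).
            av = [dot(row, v) for row in rows]
            v = [sum(row[j] * a for row, a in zip(rows, av)) for j in range(n_terms)]
            # Deflation: remove components along directions already found.
            for u in found:
                proj = dot(v, u)
                v = [x - proj * y for x, y in zip(v, u)]
            norm = math.sqrt(dot(v, v))
            if norm < 1e-12:
                usable = False  # nothing left to capture in the matrix
                break
            v = [x / norm for x in v]
        if not usable:
            break
        found.append(v)
    return found


def reduce_documents(rows, k):
    """Project each tf-idf row onto the top-k singular directions."""
    directions = top_right_singular_vectors(rows, k)
    return [[dot(row, v) for v in directions] for row in rows]


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


# Toy corpus (hypothetical): two documents about cats, two about stocks.
docs = ["the cat purrs", "the cat sleeps", "the stock rises", "the stock falls"]
vocab, rows = tfidf_matrix(docs)
reduced = reduce_documents(rows, k=2)
```

In this toy corpus the word "the" occurs in every document, so its idf and therefore its tf-idf weight is zero, and the two-dimensional projection places the two cat documents close together and away from the two stock documents. A clustering algorithm would then operate on `reduced` instead of on the full term space.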

***
