Abstract—
Topic modeling on tweets is known to suffer from under-specificity and data sparsity due to the platform's character limit.
In earlier studies, researchers attempted to address this problem either by 1) tweet aggregation, where related tweets are combined into a single document, or 2) tweet expansion, where a tweet is augmented with related text from external sources.
The first approach loses the topic distribution of individual tweets, while in the second approach, finding relevant text from an external source for an arbitrary tweet is challenging for various reasons, such as differences in writing style, multilingual content, and informal text.
In contrast to adding context from external resources or combining related tweets into a pool, this study uses the internal vocabulary (hashtags) to counter under-specificity and sparsity in tweets.
Earlier studies have indicated that hashtags are an important feature for representing the underlying context of a tweet. Models such as Bi-directional Long Short-Term Memory (BiLSTM) networks and Convolutional Neural Networks (CNN) over distributed word representations have shown promising results in capturing semantic relationships between the words of a tweet.
Motivated by the above, this article proposes a unified framework for hashtag-based tweet expansion that exploits text-based and network-based representation learning methods such as BiLSTM, BERT, and Graph Convolutional Networks (GCN).
Tweets expanded with the proposed hashtag-based framework significantly improve topic modeling performance compared to un-expanded (raw) tweets and hashtag-pooling-based approaches on two real-world tweet datasets of differing nature.
Furthermore, this article also studies the significance of hashtags for topic modeling performance by experimenting with different combinations of word types, such as hashtags, keywords, and user mentions.
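To make the idea of hashtag-based expansion concrete, the following is a minimal illustrative sketch (not the article's learned framework, which uses BiLSTM/BERT/GCN representations): it simply splits each hashtag into its component words and appends them to the tweet text, adding context drawn from the tweet's own internal vocabulary. The function name and the splitting heuristic are assumptions for illustration only.

```python
import re

def expand_tweet(tweet: str) -> str:
    """Naive hashtag-based expansion (illustration only): split each
    hashtag like #ClimateChange into its component words and append
    them to the tweet, enriching it from its internal vocabulary."""
    hashtags = re.findall(r"#(\w+)", tweet)
    extra = []
    for tag in hashtags:
        # Split CamelCase or lowercase hashtags into word tokens.
        words = re.findall(r"[A-Z][a-z]+|[a-z]+|\d+", tag)
        extra.extend(w.lower() for w in words)
    return tweet + " " + " ".join(extra) if extra else tweet

print(expand_tweet("Heatwave again this week #ClimateChange"))
# → Heatwave again this week #ClimateChange climate change
```

The article's framework replaces this hand-written heuristic with representation learning, so that semantically related expansion terms (not just the hashtag's own surface words) are recovered for each tweet.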