Social Media Text Classification by Enhancing Well-Formed Text Trained Model

Phat Jotikabukkana; Virach Sornlertlamvanich; Okumura Manabu; Choochart Haruechaiyasak

doi:10.5614/itbj.ict.res.appl.2016.10.2.6

Social Media Text Classification by Enhancing Well-Formed Text Trained Model

Published: 2016-08-31 | DOI: 10.5614/itbj.ict.res.appl.2016.10.2.6 | Vol 10, Issue 2 (2016) | Cited by: 8

Authors

Phat Jotikabukkana

School of ICT, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani,

Virach Sornlertlamvanich

School of ICT, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani,

Okumura Manabu

Tokyo Institute of Technology, Ookayama Campus, Ookayama Meguro-ku, Tokyo,

Choochart Haruechaiyasak

National Electronics and Computer Technology Center, Thailand Science Park, Pathum Thani

Abstract

Social media are a powerful communication tool in our era of digital information. The large amount of user-generated data is a useful novel source of data, even though it is not easy to extract the treasures from this vast and noisy trove. Since classification is an important part of text mining, many techniques have been proposed to classify this kind of information. We developed an effective technique of social media text classification by semi-supervised learning utilizing an online news source consisting of well-formed text. The computer first automatically extracts news categories, well-categorized by publishers, as classes for topic classification. A bag of words taken from news articles provides the initial keywords related to their category in the form of word vectors. The principal task is to retrieve a set of new productive keywords. Term Frequency-Inverse Document Frequency weighting (TF-IDF) and Word Article Matrix (WAM) are used as main methods. A modification of WAM is recomputed until it becomes the most effective model for social media text classification. The key success factor was enhancing our model with effective keywords from social media. A promising result of 99.50% accuracy was achieved, with more than 98.5% of Precision, Recall, and F-measure after updating the model three times.

Keywords

Computer science; Data mining; Profit (economics); Margin (machine learning); Product (mathematics); Machine learning; Mathematics

1 Introduction

In these days, social media have a huge impact on our life in many dimensions: socialization, business, politics, etc. It is the most popular digital communication tool to spread ideas and information. The present number of active social media accounts is around 2 billion out of all 7.2 billion people in the world [1]. This novel source of data is very attractive for researchers and decision makers. In our research, we focused on Twitter as a social media text source for two reasons. Firstly, Twitter is one of the most popular social media applications and it is the fastest growing one. Referring to recent statistics, there are 316 million monthly active users who generate 500 million Twitter messages (tweets) per day [2]. Twitter had 95% growth in active users and 35% growth in members through 2014 [3]. These are significant numbers, which means that we will gain a lot of benefit if we can extract the essence from this data type. Secondly, the data structure of Twitter and their support API are convenient for researchers to operate with. Tweets are text files limited to 140 characters presented in JSON file format [4] and a Twitter search API can be used to retrieve tweets with a rate limited to 180 queries per 15-minute window [5] and a seven-day search back [6].

The main challenge is to analyze social media text. Since tweets are short text messages, they look like colloquial text compared to written documents. The data stream contains a large amount of noisy and unstructured information, informal language, slang and missing words. This makes text classification in order to distinguish categories before extracting useful information very difficult. In our experiment we therefore applied a technique to extract keywords using Term Frequency-Inverse Document Frequency (TF-IDF) and Word Article Matrix (WAM) to expand the set of keywords reflecting the nature of the texts retrieved from Twitter. We collected data and created word vectors from an online news source consisting of well-formed text, which was already categorized beforehand by publishers to extract keywords and classify information from Twitter. Semi-supervised machine learning can solve selfassigned classes labeling for topic classification problems. The computer can automatically extract categories from the news website for use as proper classlabels with a sense of human familiarity, such as Economic, Entertainment, Foreign, Information Technology (IT), Politics, Regional, Sports, etc. Finally, we get a productive set of keywords from the official site and Twitter, which can be representative for text categories. New words, abbreviations and argot that never appear in well-formed documents are extracted from the tweet messages, which can then be used as the main keywords that reflect interesting topics in society at a certain moment in time.

This paper is organized as follows. Related research works are discussed in Section 2. In Section 3, we explain our approach and the main techniques that were used. The experimental results are shown in Section 4. Finally, in Section 5, we draw the overall conclusions from the experiment, including a brief discussion of future work regarding the effectiveness of social media text classification.

A number of recent papers address social media text classification. Irfan, et al. [7] reviewed different text mining techniques to discover various textual patterns from the social web. Text mining using classification with various machine learning-based and ontology-based algorithms and a hybrid approach were reviewed. There is no algorithm that performs best for all kinds of data sets. For better performance of the hybrid approach, several parameters need to be defined in advance. Patel, et al. [8] reviewed different types of classifiers for text classification with an eye on their advantages and disadvantages. Six different algorithms were reviewed: Bayesian Classifier, Decision Tree, Knearest neighbor (K-NN), Support Vector Machine (SVM), Neural Network, and Rocchio's. The common disadvantage of all algorithms is their performance limitations. Some of them are easy to implement but their performance is very poor. Some of them perform greatly but need extra time for training and parameter tuning. Lee, et al. [9] classified Twitter Trending Topics with two approaches for topic classification: the well-known Bag-of-Words approach for text classification and network-based classification. They identified 18 classes and classified trending topics into these categories. In the final result, the network-based classifier performed significantly better than the text-based classifier. Kateb, et al. [10] discuss methods that overcome problems in classifying short texts from streaming data in social media. In the classification techniques section they present some common issues that are useful in general to address before conducting text classification: 1. Define the research goal. 2. Does speed matter? 3. What is the size of the data? On the basis of these simple considerations we selected a suitable technique (classification, regression or clustering) and suitable algorithm for conducting our experiment.

Chirawichitchai, et al. [11] compared six methods of feature weighting in a Thai document categorization framework. They found that ltc weighting with SVM yielded the best performance for Thai document categorization. Theeramunkong, et al. [12] proposed a multidimensional framework for classifying text documents. Classifying text documents based on a multidimensional category model by using multidimensional-based and hierarchy-based classifications beat flat-based classification. Viriyayudhakorn, et al. [13] compared four divergent thinking support engines using associative information extracted from Wikipedia. They used Word Article Matrix (WAM) to compute the association function. This is a useful and effective technique for divergent thinking support. Sornlertlamvanich, et al. [14] proposed a new method for fine-tuning a model trained with some known documents containing richer context information. They used WAM to classify text and track keywords from social media to understand social developments. WAM with cosine similarity measure is an effective method of text classification.

The related literature revealed that there are many different techniques for social media text classification. All algorithms still have complex issues related to their performance. This inspired us to adapt some useful techniques in a novel, simple way to effectively classify social media text with a sense of human familiarity.

3 Experiment

Viriyayudhakorn, et al. [13] and Sornlertlamvanich, et al. [14] used WAM only for their specific purposes (divergent thinking support and keyword tracking), while the present study focused on using WAM to classify social media text with additional techniques of text class self-learning (semi-supervised learning) and enhancing the WAM model with specific keywords from social media until it becomes the most suitable model for social media text classification.

The rest of this paper consists of two main parts: first, the main techniques that were used are discussed. Second, our approach for effective social media text classification is explained.

3.1 Main Techniques

3.1.1 Web Crawler

For the initial state, we need to retrieve news articles from an online news source consisting of well-formed text. A web crawler, also known as a robot or a spider [15], is the main module to get access to the data source. Because most websites today are implemented with hypertext markup language (HTML), extensible markup language (XML) and cascading style sheet (CSS), the structure of the targeted website must be verified. Then, the uniform resource locator (URL), the news category and the news article part, which are needed as main parameters, have to be specified. Afterwards, these parameters are applied through the XML Path (XPath) query technique to retrieve the demanded data, i.e. the online news articles. The RapidMiner software application [16] was used as the main web crawler module.

As an example, the XPath query command was used to automatically extract the news category (sports) from the part "<body class= "single category-sports topnav">" as shown in Figure 1. This is an effective semi-supervised learning method. The computer can extract the news category even if the publisher decides to change the category label related to an article.

Figure 1 Example code of an online news article (from http://dailynews.co.th/) [17].

3.1.2 Word Segmentation

The experiment in this study was conducted on social media text written in the Thai language. Word segmentation is a crucial factor in text mining. However, the Thai language is written without spaces between words. Therefore, a word segmentation module was used, applying the maximal matching algorithm to determine the word boundaries [18]. The recent word list in the dictionary was updated before the research was conducted. Consequently, the segmentation result was acceptable for determining the essential words for further processing in keyword identification.

3.1.3 Term Frequency-Inverse Document Frequency (TF-IDF)

At present, there are many weighting schemes for text mining: Boolean weighting, Term Frequency (TF) weighting, TF-IDF weighting, tfc weighting, ltc weighting, and Entropy weighting [11]. TF-IDF is the most widely used technique to extract keywords from documents. It is composed of 2 steps: Term Frequency (TF) and Inverse Document Frequency (IDF). TF is computed from the number of times a word appears in a document, divided by the total number of words in that document. It can be defined as a counting function in Eq. (1) [19].

\[TF(t,d) = \sum_{x \in d} fr(x,t)\] (1)

TFሺt, dሻ is actually the total number of term t appearing in document d and frሺx, tሻ is a simple function defined as Eq.(2):

\[fr(x,t) = \begin{cases} 1, & \text{if } x = t \\ 0, & \text{otherwise} \end{cases}\] (2)

IDF is defined as the logarithm of the number of all documents in a collection divided by the number of documents in which the observed term appears in Eq. (3).

\[IDF(t) = \log \frac{|D|}{1 + |\{d: t \in d\}|}\] \[\tag{3}\]

\(1 + |\{d: t \in d\}|\) is the number of documents in which the term t appears. When the term-frequency function satisfies \(TF(t, d) \neq 0\), we apply "1 +" to avoid divide by zero. Then, the TF-IDF formula can be defined as (4):

\[TF-IDF(t) = TF(t, d) \times IDF(t)\] (4)

3.1.4 Term Frequency Merging (TF-Merging)

Term Frequency (TF) is one of the weighting techniques that can be used to identify the importance weight of words related to their corpus. However, we cannot directly merge TF values of words that appear in more than one corpus. Each corpus contains a different total number of words with different weights. For example, the word "AAA" in corpus 1 has a TF value of 0.5, while the total number of words in Corpus 1 is 100 words. The word "AAA" in Corpus 2 has a TF value of 0.4, while the total number of words in Corpus 2 is 1,000 words. We cannot find the TF value of the word "AAA" by adding their 2 TF values directly because we are considering two different corpus weights. The way to solve this problem is using word vector normalization (TF normalization) (5), (6) and TF-merging (7).

\[||V||_{2} = \sqrt{V_{1}^{2} + V_{2}^{2} + V_{3}^{2} + \dots + V_{n}^{2}}\] (5)

\(||V||_2\) is the L2-normalization factor, Euclidean norm. \(V_1, V_2, \ldots, V_n\) are the term of occurrence of each word (i) for (n) words in the corpus.

\[TF_{word(i)} = \frac{V_i}{||\overline{V_i}|_2} \tag{6}\]

\(TF_{word(i)}\) is the normalized TF value of word (i), while \(V_i\) is the term of occurrence of word (i) and \(||V||_2\) is the L2-normalization factor of the corpus.

\[TF_{word(i)total} = \sum_{j=0}^{n} TF_{word(i)(j)}\] (7)

\(TF_{word(i)\ total}\) is the final TF value of word (i) when considering n as the number of words in the corpus. While \(TF_{word\ (i)\ (j)}\) is the normalized TF value of word (i) in corpus (j), start from 1 to n.

For example, in Figure 2, if Corpus 1 (Economic category) contains 3 words in total: "investment", "prime minister" and "airport" with their terms of occurrence at 40, 10, and 20 respectively. We can calculate the L2-normalization factor of this corpus as the square root of ((40 power 2) + (10

power 2) + (20 power 2)), which is equal to 45.826. Then we can find that the TF value of the word "investment" in Corpus 1 (Economic category) is 0.8729.

i-WAM	Investment	Prime Minister	Football	Airport		\|\|V\|\|₂
Economic	40	10	0	20	V 10	\[\frac{10^2 + 0^2 + 20^2}{45.826}\]
Politic	20	50	0	10		\[50^2 + 0^2 + 10^2\] = 54.772
Sports	0	0	70	10	,	\(0^2 + 70^2 + 10^2\) = 70.711
i-WAM	Inves	tment	Prime Ministe	r Fo	otball	Airport
Economic		3729 5.826)	0.2182 (10/45.826)	(0/	0 45.826)	0.4364 (20/45.826)
Politic		8651 4.772)	0.9129 (50/54.772)	(0/	0 54.772)	0.1826 (10/54.772)
Sports		0 0.711)	0 (0/70.711)	_	.9899 /70.711)	0.1414 (10/70.711)

Figure 2 Example of TF value (normalized) calculation.

For an example of the TF-Merging operation, see Figure 3. Here, the normalized TF value of the word "investment" in Corpus 1 (Economic category from online news articles) is 0.872. The normalized TF value of the word "investment" in Corpus 2 (Economic category from related tweets) at 0.036. Finally, we see that the total TF value of the word "investment" is 0.872 + 0.036 = 0.908, which can be representative for the Economic category.


m-WAM	Invest ment	Prime Minister	Foot ball	Air port	Bank of America	interest	stock
EC1 (news)	40	10	0	20	0	0	0	45.825
EC2 (tweets)	1	0	0	1	25	10	5	27.422
m-WAN	Л	Invest ment	Prime Minister	Foot ball	Air port	Bank of America	interest	stock
EC1 (news)		0.872	0.218	О	0.436	0	0	0
EC2 (tweet	s)	0.036	0	0	0.036	0.911	0.364	0.182
m-WAN	Л	Invest ment	Prime Minister	Foot ball	Air port	Bank of America	interest	stock
EC1 (updat	ed)	0.908	0.218	0	0.472	0.911	0.364	0.182

Figure 3 Example of TF-Merging operation.

3.1.5 Word Article Matrix (WAM)

WAM is a significant data structure [13] in the Generic Engine for Transpose Association (GETA). It creates a large matrix of weighted relationships between documents and keywords in which the rows are indexed by names of documents (articles) and the columns are indexed by keywords from the documents. The keywords in the documents are counted to fill in the table as shown in Figure 4(a). The initial WAM (i-WAM) is generated by using the normalized TF value of each word. The i-WAM with normalized TF values is shown in Figure 4(b). The documents and words are represented in the form of vectors. The value in each row is the vector of the words representing a document.

Assume there is a query: "You can run the Business Intelligence Wizard to create currency conversion calculations." This query is converted into a model of word vectors, as shown in Figure 4(c).

(a) An ex	ample of W	AM		(b) An exa	mple of the	e i-WAM
Article\ Word	Currency	Intel ligence	Football	Article\ Word	Currency	Intel ligence	Football
Economic	10		2	Economic	0.47		0.10
Politic	2	9	3	Politic	0.10	0.95	0.15
Sports		1	11	Sports		0.05	0.82
(c) An sa	mple query	with word	count	(d) A Cos	ine Similari	ty result
Query	Currency	Intelligend	e Football		rticle	Re	sult
Query	1	1	0	Econom	ic	0.0	592
				Politic		0.3	768
				Sports		0.0	043

Figure 4 Example of WAM.

The set of documents in a corpus is viewed as a set of vectors in a vector space. Each term will have its own axis. Using the cosine similarity technique [20] we can find out the similarity between any two documents (8).

Cosine Similarity(d1, d2) = \[\frac{d1.d2}{||d1||*||d2||}\] (8)

The Cosine Similarity (d1, d2) is the similarity between document d1 and d2, where d1.d2 is the dot product of document vectors d1 and d2. ||d1|| * ||d2|| is the Euclidean length of document vectors d1 and d2.

Lastly, we calculate the cosine similarity values and get the result of the example query as shown in Figure 4(d). As the weight of the word "intelligence" in the Information Technology (IT) category is high, 0.95, the result of the operation shows that the query is more likely to be for a document about IT, which produced the highest cosine similarity score at 0.768.

3.2 This Study's Approach

We propose a semi-supervised learning technique with the utilization of a wellformed text source, as shown in Figure 5. This is the first step. An online news source is used as the main source to collect data from, which gives access to a well-formed document with appropriate grammar that is properly categorized by the publishers. The online news article was retrieved from the Dailynews website, http://www.dailynews.co.th/ [17], published by a popular newspaper in Thailand.

Figure 5 Initial WAM implementation.

A total of 13,085 news articles were collected, as shown in Table 1. The news categories that will become class labels can be extracted automatically. The news articles can be extracted related to their category, after executing the following preprocessing steps: removing HTML tags, removing stop words, word stemming. Then we used Thai word segmentation and the TF-IDF weighting technique to extract a bag of keywords from each news category. Afterwards, we generated the initial WAM (i-WAM) from the set of extracted keywords. The top six of the terms with the highest TF-IDF score were selected as the keywords for each category to search Twitter to enhance the model.

Category	Number of
	Documents
Economic	1630
Entertainment	1450
Foreign	1550
IT	1500
Politics	2755
Regionals	2550
Sports	1650

Table 1 Numbers of retrieved online news documents.

Then, we used the keyword set from i-WAM to collect related tweets through a Twitter search API, as shown in Figure 6. The API allows collecting related social media text, where the search index has a searchback limit of 7 days. After collecting a heap of tweets, around twenty thousand, they were saved in text file format. Subsequently, the same process as described before was used to extract keywords by using Thai word segmentation and the TF-IDF technique. Additional terms were selected according to their TF-IDF value. The result was a new set of keywords indicating specific categories that are potentially used in social media.

Figure 6 Modified WAM (1) implementation.

In the implementation of m-WAM, the Term Frequency merging (TF merging) technique is used, which is generated by updating i-WAM. The TF of existing words in i-WAM is recomputed with additional counting. The newly found words with their TF values are added into the table. As shown in Figure 7, the m-WAM process is repeated, iterating the procedure until a result is achieved in which Precision, Recall, F-measure, and accuracy are in steady state at nearly 100%. Finally, m-WAM is modified to fit social media text. This m-WAM will be an effective model containing terms that can represent a text category and reflect social developments.

Figure 7 Modified WAM (n) implementation.

As can be seen in Figure 8, the evaluation of social media text classification is conducted manually. The training data set is used for building the model while the testing data set is searched from Twitter randomly and used for evaluating the model. The retrieved tweets are evaluated by human judging. The testing

data set evaluates all models, from i-WAM to m-WAM (n). Finally, accuracy rate, Precision, Recall, and F-measure value are determined.

Figure 8 Evaluation process.

4 Experiment Result

After retrieving online news data by using the web crawler module and extracting a set of keywords, we selected the words with the highest TF-IDF score and generated the initial-WAM (i-WAM), as shown in Table 2. We added a row to show the IDF value of each keyword to identify their importance weight. The words "financial budget"/ "งบการเงิน", "Gubgib"/ "กุบกิ๊บ", "refugee"/ "ผูอพยพ", "Windows 10"/ "วินโดวส10", "politician"/ "นักการเมือง", "artificial rain"/ "ฝนหลวง", "karate"/ "คาราเต" are examples of keywords with their TF value in each category (Economic, Entertainment, Foreign, IT, Politics, Regional and Sports) respectively.

The keywords that were extracted from the online news source showed a significant result, especially the keywords from the Entertainment category

("Gubgib"/ "กุบกิ๊บ": the name of a popular actress in Thailand), IT category ("Windows 10"/ "วินโดวส10"), and sports category ("karate"/ "คาราเต"). However, their TF values can identify their text categories when we consider the word vector cosine similarity. Then, these keyword terms were used to search Twitter through the Twitter search API.

Article\Word Word1 Word2 Word3 Word4 Word5 Word6 Word7 financial budget 'งบการเงิน ' gubgib 'กุบกิ๊บ' refugee 'ผูอพย พ' Windows 10 'วินโดวส1 0' politician 'นักการเมื อง' artificial rain 'ฝนหลวง' karate 'คาราเต' IDF (t) 2.23044 1.83250 1.68638 1.88649 1.38535 2.23044 2.05435 Economic 0.00013 0.00001 Entertainment 0.00023 Foreign 0.00001 0.00044 0.00004 IT 0.00009 Politics 0.00035 0.00001 Regionals 0.00012 Sports 0.00001 0.00022

Table 2 Part of the i-WAM.

Around twenty thousand tweets were collected as our data source from to extract a new set of the Twitter keywords. The m-WAM1 was generated from these new specific keywords and merged with the existing keywords from i-WAM (TF merging operation), well-formed text source keywords. The newly found words in the m-WAM1 showed a significant result. For example, in Table 3, "refugee"/"ผอพยพ ู้ " (sample keyword from i-WAM) led to finding a new keyword, "Tier3"/ "เทยร ี 3", which scoped down the word vectors for the ์ Foreign category. Other category keywords also generated promising results.

Subsequently, the same m-WAM process was repeated. Keywords with high potential were selected from the m-WAM(n-1), i.e. the words with the highest TF-IDF score (top 5) in their own category, to gather all related tweets. From this technique, more specific keywords were found – less common words – which can effectively represent their category. Hence, we could generate a new m-WAM(n), which can be a productive model for social media text classification. This procedure was repeated until the Precision, Recall, Fmeasure, and accuracy results reached steady state. Finally, it was found that the iteration number of m-WAM that satisfied the best performance of social media text classification is 3 (n = 3), i.e. i-WAM, m-WAM1, m-WAM2 and m-WAM3. The rest of the results are shown in Tables 4 and 5.

Table 3 Part of m-WAM1

Article\Word	Word1	Word2	Word3	Word4	Word5	Word6	Word7
	financial budget 'งบการเงิน'	Gubgib 'กุบกิ๊บ'	refugee 'ผูอพยพ '	Window s 10 'วินโดวส	politician 'นักการเมื อง'	artificial rain 'ฝนหลวง'	karate 'คาราเต'
				10'
IDF (t)	1.59151	1.28831	1.57032	1.87550	1.27553	1.61838	1.47756
Economic	0.00077				0.00001
Entertainment		0.00156
Foreign	0.00001		0.00116		0.00004
IT				0.00076
Politics					0.00180	0.00001
Regionals					0.00008	0.00098
Sports			0.00001				0.00119

Article\ Word	Word8	Word9	Word10	Word11	Word12	Word13	Word14
	layoff 'ลดพนักง าน'	BeeKPN 'บKPN' ี้	Tier3 'เทียร3'	Microsoft 'ไมโคร ซอฟท'	drought 'ภัยแลง'	Venerable Monk 'หลวงพอนุ ช'	#Team Thailand '#ทีม ชาติไทย'
IDF (t)	2.22978	2.63726	2.31275	2.81335	2.31275	2.03520	3.11438
Economic	0.0010
Entertainment		0.00028
Foreign			0.00129
IT				0.00094
Politics			0.00007		0.00023
Regionals					0.00022	0.00140
Sports							0.00007

Table 4 Part of m-WAM2.

Article\Word	Word1	Word2	Word3	Word4	Word5	Word6	Word7
	financial budget 'งบการเงิน'	Gubgib 'กุบกิ๊บ'	refugee 'ผูอพยพ '	Windows 10 'วินโดวส 10'	politician 'นักการเมื อง'	artificial rain 'ฝนหลว ง'	karate 'คาราเต'
IDF (t)	1.40959	1.08944	1.40527	2.93247	1.10856	1.45535	1.36427
Economic	0.00130				0.00001
Entertainment		0.00261
Foreign	0.00004		0.00171		0.00004
IT				0.00072
Politics					0.00280	0.00001
Regionals					0.00008	0.00154
Sports			0.00001				0.00191

Article\Word	Word8	Word9	Word10	Word11	Word12	Word13	Word14
	layoff 'ลดพนัก งาน'	BeeKP N 'บKPN ี้ '	Tier3 'เทียร3'	Microsoft 'ไมโคร ซอฟท'	drought 'ภัยแลง'	Venerable Monk 'หลวงพอนุ ช'	#Team Thailan d '#ทีม
							ชาติไทย
							'
IDF (t)	1.37617	1.10209	1.39675	1.22775	1.08531	1.84139	3.10856
Economic	0.00326
Entertainment		0.00016
Foreign			0.00334
IT				0.01302
Politics			0.00007		0.00448
Regionals					0.00022	0.00226
Sports							0.00054
Article\Word	Word15	Word16	Word17	Word18	Word19	Word20	Word21
	Flight	Ploypan	Xinjiang	Intel	Prime	Abbot	Karate
	'เสนทาง	'พลอยพรร	'ซินเจียง'	'อินเทล'	minister	'เจาอาวา	do
	บิน'	ณ'			Tuu	ส'	'คาราเต-
					'นายกฯ ตู'		โด'
IDF (t)	2.56449	3.40959	2.56449	1.82981	2.56449	1.84139	2.93247
Economic	0.00066
Entertainment		0.00146
Foreign			0.00069
IT				0.00495
Politics					0.00242
Regionals						0.00540
Sports							0.00173

As an interesting point, some new terms were also added because they occurred very frequently on Twitter rather than in the online news document, for example more variations of abbreviations and trendy terms such as "#TeamThailand" / "#ทีมชาติไทย", a specific hash tag created by some social media users for Thailand sports fans, and "Prime Minister Tuu" / "นายกฯตู", "Tuu"/ "ตู" being the nick name of the present Prime Minister of Thailand and "นายกฯ" is an abbreviation of "Prime Minister" in the Thai language. This is a common phenomenon on Twitter, referring to the nature of this useful communication tool, free and open for opinion sharing with a 140-character limitation. In addition, when the series of all category keywords is considered, we can see a real-time reflection of interesting social issues. For example, from the keywords in the Foreign category in Table 5 – word 3 ("Refugee"/ "ผูอพยพ"), word 10 ("Tier3"/ "เทียร3"), word 17 ("Xinjiang"/ "ซินเจียง"), and word 24 ("Boycott"/ "คว่ําบาตร") – it can be seen that there was a topic related to the Uyghur

refugees with some political issues shared on social media during the period covered by the experiment.

Table 5 Part of m-WAM3.

Article\Word	Word1	Word2	Word3	Word4	Word5	Word6	Word7
	financial budget 'งบการเงิน	Gubgib 'กุบกิ๊บ'	refugee 'ผูอพย พ'	Windows 10 'วินโดวส	politician 'นักการเมื อง'	artificial rain 'ฝนหลวง'	karate 'คาราเ ต'
	'			10'
IDF (t)	1.41044	1.10941	3.41044	2.93331	3.41044	1.43731	1.2964 9
Economic	0.00174				0.00001
Entertainment		0.00349
Foreign	0.00005		0.00151		0.00004
IT				0.00067
Politics					0.00376	0.00001
Regionals					0.00008	0.00144
Sports			0.00001				0.0026 9

Article\Word	Word8	Word9	Word10	Word11	Word12	Word13	Word14
	layoff 'ลดพ	BeeKPN 'บี้KPN'	Tier3 'เทียร3'	Microsoft 'ไมโคร	drought 'ภัยแลง	Venerable Monk 'หลวงพอนุ	#Team Thailand '#ทีม
	นักงา			ซอฟท'	'	ช'	ชาติไทย
	น'						'
IDF (t)	1.30663	3.41044	1.39340	1.13168	1.10941	1.85413	2.26431
Economic	0.00465
Entertainment		0.00011
Foreign			0.00404
IT				0.01529
Politics			0.00007		0.00648
Regionals					0.00022	0.00166
Sports							0.00072

Article\Word	Word15	Word16	Word17	Word18	Word19	Word20	Word21
	flight 'เสนทาง	Ploypan 'พลอยพร	Xinjiang 'ซินเจียง	Intel 'อินเทล'	prime minister Tuu	Abbot 'เจาอาวาส	karate do
		รณ'	'		'นายกฯ ตู'	'	'คาราเต
	บิน'						-โด'
IDF (t)	2.56534	1.40611	1.44195	1.70287	1.63228	1.16740	2.29649
Economic	0.00065
Entertainment		0.00281
Foreign			0.00310
IT				0.00469
Politics					0.00140
Regionals						0.01163
Sports							0.00184

Article\Word	Word22	Word23	Word24	Word25	Word26	Word27	Word28
	flight cancelati on	acknowledge ment of children	boycott 'คว่ําบาตร'	Blognone 'Blognone'	Taksin 'ทักษิณ'	Putta Issara 'หลวงปูพุทธ	Thailand Open 'ไทยแลน
	'ยกเลิกเที่ ยวบิน'	'รับรองบุตร'				อิสระ'	ด โอเพน'
IDF (t)	2.03022	1.97907	3.41044	3.10941	3.41044	2.71147	1.73834
Economic	0.00147
Entertainment		0.00112
Foreign			0.00005
IT				0.00017
Politics					0.00010
Regionals						0.00052
Sports							0.00533

As for the evaluation process, Figure 9 (a)-(d) shows the Precision, Recall and F-measure values. All models were evaluated with a testing data set, which was randomly extracted from Twitter.

(a	) The	i-WAM	evaluation	score


Accuracy 80.22%	Precision	Recall	F-score	Accuracy 98.02%	Precision	Recall	F-score
Economic	40.50%	92.22%	56.28%	Economic	95.22%	100%	97.55%
Entertainment	98.30%	99.11%	98.70%	Entertainment	100%	98.15%	99.07%
Foreign	97.50%	99.62%	98.55%	Foreign	100%	99.08%	99.54%
IT	80.25%	100%	89.04%	IT	100%	100%	100%
Politic	74.15%	97.83%	84.36%	Politic	96.63%	100%	98.29%
Regional	95.20%	47.75%	63.60%	Regional	100%	82.80%	90.59%
Sports	94%	99.24%	96.55%	Sports	94.20%	100%	97.01%

Accuracy 99.25%	Precision	Recall	F-score	Accuracy 99.55%	Precision	Recall	F-score
Economic	98.47%	100%	99.23%	Economic	99.55%	100%	99.77%
Entertainment	100%	98.20%	99.09%	Entertainment	100%	99.42%	99.71%
Foreign	100%	100%	100%	Foreign	100%	99.41%	99.70%
IT	100%	100%	100%	IT	100%	100%	100%
Politic	98.74%	100%	99.37%	Politic	98.92%	100%	99.46%
Regional	100%	97.55%	98.76%	Regional	100%	98.12%	99.05%
Sports	98.27%	100%	99.13%	Sports	99.50%	100%	99.75%

Figure 9 Accuracy rate, Precision, Recall and F-measure values.

Because there were more common keywords in i-WAM, its Precision, Recall, and F-measure scores were low, especially in the Economic and Politics categories. However, when m-WAM was updated with more specific keywords from related social-media text, all of the evaluation factors increased dramatically. Finally, the value of Precision, Recall, F-measure, and accuracy nearly converged to 100% after reaching m-WAM3, as shown in Figure 10.

Figure 10 Graphs of accuracy rate, Precision, Recall, and F-measure values.

5 Conclusions and Future Work

The growth and information power of social media text are remarkable. Keywords collected from social media can be a prediction tool of social developments. A holistic decision support system can be developed according to interesting topics collected from the dynamic social media environment, which is a factor of concern today and will be in the future. Social media text classification using Term Frequency-Inverse Document Frequency (TF-IDF) weighting and Word Article Matrix (WAM) is very effective. Text from social media can be categorized with a sense of human familiarity by utilizing online news categories that have already been indicated by the publishers. Good results can be expected from the proper modified WAM (m-WAM) for social media text classification after updating it for 3 times, the suitable iteration number of m-WAM modifications. This modified WAM can be a suitable model for social

media text classification and the set of keyword terms can be representative of interesting social topics during the time of monitoring. However, a good result also depends on the performance of the Thai word segmentation module. Alternative Thai word segmentation programs, such as Name Entity Recognition (NER), can generate proper word boundaries for conducting other processes, so keywords can be generated more accurately and the model's accuracy will be improved significantly. Deep learning could also be a good choice for conducting experiments related to natural language processing and text mining.

Acknowledgements

This research was financially supported by Thailand Advanced Institute of Science and Technology (TAIST), National Science and Technology Development Agency (NSTDA), Tokyo Institute of Technology, Sirindhorn International Institute of Technology (SIIT), Thammasat University (TU) under the TAIST Tokyo Tech Program, and partly supported by the National Research University Project of the Office of the Higher Education Commission and Thammasat University.

We would also like to express our gratitude to Dr. Eakasit Pacharawongsakda for his assistance with data-mining techniques and his comments that greatly improved the manuscript during the course of this research.

Research Intelligence

Data from OpenAlex ↗

Metrics

8

Citations

3.45

FWCIfield-weighted

94th

Percentilevs same year + field

Article

Work type

GOLD

Open Access

Topics

Data Mining Algorithms and Applications Primary

1.00

Information Systems › Computer Science › Physical Sciences

Rough Sets and Fuzzy Logic

0.99

Data Management and Algorithms

0.96

Related Research

Citation Trend

2

2017

2

2018

1

2019

2

2022

1

2023

Citation Timeline

Year	Citations
2023	1
2022	2
2019	1
2018	2
2017	2

Semantic Profile AI-classified research signals

Core Domains

Computer science 0.67

level 0

Data mining 0.64

level 1

Profit (economics) 0.60

level 2

Secondary Topics

Margin (machine learning) 0.47 Product (mathematics) 0.43 Machine learning 0.22 Mathematics 0.21

Institution Network

Burapha University TH
Komate Amphawan
Université de Bretagne Occidentale FR
Philippe Lenca
Institut Mines-Télécom FR
Philippe Lenca
Laboratoire des Sciences et Techniques de l’Information de la Communication et de la Connaissance FR
Philippe Lenca
King Mongkut's University of Technology North Bangkok TH
Institut Telecom, Telecom Bretagne, UMR CNRS 3192 Lab-STICC, France
Chulalongkorn University TH
Anuchit Jitpattanakul · Athasit Surarerks

References

Simon, K., Digital, Social & Mobile Worldwide in 2015, We Are Social Ltd., http://wearesocial.net/tag/statistics/ (21 January 2015).
Twitter, Twitter Usage/Company Facts, Twitter, Inc., https://about. twitter.com/company (30 June 2015).
Dave, C. Global Social Media Research Summary 2015, Smart Insights (Marketing Intelligence), Ltd., http://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research (13 October 2015).
Twitter, Entities in Objects, Twitter, Inc., 2015. https://dev.twitter.com/
overview/api/entities-in-twitter-objects (13 October 2015).
Twitter, API Rate Limits, Twitter, Inc., 2015. https://dev.twitter.com/rest/
public/rate-limiting (13 October 2015).
Episod, Search API is limited to the last 7 days?, Twitter, Inc., https://twittercommunity.com/t/search-api-is-limited-to-the-last-7-days/ 11603 (17 July 2013).
Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D., Rayes, A., Tziritas, N., Xu, C.Z., Zomaya, A.Y., Alzahrani, A.S. & Li, H., A Survey on Text Mining in Social Networks, Cambridge Journal, The Knowledge Engineering Review, 30(2), pp. 157-170, 2015.
Patel, P. & Mistry, K., A Review: Text Classification on Social Media Data, IOSR Journal of Computer Engineering, 17(1), pp. 80-84, 2015.
Lee, K., Palsetia, D., Narayanan, R., Patwary, Md.M.A., Agrawal, A. & Choudhary, A.S, Twitter Trending Topic Classification, in Proceeding of the 2011 IEEE 11th International Conference on Data Mining Workshops, ICDW
Kateb, F. & Kalita, J., Classifying Short Text in Social Media: Twitter as Case Study, International Journal of Computer Applications, 111(9), pp. 1-12, 2015.
Chirawichitichai, N., Sanguansat, P. & Meesad, P., A Comparative Study on Feature Weight in Thai Document Categorization Framework, 10th International Conference on Innovative Internet Community Services (I2CS), IICS, pp. 257-266, 2010.
Theeramunkong, T. & Lertnattee, V., Multi-Dimension Text Classification, SIIT, Thammasat University, 2005.http://www.aclweb.org
/anthology/C02-1155 (25 October 2015).
Viriyayudhakorn, K., Kunifuji, S. & Ogawa, M., A Comparison of Four Association Engines in Divergent Thinking Support Systems on Wikipedia, Knowledge, Information, and Creativity Support Systems, KICSS2010, Springer, pp. 226-237, 2011.
Sornlertlamvanich, V., Pacharawongsakda, E. & Charoenporn, T., Understanding Social Movement by Tracking the Keyword in Social Media, in MAPLEX2015, Yamagata, Japan, February 2015.
Olston, C. & Najork, M., Web Crawling, Foundation and Trends in Information Retrieval, 4(3), pp. 175-246, 2010.
RapidMiner, The Open Source Platform of Choice, Rapid Miner, 2015. https://rapidminer.com/ (15 October 2015).
Dailynews, Online News, Dailynews web, Ltd., 2015, http://www.daily news.co.th/ (15 October 2015).
Meknavin, S., Charoenpornsawat, P. & Kijsirikul, B., Feature-based Thai Word Segmentation, National Electronics and Computer Technology Center, 1997, http://www.cs.cmu.edu/~paisarn/papers/nlprs97.pdf (15 October 2015).
Wu, H.C., Luk, R.W.P., Wong, K.F. & Kwok, K.L., Interpreting TF-IDF Term Weights as Making Relevance Decisions, ACM Transactions on Information Systems, 26(3), Article 13, pp. 1-37, 2008. DOI: 10.1145/1361684.1361686
Vembunarayanan, J., Tf-Idf and Cosine Similarity, https://janav. wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/ (27 October 2013).

Download PDF

[ref-1] Simon, K., Digital, Social & Mobile Worldwide in 2015, We Are Social Ltd., http://wearesocial.net/tag/statistics/ (21 January 2015).

[ref-2] Twitter, Twitter Usage/Company Facts, Twitter, Inc., https://about. twitter.com/company (30 June 2015).

[ref-3] Dave, C. Global Social Media Research Summary 2015, Smart Insights (Marketing Intelligence), Ltd., http://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research (13 October 2015).

[ref-4] Twitter, Entities in Objects, Twitter, Inc., 2015. https://dev.twitter.com/

[ref-5] overview/api/entities-in-twitter-objects (13 October 2015).

[ref-6] Twitter, API Rate Limits, Twitter, Inc., 2015. https://dev.twitter.com/rest/

[ref-7] public/rate-limiting (13 October 2015).

[ref-8] Episod, Search API is limited to the last 7 days?, Twitter, Inc., https://twittercommunity.com/t/search-api-is-limited-to-the-last-7-days/ 11603 (17 July 2013).

[ref-9] Irfan, R., King, C.K., Grages, D., Ewen, S., Khan, S.U., Madani, S.A., Kolodziej, J., Wang, L., Chen, D., Rayes, A., Tziritas, N., Xu, C.Z., Zomaya, A.Y., Alzahrani, A.S. & Li, H., A Survey on Text Mining in Social Networks, Cambridge Journal, The Knowledge Engineering Review, 30(2), pp. 157-170, 2015.

[ref-10] Patel, P. & Mistry, K., A Review: Text Classification on Social Media Data, IOSR Journal of Computer Engineering, 17(1), pp. 80-84, 2015.

[ref-11] Lee, K., Palsetia, D., Narayanan, R., Patwary, Md.M.A., Agrawal, A. & Choudhary, A.S, Twitter Trending Topic Classification, in Proceeding of the 2011 IEEE 11th International Conference on Data Mining Workshops, ICDW

[ref-12] Kateb, F. & Kalita, J., Classifying Short Text in Social Media: Twitter as Case Study, International Journal of Computer Applications, 111(9), pp. 1-12, 2015.

[ref-13] Chirawichitichai, N., Sanguansat, P. & Meesad, P., A Comparative Study on Feature Weight in Thai Document Categorization Framework, 10th International Conference on Innovative Internet Community Services (I2CS), IICS, pp. 257-266, 2010.

[ref-14] Theeramunkong, T. & Lertnattee, V., Multi-Dimension Text Classification, SIIT, Thammasat University, 2005.http://www.aclweb.org

[ref-15] /anthology/C02-1155 (25 October 2015).

[ref-16] Viriyayudhakorn, K., Kunifuji, S. & Ogawa, M., A Comparison of Four Association Engines in Divergent Thinking Support Systems on Wikipedia, Knowledge, Information, and Creativity Support Systems, KICSS2010, Springer, pp. 226-237, 2011.

[ref-17] Sornlertlamvanich, V., Pacharawongsakda, E. & Charoenporn, T., Understanding Social Movement by Tracking the Keyword in Social Media, in MAPLEX2015, Yamagata, Japan, February 2015.

[ref-18] Olston, C. & Najork, M., Web Crawling, Foundation and Trends in Information Retrieval, 4(3), pp. 175-246, 2010.

[ref-19] RapidMiner, The Open Source Platform of Choice, Rapid Miner, 2015. https://rapidminer.com/ (15 October 2015).

[ref-20] Dailynews, Online News, Dailynews web, Ltd., 2015, http://www.daily news.co.th/ (15 October 2015).

[ref-21] Meknavin, S., Charoenpornsawat, P. & Kijsirikul, B., Feature-based Thai Word Segmentation, National Electronics and Computer Technology Center, 1997, http://www.cs.cmu.edu/~paisarn/papers/nlprs97.pdf (15 October 2015).

[ref-22] Wu, H.C., Luk, R.W.P., Wong, K.F. & Kwok, K.L., Interpreting TF-IDF Term Weights as Making Relevance Decisions, ACM Transactions on Information Systems, 26(3), Article 13, pp. 1-37, 2008. DOI: 10.1145/1361684.1361686

[ref-23] Vembunarayanan, J., Tf-Idf and Cosine Similarity, https://janav. wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/ (27 October 2013).

Social Media Text Classification by Enhancing Well-Formed Text Trained Model

Authors

Abstract

Keywords

1 Introduction

2 Related Works

3 Experiment

3.1 Main Techniques

3.1.1 Web Crawler

3.1.2 Word Segmentation

3.1.3 Term Frequency-Inverse Document Frequency (TF-IDF)

3.1.4 Term Frequency Merging (TF-Merging)

3.1.5 Word Article Matrix (WAM)

3.2 This Study's Approach

4 Experiment Result

5 Conclusions and Future Work

Acknowledgements

Research Intelligence

Metrics

Topics

Related Research

Citation Trend

Citation Timeline

Semantic Profile AI-classified research signals

Institution Network

References