
A Classifier to Detect Profit and Non Profit Websites Upon Textual Metrics for Security Purposes

Abstract

Currently, most organizations have a defense system to protect their digital communication network against cyberattacks. However, these defense systems treat all network traffic alike, regardless of whether it comes from profit or non-profit websites. This leads to enforcing more security policies than necessary, which negatively affects network speed. Since the most dangerous cyberattacks are aimed at commercial websites, because they contain more critical data such as credit card numbers, it is better to prioritize the defense system towards actual attacks that come from profit websites. This study evaluated the effect of textual website metrics in determining whether a website is profit or non-profit for security purposes. Classifiers were built to predict the type of website as profit or non-profit by applying machine learning techniques to a dataset. The corpus used for this research included profit and non-profit websites. Both traditional machine learning and deep learning techniques were applied. The results showed that J48 performed best in terms of accuracy according to its outcomes in all cases. The newly built models can be a significant tool for the defense systems of organizations, as they will help them to implement the necessary security policies associated with attacks that come from both profit and non-profit websites. This will have a positive impact on the security and efficiency of the network.


1 Introduction

Nowadays, people commonly use websites for communicating with different types of institutions [1]. Websites are significant for both users and owners to achieve their goals. Websites can differ from each other according to the type of services they provide. Sites that are developed for profit institutions that are interested in providing financial services are called profit websites. The American shopping website Ebay.com is an example of a profit website [2]. The other type is called nonprofit websites, which are developed for nonprofit institutions and provide public informational services for users. Yarmouk University's website (https://www.yu.edu.jo) is an example of a nonprofit website.

Every website on the Internet is vulnerable to security attacks. The threats range from human mistakes to advanced attacks by organized cyber criminals. According to Verizon's Data Breach Investigations Report, the main motive for cyber attackers is financial. So, whether you run an eCommerce project or a simple business website, the possibility of an attack is always there [3]. This is because businesses usually store customers' bank account and credit card data, mailing addresses, email addresses, usernames and passwords. Cyber attackers use these data to make money through credit card fraud or use consumers' private information for identity theft or fraud [4]. Therefore, website security is a major issue for business and profit companies [5].

An organization with a defense system to protect against cyberattacks has to be aware of cyber security threats that come from profit websites more than from nonprofit websites. Business websites are more vulnerable to security attacks because they keep more sensitive data such as credit card numbers.

Each cyberattack on an organization's network has its own characteristics and with the wide range of different kinds of attacks going around, it may seem impossible to protect your network against all of them [3]. Instead, you can direct your defense system towards actual threats and dangerous attacks, which come from high-traffic websites, such as profit websites. This would enhance the level of security and have a positive effect on the performance of the network. A defense system that only has to deal with specific attacks, rather than with all of them, minimizes network delay because it enforces only the necessary security policies on the network, which is reflected in the speed of the network.

For this reason, our research built a classifier that can distinguish profit from nonprofit websites. This can be very useful for companies to set the proper security level for their defense systems without network overhead, which can help these institutions to increase the security and efficiency of their network.

Many research papers developed web page classification using mining techniques, but none of them studied classification of web pages into profit and nonprofit based on textual analysis for security purposes, like we have done in our research project. Some of these previous studies are discussed below.

Babapour and Roostaee [6] proposed an approach to address the challenge of classifying significant web content (short-term web content and long-term web content), where such a method could enhance search engine performance. They classified web pages into these two types by using machine learning techniques. They used natural language processing in addition to text mining methods for pre-processing the data, and then applied the machine learning techniques for classifying web pages. Qazi and Goudar [7] proposed a feasible solution to the problems associated with web page classification using a method called Ontology-based Term Weighting. Their approach depends on constructing a domain ontology and then choosing elements that can enhance the classification process. They conducted an experiment to evaluate their approach and came up with promising results.

Sun, et al. [8] implemented a support vector machine (SVM) by grabbing web pages from different sites into different classes and taking both the content and context elements into consideration. Their classification approach was verified on a dataset called WebKB. This classification method produced better results than other classification approaches such as FOIL-PILFS on the same dataset. The authors also demonstrated that including context elements (chiefly hyperlinks) enhances the efficiency of classification. The work by Hongjian and Yifei in [9] used an SVM classifier on a sample dataset and then applied particle swarm optimization (PSO) to optimize the parameters. During the testing process of their new method, about 100,000 web pages were gathered. F-measure was used to evaluate the empirical results, which indicated that their approach outperformed the plain SVM technique.

Chun, et al. [10] proposed a technique for extracting the information content from news web pages based on density characteristics. In their approach, HTML documents are divided into several blocks. The next step is to compute a value for each document according to particular density characteristics. Finally, the C4.5 data mining technique is applied to generate a classification of the document blocks. This approach makes extracting information from news web pages simpler. Empirical tests indicated that this technique is easy and efficient for extracting news information. The work by Yazdani, et al. in [11] implemented naïve Bayes models for each category of web pages and extended the Hidden Markov model that was applied. A group of websites was used to compute the parameters of the models. Websites were represented as tree structures in order to classify them, and the Viterbi algorithm was modified to estimate the probability of each model producing these structures.

Fiol-Roig, et al. [12] used a decision tree classifier that can classify web pages automatically, supporting search engines in retrieving the desired information. In their paper, they also tested the feasibility of using an automatic classifier.

Various classifiers built with different mining methods have been proposed. Ali, et al. in [13] presented the utilization of linear discriminant analysis (LDA), which is a general multivariate statistical data analysis method. It can enhance the classification processes of several classification models. LDA depends on the idea of maximizing the separation among groups. It is usually utilized for reducing the dimensionality of datasets. Ali and Abdullah [14] introduced fast HP-PL as a novel parallel method for simplifying dimensionality reduction. It also enhances the accuracy of big data classification by utilizing the computational abilities of distributed-memory clusters. The method was implemented on Apache Spark. The authors also explain the importance of dimensionality reduction, a data mining technique that has become very popular and is an important step in many ML methods. Ali and Abdullah in [15] introduced a new parallel application of grid optimization using Spark Radoop to decrease large computation loads and facilitate the processing of big data.

3 Methodology

Our proposed research project consists of four phases, which are shown in Figure 1.


Figure 1 The research project work flow.

These research project phases are explained in detail in the following sections.

3.1 Building a Profit and Nonprofit Websites Dataset

This was the first step in our research project, where various websites were collected manually. These websites included profit and non-profit websites. Profit websites are marketing websites while non-profit websites are public informational websites from institutions such as universities, hospitals and ministries.

We used Readability Test Tool to extract the textual metrics of the collected websites. Readability Test Tool is an easy and rapid tool that can assess the readability of published texts [16]. It computes the textual metrics of a web page, i.e., number of sentences, number of words, number of complex words, percentage of complex words, average number of words per sentence, and average number of syllables per word, where complex words are words with three or more syllables [16]. Figure 2 shows the textual metrics for the King Saud University website, as an example of using Readability Test Tool.
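To make the six metrics concrete, the following is a minimal sketch of how they could be computed from plain text. It is not Readability Test Tool's implementation: the vowel-group syllable heuristic, the regular expressions, and the function names are assumptions made for illustration, so the values it produces may differ from the tool's output.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def textual_metrics(text: str) -> dict:
    """Compute the six textual metrics named in the text for a plain-text string."""
    # Sentences are approximated as non-empty chunks between ., ! and ? marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    # Complex words are words with three or more syllables.
    complex_words = sum(1 for n in syllables if n >= 3)
    n_sent, n_word = len(sentences), len(words)
    return {
        "NoOfSent": n_sent,
        "NoOfWord": n_word,
        "NoOfComWord": complex_words,
        "ComWord": 100.0 * complex_words / n_word if n_word else 0.0,
        "AvgWordSent": n_word / n_sent if n_sent else 0.0,
        "AvgSyllWord": sum(syllables) / n_word if n_word else 0.0,
    }


if __name__ == "__main__":
    sample = "The university offers education. Students learn."
    for name, value in textual_metrics(sample).items():
        print(name, value)
```

The dictionary keys reuse the attribute names of the dataset so that each result maps directly onto one dataset row.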


Figure 2 Readability Test Tool results for the King Saud University home page.

We built our dataset using the MS-Excel 2010 spreadsheet application. It consisted of 237 rows. There were no missing values in our dataset. Its characteristics are shown in Table 1, while Table 2 shows a sample from the dataset.

| Attribute | Value |
|---|---|
| NoOfSent | Real |
| NoOfWord | Real |
| NoOfComWord | Real |
| ComWord | Real |
| AvgWordSent | Real |
| AvgSyllWord | Real |
| ProfitOrNonProfit | Profit, NonProfit |
| WebsiteType | University, Hospital, Ministry, Business |

Table 1 Dataset characteristics.

Table 2 Sample of dataset.

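| NoOfSent | NoOfWord | NoOfComWord | ComWord | AvgWordSent | AvgSyllWord | Profit or NonProfit | Website Type |
|---|---|---|---|---|---|---|---|
| 113 | 751 | 155 | 20.64 | 6.75 | 1.68 | Profit | Business |
| 117 | 1101 | 206 | 18.71 | 9.48 | 1.81 | Profit | Business |
| 140 | 423 | 75 | 17.73 | 4.61 | 1.78 | NonProfit | University |
| 385 | 2202 | 447 | 20.3 | 9.26 | 1.8 | NonProfit | University |
| 2 | 23 | 2 | 8.7 | 11.5 | 1.48 | NonProfit | Hospital |
| 98 | 842 | 191 | 22.68 | 9.3 | 1.79 | NonProfit | Ministry |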

3.2 Statistical Analysis Using Weka Machine Learning Techniques (J48, NB and SVM)

We applied the machine learning techniques in the Weka tool to our dataset to generate several patterns and rules. These data mining techniques were the J48 decision tree, Naïve Bayes (NB) and Support Vector Machine (SVM).
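J48 is Weka's implementation of the C4.5 decision tree algorithm, which chooses split attributes using entropy-based criteria. As a rough illustration only (this is not Weka's code, and C4.5 itself uses the gain ratio rather than the plain information gain shown here), the sketch below computes the information gain of a threshold split on the ComWord attribute for the six sample rows of Table 2. The threshold of 18.0 and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# ComWord values and class labels of the six sample rows in Table 2.
com_word = [20.64, 18.71, 17.73, 20.3, 8.7, 22.68]
labels = ["Profit", "Profit", "NonProfit", "NonProfit", "NonProfit", "NonProfit"]

# The threshold is a hypothetical value chosen for illustration only.
gain = information_gain(com_word, labels, threshold=18.0)

if __name__ == "__main__":
    print(f"Information gain at ComWord <= 18.0: {gain:.4f}")
```

Six rows are far too few to say anything about the real tree; the example only shows the kind of computation a decision tree learner performs when ranking candidate splits.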

3.3 Statistical Analysis Using Weka Deep Learning Technique (DL4jMLpClassifier)

In this research stage, we applied a Weka deep learning technique called DL4jMLpClassifier to our dataset to obtain several patterns and rules.

4 Results and Evaluation

We applied different machine learning techniques to our dataset to generate patterns and rules. These data mining techniques were the J48 decision tree, Naïve Bayes and Support Vector Machine. Each technique was evaluated under four test options: 2-, 5- and 10-fold cross-validation, and a percentage split in which 66% of the dataset was used for training. In addition, we applied a Weka deep learning technique called DL4jMLpClassifier to our dataset in order to measure its classification accuracy.
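The relationship between these test options and the counts reported in the result tables can be sketched as follows. This is an illustrative calculation, not Weka's code: it assumes that the 66% training size is rounded to the nearest integer and that every instance is tested exactly once under k-fold cross-validation. The function names are hypothetical.

```python
def percentage_split_sizes(n_instances: int, train_percent: float = 66.0):
    """Return (train size, test size) for a percentage split (rounding is an assumption)."""
    n_train = round(n_instances * train_percent / 100.0)
    return n_train, n_instances - n_train


def fold_sizes(n_instances: int, k: int):
    """Sizes of the k test folds; across all folds every instance is tested once."""
    base, extra = divmod(n_instances, k)
    return [base + 1 if i < extra else base for i in range(k)]


def accuracy_percent(correct: int, total: int) -> float:
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


if __name__ == "__main__":
    n = 237  # number of rows in the dataset
    print(percentage_split_sizes(n))          # training and test sizes for the 66% case
    print(fold_sizes(n, 10))                  # test fold sizes for 10 folds
    print(round(accuracy_percent(60, 81), 4))    # 66% case: 60 correct test instances
    print(round(accuracy_percent(168, 237), 4))  # 2 folds: 168 correct out of all rows
```

Under these assumptions the 66% case is scored on 81 held-out instances, whereas the fold-based cases are scored on all 237 instances, which is why the correct counts in the 66% rows are much smaller than those in the fold rows.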

4.1 Results When the Class Label is 'ProfitOrNonProfit'

Figure 3 shows the J48 decision tree when the class label values were 'Profit' and 'NonProfit'. In Figure 3, it can be seen that the websites were classified either as Profit or NonProfit according to two metrics, i.e., the percentage of complex words and the average syllables per word.

Tables 3, 4 and 5 show the results for the J48 decision tree, Naïve Bayes, and Support Vector Machine, respectively, when the class label values were Profit and NonProfit.


Figure 3 J48 decision tree (Profit and NonProfit).

Table 3 J48 decision tree classifier (profit and nonprofit).

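| Training Dataset | Correct | Percentage | Profit Precision | NonProfit Precision | Mean error | Profit F-Measure | NonProfit F-Measure |
|---|---|---|---|---|---|---|---|
| 66% training | 60 | 74.0741 | 0.526 | 0.806 | 0.395 | 0.488 | 0.826 |
| 2 folds | 168 | 70.8861 | 0.725 | 0.702 | 0.3778 | 0.592 | 0.774 |
| 5 folds | 162 | 68.3544 | 0.654 | 0.699 | 0.3999 | 0.586 | 0.744 |
| 10 folds | 159 | 67.0886 | 0.641 | 0.686 | 0.4118 | 0.562 | 0.736 |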

As can be seen in Table 3, the best result for J48 was achieved in the 66% training case, which had the highest percentage of correctly classified instances.
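The per-class precision and F-measure columns follow from the counts of true positives, false positives and false negatives for each class. The sketch below shows the standard formulas. The confusion counts are an assumption: the paper does not report a confusion matrix, and these values were chosen only because they are consistent with the 66% training row of Table 3 (81 test instances, 60 of them correct).

```python
def class_scores(tp: int, fp: int, fn: int):
    """Return (precision, recall, F-measure) for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Assumed confusion counts (not reported in the paper), with Profit as the positive class:
# 10 Profit sites predicted Profit, 12 Profit sites predicted NonProfit,
# 9 NonProfit sites predicted Profit, 50 NonProfit sites predicted NonProfit.
profit = class_scores(tp=10, fp=9, fn=12)
non_profit = class_scores(tp=50, fp=12, fn=9)

if __name__ == "__main__":
    print("Profit    precision, F-measure:", round(profit[0], 3), round(profit[2], 3))
    print("NonProfit precision, F-measure:", round(non_profit[0], 3), round(non_profit[2], 3))
```

With these assumed counts, the rounded values match the precision and F-measure entries of that row, which illustrates how a classifier can reach a relatively high overall accuracy while still performing noticeably worse on the Profit class.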

Table 4 Naïve Bayes classifier (profit and nonprofit).

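| Training Dataset | Correct | Percentage | Profit Precision | NonProfit Precision | Mean error | Profit F-Measure | NonProfit F-Measure |
|---|---|---|---|---|---|---|---|
| 66% training | 58 | 71.6049 | 0.476 | 0.8 | 0.3356 | 0.465 | 0.807 |
| 2 folds | 146 | 61.6034 | 0.574 | 0.631 | 0.3997 | 0.435 | 0.709 |
| 5 folds | 142 | 59.9156 | 0.544 | 0.617 | 0.3979 | 0.395 | 0.7 |
| 10 folds | 143 | 60.3376 | 0.552 | 0.62 | 0.397 | 0.405 | 0.703 |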

As can be seen in Table 4, the best result for NB was achieved in the 66% training case, which had the highest percentage of correctly classified instances.

Table 5 Support vector machine classifier (profit and nonprofit).

| Training Dataset | Correct | Percentage | Profit Precision | NonProfit Precision | Mean error | Profit F-Measure | NonProfit F-Measure |
|---|---|---|---|---|---|---|---|
| 66% training | 53 | 65.4321 | 0.412 | 0.83 | 0.3457 | 0.5 | 0.736 |
| 2 folds | 145 | 61.1814 | 0.833 | 0.6 | 0.3882 | 0.179 | 0.746 |
| 5 folds | 142 | 59.9156 | 0.647 | 0.595 | 0.4008 | 0.188 | 0.734 |
| 10 folds | 145 | 61.1814 | 0.786 | 0.601 | 0.3882 | 0.193 | 0.744 |

Table 5 shows that the best result for SVM was achieved in the 66% training case, which had the highest percentage of correctly classified instances. Overall, J48 was shown to be the best classifier according to its outcomes in all cases.

4.2 Results When the Class Label is 'WebsiteType'

Figure 4 in the Appendix shows the J48 decision tree when the values of the class label were University, Hospital, Ministry, and Business. The outcomes in Figure 4 were probably influenced by the number of websites of each type. The outcomes indicate that the most significant attribute that distinguishes business websites from other types of websites is the ProfitOrNonProfit attribute. Figure 4 shows that a website can be classified as a Business website only when the value of the ProfitOrNonProfit attribute is equal to 'Profit'.

Tables 6, 7 and 8 show the outcomes of the J48 decision tree, NB, and SVM, respectively, when the values of the class label were University, Hospital, Ministry, and Business.


Figure 4 J48 decision tree (University, Hospital, Ministry, and Business).

Table 6 J48 decision tree classifier (university, hospital, ministry, and business).

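| Training Dataset | Correct | Correct Percentage | Incorrect | Incorrect Percentage | Mean error | F-Measure |
|---|---|---|---|---|---|---|
| 66% training | 51 | 62.963 | 30 | 37.037 | 0.2161 | 0.58 |
| 2 folds | 174 | 73.4177 | 63 | 26.5823 | 0.1663 | 0.68 |
| 5 folds | 173 | 72.9958 | 64 | 27.0042 | 0.1686 | 0.678 |
| 10 folds | 172 | 72.5738 | 65 | 27.4262 | 0.166 | 0.675 |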

As can be seen in Table 6, the best result for J48 was in the 2 folds case, which had the highest number of instances that were classified correctly.

Table 7 Naïve Bayes classifier (university, hospital, ministry, and business).

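| Training Dataset | Correct | Correct Percentage | Incorrect | Incorrect Percentage | Mean error | F-Measure |
|---|---|---|---|---|---|---|
| 66% training | 51 | 62.963 | 30 | 37.037 | 0.233 | 0.59 |
| 2 folds | 153 | 64.557 | 84 | 35.443 | 0.2181 | 0.646 |
| 5 folds | 155 | 65.4008 | 82 | 34.5992 | 0.2073 | 0.63 |
| 10 folds | 154 | 64.9789 | 83 | 35.0211 | 0.2045 | 0.637 |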

The best result for NB was in the 5 folds case, which had the highest number of instances that were classified correctly, so this is the best choice as can be seen in Table 7.

Table 8 Support vector machine classifier (university, hospital, ministry, and business).

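| Training Dataset | Correct | Correct Percentage | Incorrect | Incorrect Percentage | Mean error | F-Measure |
|---|---|---|---|---|---|---|
| 66% training | 57 | 70.3704 | 24 | 29.6296 | 0.2881 | 0.593 |
| 2 folds | 177 | 74.6835 | 60 | 25.3165 | 0.2802 | 0.656 |
| 5 folds | 177 | 74.6835 | 60 | 25.3165 | 0.2799 | 0.656 |
| 10 folds | 177 | 74.6835 | 60 | 25.3165 | 0.2795 | 0.656 |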

Table 8 shows that the best results for SVM were obtained for 2, 5, 10 folds. Overall, SVM was shown to be the best classifier according to its outcomes in all cases.

4.3 Results When Applying the Weka Deep Learning Technique (DL4jMLpClassifier)

Table 9 shows the accuracy (percentage of correctly classified instances) of the DL4jMLpClassifier when the values of the class label were Profit and NonProfit. We applied this deep learning technique under three test options, i.e., 2, 5 and 10 folds, with networks of 2, 5, 7, and 9 layers. As shown in Table 9, DL4jMLpClassifier did not achieve high prediction accuracy compared with the traditional mining techniques. The most likely reason is that the dataset used was not large, and deep learning requires large datasets to produce good prediction results.

Table 9 DL4jMLpClassifier (Profit and NonProfit).

| Training Dataset | 2 Layers | 5 Layers | 7 Layers | 9 Layers |
|---|---|---|---|---|
| 2 folds | 41.3502 | 57.8059 | 57.8059 | 57.8059 |
| 5 folds | 41.3502 | 57.8059 | 57.8059 | 57.8059 |
| 10 folds | 41.3502 | 57.8059 | 57.8059 | 57.8059 |

5 Conclusions and Future Work

As most cyberattacks target business websites, because these websites keep more important data such as credit card numbers, it is necessary to prioritize defense systems towards attacks from such websites. In this study, we built a classifier that can classify profit and non-profit websites according to textual website metrics for security purposes. Deep neural networks, J48 decision tree, Naïve Bayes, and Support Vector Machine techniques were applied to a website dataset to create classifiers. The results indicated that websites were classified as profit or non-profit according to two primary metrics, i.e., the percentage of complex words and the average number of syllables per word. J48 performed the best in terms of accuracy according to its results in all cases. The new classifiers can assist cyber defense systems in applying the needed security policies and enhance the security and efficiency of the network. Future work includes building a classifier that can detect profit and non-profit websites according to website multimedia features.


References

  1. Gangeshwer, D.K., E-Commerce or Internet Marketing: A Business Review from Indian Context, International Journal of u-and e-Service, Science and Technology, 6, pp.187-194, 2013. DOI: 10.14257/ijunesst.2013.6.6.17
  2. Ebay.com, https://www.ebay.com.au/ (7 Sept 2021).
  3. Svaiko, G., The 10 Most Common Website Security Attacks, https://www.tripwire.com/state-of-security/featured/most-common-website-security-attacks-and-how-to-protect-yourself/, (15 Dec 2021).
  4. Blog, I.S.B., Common Cybersecurity Threats for E-Commerce Businesses, https://www.insureon.com/blog/top-cybersecurity-threats-for-ecommerce-businesses, (15 Dec 2021).
  5. Johnson, N., Why Website Security is Important for Your Business, https://www.inmotionhosting.com/blog/why-website-security-is-important-for-your-business/, (7 Sept 2021).
  6. Babapour, S.M. & Roostaee, M., Web Pages Classification: An Effective Approach Based on Text Mining Techniques, IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2017.
  7. Qazi, A. & Goudar, R.H., An Ontology-based Term Weighting Technique for Web Document Categorization, International Conference on Robotics and Smart Manufacturing, 133, pp. 75-81, 2018.
  8. Sun, A., Lim, E.P. & Ng, W.K., Web Classification Using Support Vector Machine, Proceeding WIDM
  9. Hongjian, G. & Yifei, C., Web Classification Algorithm Using Support Vector Machine and Particle Swarm Optimization, IJACT, 4(17), pp. 514 – 520, 2012.
  10. Chun, Y., Yazhou, L. & Qiong, Q., An Approach for News Web-Pages Content Extraction Using Densitometric Features, Advances in Electric and Electronics Lecture Notes in Electrical Engineering, 155, pp. 135-139, 2012.
  11. Yazdani, M., Eftekhar, M. & Abolhassani, H., Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models, Advances in Knowledge Discovery and Data Mining, 13th Pacific-Asia Conference, pp.780-787, 2009.
  12. Fiol-Roig, G., Miró-Julià, M. & Herraiz, E., Data Mining Techniques for Web Page Classification, Highlights in Practical Applications of Agents and Multiagent Systems Advances in Intelligent and Soft Computing, 89, pp. 61-68, 2011.
  13. Ali, A.H., Hussain, Z.F. & Abd, S.N., Big Data Classification Efficiency Based on Linear Discriminant Analysis, Iraqi Journal for Computer Science and Mathematics, pp. 2788-7421, September, 2020.
  14. Ali, A.H. & Abdullah, M.Z., A Novel Approach for Big Data Classification Based on Hybrid Parallel Dimensionality Reduction Using Spark Cluster, Computer Science, 20(4), December, 2019.
  15. Ali, A.H. & Abdullah, M.Z., A Parallel Grid Optimization of SVM Hyperparameter for Big Data Classification using Spark Radoop, Journal of Modern Science, 6(1), 3, March 2020.
  16. Reviews, W., Readability Test Tool, https://www.webpagefx.com/tools/read-able/, (7 Sept 2021).