1. Home
  2. Archives
  3. Vol 18 (2024) Issue 2
  4. Articles

A Multivariate Fuzzy Weighted K-Modes Algorithm with Probabilistic Distance for Categorical Data

Abstract

Data clustering is a data mining approach that assigns similar data to the same group. Traditionally, cluster similarity considers all attributes equally, but in real-world applications, some attributes may be more important than others. Therefore, this study proposes an algorithm that utilizes multivariate fuzzy weighting to demonstrate the varying importance of each attribute, using a Gini impurity measure for weight assignment. Additionally, the proposed algorithm implements probabilistic distance to reduce sensitivity to noise. Probabilistic distance offers more detailed information and better interpretation than Hamming distance, which ignores attribute positions. Probabilistic distance utilizes information about the attribute’s position within and between clusters. This enhances clustering performance by creating clusters with more similar attributes. Therefore, the proposed Multivariate Fuzzy Weighted K-Modes with Probabilistic Distance for Categorical Data (MFWKM-PD) algorithm, based on the multivariate fuzzy K-modes algorithm, not only considers detailed membership calculations but also considers the varying contributions of attributes and their positions in distance calculation. This study evaluated the proposed MFWKM-PD using several benchmark datasets. The experiments validated that the proposed MFWKM-PD shows promising results compared to other algorithms in terms of accuracy, NMI, and ARI.

Keywords

1 Introduction

Nowadays, the amount of digital data is growing exponentially and becoming more complex both on variety and sources. Therefore, the question of "how to effectively process this abundance of data to derive meaningful insight that can support decision makers" has become more crucial than in previous times when traditional data analysis methods were still effective. Data mining is an approach to explore and extract important information from data. It uses a variety of techniques, such as clustering and classification. Classification is a supervised learning method where the data is given predetermined labels. In contrast, clustering groups similar data points together without using labels and is called an unsupervised learning technique. Clustering works without prior knowledge of specific labels to help make better decisions in many areas related to data segmentation and grouping. For example, customer segmentation on purchasing behavior, in which customers are grouped together according to similarities in their buying patterns. In molecular biology, cluster analysis can help scientists find the relationships between genes from large amounts of genomic data. Examples can also be found in other areas, such as medicine, biology, ecommerce [1-3], and so on.

Data clustering generates clusters according to the similarity of the data. Therefore, distance measurement is a critical issue in data clustering. The distance measurement used for calculating data similarity (or dissimilarity) should be defined based on the data type. The existing distance measurement, such as Hamming distance, has difficulty capturing the relation between the attributes since this metric only considers the dissimilarities between objects or objects with their centroids. Other distance measurements can be used, such as probabilistic distance based on kernel density estimation [4]. As opposed to the distance from the object to the centroid, probabilistic distance is specified as the distance from the object to the cluster.

Moreover, clustering methods can be divided into two categories based on how the clusters are produced, i.e., hierarchical and partitional clustering. Hierarchical clustering allows clusters to have sub-clusters, while partitional clustering only assigns each unlabeled object to one cluster [5]. Furthermore, in partitional clustering, clustering methods can be divided into hard and soft clustering, which differ in membership weight. Hard clustering, such as K-means [6] or K-modes [7], assigns each instance to a single cluster. In contrast, soft or fuzzy clustering, like in the fuzzy c-means algorithm (FCM) [8] and the fuzzy K-modes algorithm (FKM) [9], assigns each instance to multiple clusters with different membership values. Unlike FCM, FKM was developed to cluster the categorical data. However, there are some drawbacks when the number of values of each attribute increases. For instance, they treat all attributes as equally important, while in a real application, the contribution of the attributes can be different. The multivariate fuzzy K-modes algorithm (MFKM) is an algorithm that can handle differences between the values of each attribute of different clusters [10].

MFKM is an algorithm constructed by using the multivariate fuzzy c-means algorithm (MFCM) [11]. Both take a multivariate approach to determining the degree of membership. Certain methodologies employ a weighting technique. As an illustration, the weighted multivariate membership fuzzy c-means (WMFCM-M) algorithm and the multivariate membership integrated with weighted distances by cluster and variable (WMFCM-D) algorithm use the distance

parameter's weight to measure the variability dispersion within clusters, thereby facilitating the identification of cluster shapes [12]. Additional research has explored the utilization of impurity metrics, including entropy and Gini impurity, to enhance clustering performance [13-15]. Moreover, Kim [14] proposed a method for assigning weights that considers the attribute distributions within and between clusters. As a result, each attribute has a different weight and contribution that impacts the separation of objects into different clusters.

This study intended to develop a clustering algorithm that addresses the limitations of the above-mentioned algorithms by combining a weighting method and probabilistic distance, namely the multivariate fuzzy weighted K-modes algorithm with probabilistic distance for categorical data (MFWKM-PD). Three validity indexes—accuracy, normalized mutual information, and modified Rand index—were used to measure the performance of MFWKM-PD. This study implemented the proposed algorithm on several benchmark datasets. The results were compared with some clustering algorithms.

The remaining of this paper is organized as follows. Section 2 presents a brief review of the MFKM algorithm. Section 3 discusses the proposed MFWKM-PD algorithm. Section 4 presents the experimental results. Finally, the concluding remarks are given in Section 5.

2 Multivariate Fuzzy K-Modes Algorithm (MFKM)

The most popular fuzzy clustering algorithm to cluster categorical data is called the FKM algorithm; nevertheless, it reflects the same membership degrees for every attribute [9]. On the other hand, assigning different values for each attribute in different clusters is possible. Therefore, the FKM algorithm may be combined with the multivariate approach called multivariate fuzzy K-modes (MFKM) [10]. This algorithm was adopted from the multivariate fuzzy c-means algorithm for numerical data (MFCM) [12]. MFKM and MFCM are based on the same idea: to find a multivariate fuzzy partition and form multiple membership matrices.

MFKM has three important characteristics: (1) being able to determine each object's significance for a certain group based on each attribute; (2) being able to extract more information from data, which provides higher clustering quality; and (3) being a multivariate data analysis tool. Consequently, it produces different membership degrees, allowing it to handle the ambiguity of the data precisely [10].

In general, the MFKM algorithm has to represent the different memberships for each attribute in the different clusters and calculate the distance between clusters and centroids. The illustration of the membership distribution can be seen in

Table 1, where there are five records (i) with two attributes (l) and three clusters (j).

Ü
V1V2j =j = 1 \(j = 1\) \(l = 2\) l = 1= 2j == 3T-4-1
ΛIAZl = 1l = 2l = 1l = 2l = 1l = 2Total
i = 1120.080.220.200.230.030.241.00
i = 2110.370.300.030.110.150.041.00
i = 3450.010.020.100.110.170.591.00
i = 4980.320.090.090.320.020.161.00
i = 5870.070.130.590.160.010.041.00
1611941.455.00

Table 1 Membership distribution example of MFKM algorithm.

Given X, a set of n categorical objects. Each object \(x_i\) is defined as a set of m categorical attributes so that \(x_i = \{x_{i1}, x_{i2}, ..., x_{im}\}\). MFKM partitions X into k clusters denoted as \(C_1, C_2, ..., C_k\) by minimizing the objective function in Eq. (1).

\[J^{M} = \sum_{i=1}^{k} \sum_{i=1}^{n} \sum_{l=1}^{m} u_{ili}^{\alpha} d(x_{il}, z_{il})\] (1)

with subject to constraints in Eqs. (2) to (4):

\[0 \le u_{ili} \le 1, \quad \forall j = 1, \dots, k; i = 1, \dots, n; l = 1, \dots, m,\] (2)

\[\sum_{i=1}^{k} \sum_{l=1}^{m} u_{il} = 1, \quad \forall i = 1, ..., n,\] (3)

\[0 < \sum_{i=1}^{n} \sum_{l=1}^{m} u_{ili} < n, \quad \forall j = 1, ..., k,\] (4)

where \(\alpha\) is a fuzziness component, and k is a predetermined number of clusters, while \(U = [u_i]\) with (i=1, 2, ..., n) is a multivariate fuzzy partition. U contains n multivariate membership matrix \(u_i = [u_{jli}]\) where \(u_{jli}\) is the membership degree of the object i to cluster j on attribute l. \(z_{jl}\) is the centroid of cluster j on attribute l. \(d(x_{il}, z_{jl})\) is the distance between \(x_{il}\) and its responding centroid \(z_{jl}\). The matching distance measure is represented in Eq. (5) as follows:

\[d(x_{il}, z_{jl}) = \begin{cases} 0, & \text{if } x_{il} = z_{jl} \\ 1, & \text{if } x_{il} \neq z_{jl} \end{cases}\] (5)

where \(z_{il}\) is the \(j^{th}\) element of \(Z_i\) and \(x_i\) is the \(i^{th}\) point of \(X_i\).

The MFKM clustering algorithm is described as follows:

1. Initialization

(Fix c, \[2 \le k \le n\]; fix m, \(1 \le \alpha \le \infty\); fix T; and fix \(\varepsilon > 0\))

Randomly initialize \(u_{jli}(j=1,...,k; l=1,...,m; i=1,...,n)\) of pattern x belonging to cluster \(C_i\) on attributes l such that \(u_{jli} \in [0,1]\) and \(\sum_{j=1}^k \sum_{l=1}^m u_{jli} = 1\). Do t=1.

2. Representation

(Fix membership \(u_{jli}\) of pattern x(x = 1, ..., n) belonging to class \(C_i\) on the attributes l(l = 1, ..., m)). Compute centroid \(z_j\) of cluster \(C_j\) (i = 1, ..., k) using Eq. (6):

\[r^* = \underset{1 \le r \le p_i}{arg \, max} \left\{ \sum_{l:x_{jl=a_l^r}} u_{jli}^{\alpha} \right\}. \tag{6}\]

3. Feature Membership

(Fix centroid \(z_j\) of cluster \(C_j\) (i = 1, ..., k)). Update the fuzzy membership degree, \(u_{ili}\), using Eq. (7):

\[u_{jli} = \left[ \sum_{p=1}^{k} \sum_{q=1}^{m} \left( \frac{d(x_{il}, z_{jl})}{d(x_{iq}, z_{pq})} \right)^{\frac{1}{\alpha - 1}} \right]^{-1}.\] (7)

4. Stopping Criterion

If \(\|J_M^{t+1} - J_M^t\| \le \varepsilon\) or t > T, go to Step (5); otherwise, update t = t + 1 and go Step 2.

5. Class Membership

(Fix centroid \(z_j\) and membership \(u_{jli} = (j = 1, ..., k), (l = 1, ..., m), (i = 1, ..., n)\)). Compute the fuzzy membership degree of object x belonging to cluster \(C_j\), \((\gamma_{ji})\) using Eq. (8):

\[\gamma_{ii} = \sum_{l=1}^{m} u_{ili}. \tag{8}\]

3 Proposed Algorithm

In real-world applications, the attributes have different contributions; therefore, the importance of the attributes can be different. Using impurity metrics such as entropy and Gini impurity, each attribute can be weighed during the clustering process by considering the within-cluster and between-cluster relationship [14]. Moreover, the proposed algorithm uses probabilistic distance, which differs from possibilistic distance, instead of Hamming distance [4] to reduce the noise sensitivity.

Thus, the MFKM algorithm, which integrates the weighting method and probabilistic distance, becomes the multivariate fuzzy weighted K-modes algorithm with probabilistic distance (MFWKM-PD).

In Eq. (9), the objective function of the proposed MFWKM-PD is defined to find U, Z, and W to minimize F(U, Z, W) as follows:

\[F(U,Z,W) = \sum_{i=1}^{k} \sum_{l=1}^{n} \sum_{l=1}^{m} u_{ili}^{\alpha} w_{l} d(x_{il}, z_{il})\] (9)

Give X, a set of n categorical objects. Each object \(x_i\) is described by a set of m categorical attributes so that \(x_i = \{x_{i1}, x_{i2}, ..., x_{im}\}\), k is a predefined number of clusters denoted as \(C_1, C_2, ..., C_k\), \(\alpha\) is a fuzziness component, U as a multivariate fuzzy partition contains n multivariate membership matrix \(u_i = [u_{jli}]\) where \(u_{jli}\) is the membership degree of the object i to cluster j on attribute l, Z represents cluster centers, and \(w_l\) is the weight of attribute l.

As \(w_l\) is a weight of attribute l, consider \(w_{l,w}\) is a weight of attribute l based on within-cluster information as well as \(w_{l,b}\) as a weight of attribute l based on between-cluster information. Thus, in Eq. (10), the weight of attribute l (\(w_l\)) is defined as follows:

\[w_l = aw_{l,w} + (1 - a)w_{l,b}. (10)\]

The adjusted parameter between two weight components \(w_{l,w}\) and \(w_{l,b}\) is defined as a, where \(a(0 \le a \le 1)\). Since the probabilistic distance on the proposed algorithm uses Gini Impurity, the final weights of \(w_{l,w}\) and \(w_{l,b}\) are calculated based on Eqs. (11) and (12):

\[w_{l,w} = \frac{\sum_{j=1}^{k} \delta_j (1 - I_w(f_{lj}))}{\sum_{l=1}^{m} \sum_{j=1}^{k} \delta_j (1 - I_w(f_{lj}))},\] (11)

\[l_{l,b} = \frac{1 - l_b(g_l^r)}{\sum_{l=1}^m 1 - l_b(g_l^r)}\] (12)

where \(\sum_{l=1}^{m} w_{l,w} = 1\), For categorical attributes, \(\delta_j\) is a weight that is proportional to the number of data samples that are part of the j-th cluster. It can be defined as \(\delta_j = \frac{n_j}{n}\), where \(n_j\) is the number of objects in cluster \(C_j\). Furthermore, the Gini impurities of each category attribute, which are based on the distribution of categories within the clusters, are presented in Eq. (13) as \(I_w(f_{lj})\):

\[I_w(f_{lj}) = 1 - \sum_{r=1}^{h_l} (f_{lj}^r)^2\] (13)

where \(h_l\) is the set of categories in attribute l. The Gini impurity, \(I_b(g_l^r)\), is a distribution of categories across clusters. It is defined in Eq. (14):

\[I_b(g_l^r) = 1 - \sum_{j=1}^k (g_{lj}^r)^2, \tag{14}\]

Regarding the distribution of categories, the distribution of categories of attribute l in cluster j can be considered as \(f_{lj}\) and \(f_{lj}^r\) as the frequency ratio of category r of attribute l in cluster j. Moreover, the distribution of attribute l across clusters is considered as \(g_l^r\) and \(g_{lj}^r\) as the frequency ratio of objects whose category value of the attribute l is \(a_l^r\).

The fuzzy weighting method is used to compute \(f_{lj}^r\) and \(g_{lj}^r\) with membership degree u. Suppose the probability \(x_{il} = a_l^r\) in the j-th cluster, and the probability of the object with \(x_{il} = a_l^r\) also belong to the j-th cluster, then \(f_{lj}^r\) and \(g_{lj}^r\) are defined in Eqs. (15) and (16):

\[f_{lj}^{r} = \frac{\sum_{x_{il} = a_{l}^{r} u_{ji}}}{\sum_{i=1}^{n} u_{ii}}\] (15)

and

\[\text{[rumus tidak dapat ditampilkan dengan baik — lihat PDF asli]}\] (16)

The proposed MFWKM-PD algorithm employs probabilistic distance instead of Hamming distance [4]. Rather than calculating the object-to-centroid distance, this method computes the object-to-cluster distance. Therefore, distance \(d(x_{il}, z_{jl})\) in Eq. (9) becomes \(d(x_{il}, C_j)\) which represents the distance of \(x_i\) to cluster j on attribute l as defined in Eq. (17):

\[d(x_{il}, Z_j) = \sum_{t \in h_l} [p(x_l = t | x_{il}) - p(x_l = t | C_j)]^2\] (17)

where the \(h_l\) is the set of categories. For instance, the attribute d takes \(|h_l|\) discrete values. In the set, an arbitrary category is indicated by \(t \in h_l\), where \(l \in [1, |h_l|]\). Since all the categories in \(h_l\) are assumed to be independent of one another, Eq. (18) is used to estimate the probability \(p(x_l = t|x_{il})\):

\[p(x_l = t | x_{il}) = I(t = x_{il})\] (18)

where \(I(t = x_{il}) = 0\) if \(x_{il} \neq t\), and otherwise, \(I(t = x_{il}) = 0\)

The kernel functions \(p(x_l = t | C_j)\) determine the probability density of \(x_l\), where \(t \in h_l\). It is defined in Eq. (19):

\[p(x_l = t | C_j) = \beta_j \frac{1}{|h_l|} + (1 - \beta_j) f_j(t)\] (19)

The smoothing parameter, \(\beta_j\), is called the bandwidth, and the frequency estimator of t in cluster \(C_j\) is formulated in Eq. (20):

\[f_j(t) = \frac{1}{n_j} \sum_{x_i \in j} I(t = x_{il})\] (20)

For categorical data, the optimal bandwidth of \(\beta_j\) lies within [0, 1] and can be calculated using Eq. (21):

\[\beta_j = \frac{\sum_{l=1}^{m} s_{jl}^2}{(n_j - 1) \sum_{l=1}^{m} {|\frac{|h_l| - 1}{|h_l|}} - s_{jl}^2)}\] (21)

The number of objects in cluster j is represented as \(n_j\). The sample dispersion of categorical attributes is measured by the Gini diversity index \(S_{jl}^2\), which is provided in Eq. (22):

\[S_{il}^2 = 1 - \sum_{t \in h_l} [f_i(t)]^2 \tag{22}\]

Finally, the probabilistic distance is defined in Eq. (23):

\[d(x_{il}, C_j) = \sum_{t \in h_l} \left[ I(t = x_{il}) - \beta_j \frac{1}{|h_j|} - (1 - \beta_j) f_j(t) \right]^2\] (23)

Therefore, to minimize F(U, Z, W), the objective function of the proposed MFWKM-PD algorithm is presented in Eq. (24)"

\[F(U,Z,W) = \sum_{i=1}^{k} \sum_{l=1}^{n} \sum_{l=1}^{m} u_{ili}^{\alpha} w_l d(x_{il}, C_i)\] (24)

3.1 Updating Rules

Based on Eqs. (6-7), the cluster centroid \(z_{jl}\) of cluster j on attribute l and membership degree \(u_{jli}\) are updated using Eqs. (25) and (26) respectively:

\[z_{jl} = a_l^{(r)}, \ r = arg_{1 \le r \le |h_l|} max \left\{ \sum_{v: x_{vl} = a_l^r} u_{jvl}^m \right\}\] (25)

and

\[u_{jli} = \left[ \sum_{h=1}^{k} \sum_{s=1}^{m} \left( \frac{d(x_{il}, C_j)}{d(x_{is}, C_h)} \right)^{\frac{1}{\alpha - 1}} \right]^{-1}\] (26)

The object-to-cluster distance \(d(x_{il}, C_j)\) is calculated using Eq. (21); the weight attribute \(l(l_l)\) is calculated based on a fuzzy weighting method, \(f_{lj}^r\) and \(g_{lj}^r\), using Eq. (10).

The proposed MFWKM-PD follows the following procedures:

Step 1: Initialization – Initialize multivariate fuzzy partition \(U^1\) which contains n membership matrix \(u_i = \left[u_{jli}\right]\), (j=1,...,k; i=1,...,n; l=1,...,m) to satisfy the constraint. Generate the attribute weight \(W^1\) with \(w_l = 1/m\) for all attributes. Identify the centroid \(Z^1\) such that cost \(F(U^1,Z^1,W^1)\) is minimized. Set iteration t=1.

For t = 1 to max iteration

Step 2: Fix \(Z^t\) and \(W^t\) and update \(U^{t+1}\). If \(F(U^{t+1}, Z^t, W^t) = F(U^t, Z^t, W^t)\), then stop; Else go to Step 3

Step 3: Fix \(W^t\) and \(U^{t+1}\) and update \(Z^{t+1}\)If \(F(U^{t+1}, Z^{t+1}, W^t) = F(U^{t+1}, Z^t, W^t)\), then stop Else go to Step 4

Step 4: Fix \(U^{t+1}\) and \(Z^{t+1}\) and update \(W^{t+1}\). If \(F(U^{t+1}, Z^{t+1}, W^{t+1}) = F(U^{t+1}, Z^{t+1}, W^t)\) or iteration \(t = \max\) iteration, then stop.

Else return to Step 2

End for

4 Experimental Results

This study evaluated the proposed MFWKM-PD algorithm using three benchmark datasets from different fields, i.e., molecular biology, balance, and lymphography. Table 2 shows the details of the datasets, which have different numbers of instances (N), categorical attributes \((d_c)\), and clusters (k) and were retrieved from the UCI machine learning repository.

Table 2 Benchmark datasets.

DatasetsN\(d_c\)k
Molecular Biology [16]106572
Balance [17]625183
Lymphography [18]148184

The molecular biology dataset (E. Coli promoter gene sequences) has 57 categorical attributes and each attribute has four levels of categories; the balance dataset has 18 categorical attributes and each attribute has five levels of categories. Moreover, the lymphography dataset, even though it has a similar number of attributes to the balance dataset, has various levels of categories. There are ten attributes comprising two levels, two attributes comprising three levels, four attributes comprising four levels, and two attributes comprising eight levels.

Accuracy (AC), normalized mutual information (NMI), and modified Rand index (ARI) were used to measure the performance of MFWKM-PD. The AC metric is defined in Eq. (29), where \(a_i\) represents the correctness of cluster assignment and n is the total number of objects, \(x_i\).

\[AC = \frac{\sum_{i=1}^{n} a_i}{n}.\] (29)

The NMI metric is the normalized version of mutual information (MI). Its value is in the interval [0;1], where 1 indicates the perfect labeling between the clustering result and the class label. NMI is given by Eq. (30):

\[NMI = \frac{I(X,Y)}{\sqrt{H(X)H(Y)}} \tag{30}\]

The entropies of X and Y are represented by H(X) and H(Y), respectively, and I(X, Y) represents the mutual information between X and Y. Note that the attributes X and Y are random.

The ARI metric considers all cluster pairwise combinations and contains values in the interval [-1;1], where 1 indicates that the clusters are identical [19]. ARI is defined in Eq. (31):

\[ARI = \frac{RI - expected(RI)}{max(RI) - expected(RI)},\tag{31}\] where the Rand index (RI) measures the similarity between two cluster results by considering the number of pairs of elements belonging to the same or different clusters.

4.1 Results

In this experiment, the parameter settings for all algorithms were set as follows:

  • 1. The fuzziness component \(\alpha\) was 1.1 instead of 2 so the experiment gained better performance. Some papers performed better with a smaller \(\alpha\) value [9,14].
  • 2. The initial weight \(w_1\) was initialized as 1/m for each attribute l.
  • 3. The optimal value of the balancing parameter a between two weight components for most cases was close to 0 or 1. Therefore, the a value was set as 0.1 [14].

This study compared the performance of MFWKM-PD algorithm with several clustering algorithms, i.e., fuzzy K-modes (FKM), fuzzy weighted K-modes (FWKM), fuzzy K-modes with probabilistic distance (FKM-PD), fuzzy weighted K-modes with probabilistic distance (FWKM-PD), multivariate fuzzy K-modes (MFKM), multivariate fuzzy weighted K-modes (MFWKM), multivariate fuzzy K-modes with probabilistic distance (MFKM-PD), and multivariate fuzzy K-modes with probabilistic distance (MFWKM-PD).

Each algorithm was run 30 times with the same initial centroids for all algorithms and the average performance was calculated based on the average values. Therefore, the proposed algorithm had slightly modified steps regarding this experiment, using the random initial centroids instead of random multivariate membership. Tables 3, 4, and 5 present a summary of the computational results in terms of accuracy, NMI, and ARI, respectively. The standard deviation and average clustering accuracy for all algorithms are displayed in Table 3.

Table 3Average of accuracy(AC) andd standard deviation (SE)).
DatasetsIndorEIZMEWEMFKM- FWKM-MEZMMEWIZMMFKM- MFWKM-
DatasetsmuexL WIL AA KIAIPDPDMILKMIVIT VV KIVIPDPD
MolecularAC55.03156.82458.49160.37758.74260.12660.91261.509
BiologySD3.4713.6974.4395.9204.7655.9977.5268.573
BalanceAC54.35254.35254.46954.78454.20854.38955.10955.163
DaianceSD4.2354.2354.4934.6525.3204.1284.4905.438
LymphoAC65.69865.81165.45068.31163.94165.76655.85655.901
graphySD4.9425.0276.5146.7065.9376.5780.6480.613

The proposed algorithms won over the other algorithms for the molecular biology and balance datasets. This was because those algorithms have the same level of categories. The proposed algorithm also outperformed the other algorithms in terms of the average of NMI and ARI scores for two out of three datasets. However, the standard deviation of accuracy, NMI, and ARI scores obtained by the MFWKM-PD algorithm was not always better than that of the other algorithms. In general, FKM-based with weight and probabilistic distance (PD) algorithms had better results than FKM algorithms for all datasets. This proves that weight and probabilistic distance can improve FKM performance.

Average of NMI and standard deviation (SD). FKM-FWKM-Index FKM FWKM MFKM MFWKM PD PD PD

0.049

0.030

0.029

0.028

MFKM- MFWKM-PD NMI 0.011 0.018 0.029 0.046 0.031 0.044 0.090 0.099

SD0.0230.0230.0270.0290.0290.0230.0240.031
NMI0.1270.1260.1300.1570.1250.1280.0910.094
SD0.0550.0540.0640.0650.0480.0600.0430.040
۰

0.027

0.028

0.046

0.028

0.079

0.030

0.086

0.032

Table 5 Average of ARI and standard deviation (SD)

DatasetsIndexFKMFWKMFKM-
PD
FWKM-
PD
MFKMMFWKMMFKM-
PD
MFWKM-
PD
MolecularADI0.0050.0150.0280.0480.0300.0460.0640.076
BiologySD0.0030.0230.0460.0300.0400.0040.070
Diology0.0100.0210.0330.0300.0340.0350.0700.036
BalanceSD0.0260.0200.0230.0310.0270.0200.0320.030
LymphoARI0.0830.0840.0880.1290.0780.0880.0400.042
graphySD0.0500.0520.0640.0760.0560.0630.0200.018

4.2 Statistical Test

Table 4

SD 0.012 0.016

NMI 0.028 0.028

Datasets

Molecular

Biology

Balance

Lympho graphy

This study also conducted a statistical test to analyze how significantly the proposed algorithm outperformed other algorithms. The null hypothesis for the statistical test was "the proposed algorithm did not perform significantly differently from the other algorithms". The significant level was set as 95% between the algorithms, which implied that a was set as 5%. Table 6 shows the p-value for all datasets in terms of AC, ARI, and NMI. The result shows that two of three of the datasets had a p-value less than 0.05. Therefore, the null hypothesis was rejected. This means that the proposed algorithm had significantly different results from the other algorithms for the molecular biology and lymphography datasets. Furthermore, the Bonferroni adjustment was implemented to make a pairwise comparison between the algorithms to show how they were grouped.

Table 6 The statistic results of all datasets (p-value).

DatasetsACARINMI
Molecular Biology0.0000.0000.000
Balance0.9750.9880.999
Lymphography0.0000.0000.000

The results of the Bonferroni pairwise comparisons can be seen in Tables 7, 8, and 9. Three datasets were used to examine each group in terms of AC, ARI, and NMI. For the Balance dataset, MFWKM-PD performed slightly better than the other algorithms for all performance metrics. Still, in the molecular biology dataset, MFWKM-PD had a significantly better performance compared to the FKM-based algorithm, except for the FKM algorithms with probabilistic distance and weight. As shown in Table 7, the FKM-based algorithms, especially FWKM-PD, had a significantly better result on the lymphography data, but the proposed algorithm did not perform well.

Table 7 Bonferroni pairwise comparison for AC.

AlgorithmNDataset: Molecular
Biology
Datasett: BalanceDataset:
Lymphography
MeanGroupingMeanGroupingMeanGrGrouping
MFWKM-PD3061.5094A55.1627A55.9009С
MFKM-PD3060.9119A55.1093A55.8559C
FWKM-PD3060.3774A54.7840A68.3108A
MFWKM3060.1258A54.3893A65.7658AВ
MFKM3058.7421AВ54.2080A63.9414В
FKM-PD3058.4906AВ54.4693A65.4505AВ
FWKM3056.8239AВ54.3520A65.8108AВ
FKM3055.0314В54.3520A65.6982AВ

Moreover, in terms of ARI and NMI, as shown in Tables 8 and 9, the MFWKM-PD algorithm performed significantly different from the other algorithms on the molecular biology dataset. In contrast, the FKM-based algorithms had lower performances. With the Balance dataset, all algorithms were only slightly different; therefore, there were no significant differences among their performances. On the other hand, with the lymphography dataset, even though the proposed algorithms had the lowest ARI value, FKM with weight and probabilistic distance had the best performance.

AlgorithmNDataset: Molecu
Biology
Dataset:Dataset:
Lymphography
8.MeanGroupingMeanGroupingMeanGrGrouping
MFWKM-PD300.0762132A0.0359719A0.041697С
MFKM-PD300.0637444ΑВ0.0320827A0.040150C
FWKM-PD300.0479689ΑВC0.0310023A0.129494Α
MFWKM300.0460753ΑВC0.0262616A0.087775AВ
MFKM300.0304458ВC0.0289763A0.078398ВC
FKM-PD300.0276115ВC0.0285140A0.088116AВ
FWKM300.0146631C0.0282712A0.083588ВC
FKM300.0054138C0.0282712A0.082640ВC

Table 8 Bonferroni pairwise comparison for ARI.

Table 9 Bonferroni pairwise comparison for NMI.

AlgorithmNMolecular
logy
Dataset:BalanceDataset:
Lymphography
9.MeanGroupingMeanGroupingMeanGroupping
MFWKM-PD300.0992461A0.0315774A0.093781В
MFKM-PD300.0902149A0.0297288A0.090752В
FWKM-PD300.0461485В0.0299344A0.157197A
MFWKM300.0438051В0.0282079A0.128155AВ
MFKM300.0309028В0.0281922A0.124625AВ
FKM-PD300.0294507В0.0284125A0.130448AВ
FWKM300.0178571В0.0277592A0.126420AВ
FKM300.0108815В0.0277592A0.127219AВ

4.3 Computational Time

Table 10 presents the average computational time for all algorithms for each iteration. It reveals that the proposed MFWKM-PD algorithm needed more computational time for almost all datasets, except the lymphography dataset, where the MFKM algorithm with weight and probabilistic distance converged very fast..

Table 10 Computational time (in seconds).

DatasetsFKMFWKMFKM-
PD
FWKM-
PD
MFKMMFWKMMFKM-
PD
MFWKM-
PD
Molecular Biology0.3720.1240.1690.8742.4535.8744.8707.101
Balance0.0440.0420.2810.4513.6282.1652.1843.361
Lymphography0.1670.1630.1340.5678.2223.5213.2003.195

The molecular biology dataset having 57 categories compared to other datasets, which only have 18 categories, was the reason that the proposed algorithm needed more computational time. The impact of the number of attributes led to more computational time for the MFKM-based algorithm than the number of clusters.

Although the proposed MFWKM-PD did not have the lowest computational time, it could provide better performance in terms of AC, NMI, and ARI for the molecular biology and balance datasets

5 Conclusions

The MFWKM-PD algorithm, which is based on a multivariate approach, uses a Gini impurity weight that considers the distribution of attributes based on information from within and between clusters. This proposed algorithm also adopts probabilistic distance instead of Hamming distance to reduce the noise sensitivity. Evaluation results from three datasets showed that MFWKM-PD had better accuracy, ARI, and NMI performance, particularly with the molecular biology dataset. The proposed algorithm works well on datasets with the same category level regarding the number of categorical attributes and clusters; even compared to FKM-based algorithms with weight and probabilistic distance, it can perform well. This makes it useful for applications in molecular biology, such as identifying gene groups with similar functions or for clustering patients by molecular profiles.

However, the centroids being initialized randomly can lead to unstable results, as indicated by higher standard deviations. To address this, the initial centroids can be optimized using metaheuristic approaches, such as a genetic algorithm or a particle swarm optimization algorithm. Additionally, future studies can focus on determining the optimal number of clusters, which may be used to address the unsupervised problem in real-world applications.

Acknowledgment

The current study was financially supported by the National Science and Technology Council of the Taiwanese government under contract number: MOST 111-2221-E-011-076-MY3. It is appreciated.

Research Intelligence

Data from OpenAlex ↗

Metrics

1
Citations
0.31
FWCIfield-weighted
66th
Percentilevs same year + field
Article
Work type
Open Access

Citation Trend

Citation Timeline

YearCitations
20251

Institution Network

References

  1. Aljobouri, H.K., Jaber, H.A., Koçak, O.M., Algin, O. & Çankaya, I., Clustering fMRI Data with a Robust Unsupervised Learning Algorithm for Neuroscience Data Mining, J Neurosci Meth, 299, pp. 45-54, Apr. 1, 2018. DOI: 10.1016/j.jneumeth.2018.02.007. DOI: 10.1016/j.jneumeth.2018.02.007
  2. Bushel, P.R., Wolfinger, R.D. & Gibson, G., Simultaneous Clustering of Gene Expression Data with Clinical Chemistry and Pathological Evaluations Reveals Phenotypic Prototypes, Bmc Syst Biol, 1, Feb. 23, 2007. DOI: 10.1186/1752-0509-1-15. DOI: 10.1186/1752-0509-1-15
  3. Kuo, R.J., Potti, Y. & Zulvia, F.E., Application of Metaheuristic based Fuzzy K-modes Algorithm to Supplier Clustering, Comput. Ind. Eng., 120, pp. 298-307, 2018. DOI: 10.1016/j.cie.2018.04.050
  4. Chen, L.F., Wang, S.R., Wang, K.J. & Zhu, J.P., Soft Subspace Clustering of Categorical Data with Probabilistic Distance, PATTERN RECOGNITION, 51, pp. 322-332, MAR. 2016. DOI: 10.1016/j.patcog.2015.09.027. DOI: 10.1016/j.patcog.2015.09.027
  5. Jain, A.K., Murty, M.N. & Flynn, P.J., Data Clustering: A Review, ACM Comput. Surv., 31, pp. 264-323, 1999.
  6. MacQueen, J., Some Methods for Classification and Analysis of Multivariate Observations, 1967.
  7. Huang, Z., Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and knowledge discovery, 2(3), pp. 283-304, 1998. DOI: 10.1023/a:1009769707641
  8. Bezdek, J.C., Ehrlich, R. & Full, W.E., FCM: The Fuzzy C-means Clustering Algorithm, Computers & Geosciences, 10, pp. 191-203, 1984. DOI: 10.1016/0098-3004(84)90020-7
  9. Huang, J.Z. & Ng, M.K., A Fuzzy K-modes Algorithm for Clustering Categorical Data, IEEE Trans. Fuzzy Syst., 7, pp. 446-452, 1999. DOI: 10.1109/91.784206
  10. Maciel, D.B.M., Amaral, G.J.A., Souza, R. & Pimentel, B.A., Multivariate Fuzzy K-modes Algorithm, Pattern Analysis and Applications, 20, pp. 59-71, 2017. DOI: 10.1007/s10044-015-0465-3
  11. Pimentel, B.A. & Souza, R.M.C.R.D., A Multivariate Fuzzy C-means Method, Appl. Soft Comput., 13, pp. 1592-1607, 2013.
  12. Pimentel, B.A. & de Souza, R.M., Multivariate Fuzzy C-means Algorithms with Weighting, Neurocomputing, 174, pp. 946-965, 2016
  13. Cao, F., Liang, J., Li, D. & Zhao, X., A Weighting K-modes Algorithm for Subspace Clustering of Categorical Data, Neurocomputing, 108, pp. 23–30, 05/01 2013. DOI: 10.1016/j.neucom.2012.11.009. DOI: 10.1016/j.neucom.2012.11.009
  14. Kim, K., A Weighted K-modes Clustering using New Weighting Method based on Within-cluster and Between-cluster Impurity Measures, J. Intell. Fuzzy Syst., 32, pp. 979-990, 2017.
  15. Sen, P., Gini Diversity Index, Hamming Distance, and Curse of Dimensionality, Metron - International Journal of Statistics, LXIII, pp. 329-349, 02/01 2005.
  16. Harley, C., Reynolds, R. & Noordewier, M. Molecular Biology (Promoter Gene Sequences), UCI Machine Learning Repository, DOI: 10.24432/C5S01D. DOI: 10.24432/c5s01d
  17. Siegler, R. Balance Scale, UCI Machine Learning Repository, DOI: 10.24432/C5488X. DOI: 10.24432/c5488x
  18. Zwitter, M.A.S.M., Lymphography, UCI Machine Learning Repository, DOI: 10.24432/C54598. DOI: 10.24432/c54598
  19. Hubert, L.J. & Arabie, P., Comparing Partitions, Journal of Classification, 2, pp. 193-218, 1985. DOI: 10.1007/bf01908075