Clustering algorithms in data mining | Article in the journal «Молодой ученый»


Author: У. М. Негматов

Section: Information Technology

Published in Молодой учёный No. 4 (242), January 2019

Publication date: 29.01.2019


Bibliographic description:

Негматов, У. М. Clustering algorithms in data mining. Молодой ученый, 2019, No. 4 (242), pp. 14-15. URL: https://moluch.ru/archive/242/55919/ (accessed 19.10.2024).



Key words: clusters, algorithms, data mining, distance, metrics, normalize, k-means

As described in [1] and [2], clustering algorithms divide an input data set into a finite number of groups so that the items within a group are similar with respect to some measure. These algorithms can be applied to both normalized and non-normalized data. With normalized data the algorithms usually need fewer iterations, so in most situations normalized data gives better results than non-normalized data. Among the many clustering algorithms, density-based clustering is among the most popular in data mining. This paper distinguishes these clustering algorithms according to their properties, describes several issues associated with their use, and highlights some of their challenges.

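Clustering algorithms also rely on a distance function. For higher-dimensional data [2], the Minkowski metric is used:

$$\rho(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^{p} \right)^{1/p} \qquad (1)$$

where d is the dimension of the data, x and y are two data points, and p ≥ 1 is the order of the metric.

For the Euclidean distance p = 2, and for the Manhattan distance p = 1.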
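As a minimal sketch of the Minkowski metric, the function below computes the distance of order p between two points given as equal-length sequences. The function name `minkowski_distance` and the sample points are illustrative assumptions, not part of the original article.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension d")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    # Sum |x_i - y_i|^p over all d coordinates, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical sample points.
a, b = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski_distance(a, b, p=2)  # p = 2: Euclidean distance
manhattan = minkowski_distance(a, b, p=1)  # p = 1: Manhattan distance
print(euclidean, manhattan)
```

Setting p through a parameter keeps a single implementation for both special cases mentioned in the text.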

Some clustering algorithms, such as distributed K-Means clustering, work on normalized data. Data normalization linearly transforms the data into a specified range.

There are several approaches to normalization. According to [1], the most popular methods are Min-Max normalization, normalization by decimal scaling, and Z-score normalization. Min-Max normalization performs a linear transformation on the original data. Following [2], suppose that Max_A and Min_A are the maximum and minimum values of an attribute A. Min-Max normalization maps a value v of A to a value v' in the range [0, 1] by computing

$$v' = \frac{v - \mathrm{Min}_A}{\mathrm{Max}_A - \mathrm{Min}_A} \qquad (2)$$
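The Min-Max mapping can be sketched in a few lines of Python. The function name `min_max_normalize` and the sample values are assumptions made for illustration. The guard for a constant attribute is also an added assumption, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Map each value of an attribute linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute cannot be rescaled by this formula.
        raise ValueError("attribute has no spread; Min-Max is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values.
scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)
```

The minimum always maps to 0 and the maximum to 1, so a single extreme value compresses all other values toward one end of the range.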

In Z-score normalization, the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by calculating:

$$v' = \frac{v - \bar{A}}{\sigma_A} \qquad (3)$$

where $\bar{A}$ is the mean and $\sigma_A$ is the standard deviation of attribute A.

Z-score normalization works well in two cases: when the actual minimum and maximum of attribute A are unknown, and when there are noisy values (outliers) that dominate Min-Max normalization.
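A short sketch of Z-score normalization follows. It assumes the population standard deviation (`statistics.pstdev`), because the article does not say whether the population or the sample form is meant. The function name and the sample data are hypothetical.

```python
import statistics


def z_score_normalize(values):
    """Center values on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; use statistics.stdev
    # instead if the sample form is intended.
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; Z-score is undefined")
    return [(v - mean) / sigma for v in values]


# Hypothetical attribute values with mean 5 and population deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```

Unlike Min-Max, the result is not confined to a fixed interval, so an outlier yields a large score without squeezing the remaining values together.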

In decimal scaling normalization, a value v of attribute A is normalized to v' by calculating:

$$v' = \frac{v}{10^{j}} \qquad (4)$$

where j is the smallest integer such that max(|v'|) < 1.
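Decimal scaling can be sketched as below. The sketch assumes that j is searched upward from 0, so values that already lie inside (-1, 1) are left unchanged. The function name and the sample values are illustrative.

```python
def decimal_scaling_normalize(values):
    """Divide by the smallest power of ten that brings every |v'| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Assumption: j starts at 0 and only grows, so j is never negative.
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Hypothetical attribute values ranging from -986 to 917.
scaled_values, j = decimal_scaling_normalize([-986, 917])
print(scaled_values, j)
```

Only the decimal point moves, so the relative spacing of the values is preserved.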

In this article, I analyze several data mining algorithms: K-Means clustering, the distributed K-Means algorithm, K-Medoids clustering, hierarchical clustering, grid-based clustering, and density-based clustering [2]. After the analysis, I present the results according to their performance.

Data mining uses a variety of clustering approaches. Every clustering technique has advantages and disadvantages, and no single technique is appropriate for every situation. These clustering algorithms work as follows:

K-means clustering organizes data items into K groups based on their features, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data items and the centroid of their cluster.
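The following is a minimal sketch of K-means using the common alternating scheme (assign each point to its nearest centroid, then recompute the centroids). The article does not prescribe this particular procedure, and the random initialization from the data, the `seed` and `max_iter` parameters, and the sample points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) after alternating assign/update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(data, k=2)
print(sorted(centroids))
```

Each of the two steps can only lower or keep the sum of squared distances, which is why the loop settles, although possibly in a local rather than a global minimum.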

The K-Medoids clustering method is similar to K-means clustering, but it computes medoids instead of means. Partitioning Around Medoids (PAM) works well for small data sets but does not scale to large ones. The time complexity of each iteration is O(k(n-k)^2), where n is the number of data objects and k is the number of groups.

Difference between K-Means clustering and K-Medoids clustering: K-means computes each group's center as the mean of its members, whereas in K-medoids each group's center is represented by an actual point within the group. K-means is less robust than K-medoids in the presence of noise, because a medoid is less affected by noisy values than a mean. Even so, neither algorithm performs well on noisy data [3]. For this reason a newer algorithm for normalized data, called distributed K-Means clustering, was introduced.
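The difference in robustness can be seen in a small sketch that computes both centers for one group containing a single outlier. The helper names and the sample points are assumptions. The medoid is taken here as the member with the smallest total Euclidean distance to all other members.

```python
import math


def mean_point(points):
    """Coordinate-wise mean; it need not coincide with any data point."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def medoid(points):
    """The member whose summed distance to all members is smallest."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical group: four nearby points plus one outlier at (50, 50).
group = [(1, 1), (2, 1), (1, 2), (2, 2), (50, 50)]
center_mean = mean_point(group)
center_medoid = medoid(group)
print(center_mean, center_medoid)
```

The outlier drags the mean far away from the four nearby points, while the medoid stays inside them.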

Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. There are two types of hierarchical clustering:

1) Agglomerative hierarchical clustering, or AGNES (agglomerative nesting)

2) Divisive hierarchical clustering, or DIANA (divisive analysis).
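As an illustration of the agglomerative (bottom-up) variant, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters. The single-linkage rule (distance between the closest pair of members), the stopping criterion `num_clusters`, and the sample points are assumptions. The article does not specify a linkage rule.

```python
import math


def agnes_single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical points: two close pairs and one isolated point.
tree_cut = agnes_single_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)
print(tree_cut)
```

A divisive method such as DIANA runs in the opposite direction, starting from one cluster that holds all points and splitting it step by step.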

A grid-based algorithm partitions the data space into a grid of cells. It works as follows.

1. Define the set of grid cells.

2. Assign objects to the appropriate grid cell and calculate the density of every cell.

3. Remove cells whose density is below a certain threshold value t.

4. Form clusters from adjacent groups of dense cells.

There are two approaches: the first is STING (Statistical Information Grid) and the second is CLIQUE.
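The four grid-based steps listed above can be sketched for two-dimensional data as follows. This is a generic illustration, not an implementation of STING or CLIQUE. It assumes square cells of side `cell_size`, takes the density of a cell to be its point count, and treats cells as adjacent when they touch by an edge or a corner.

```python
from collections import defaultdict, deque


def grid_cluster(points, cell_size, t):
    """Cluster 2-D points by joining adjacent grid cells with >= t points."""
    # Steps 1-2: define cells implicitly and assign each point to its cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Step 3: keep only cells whose density (point count) reaches t.
    dense = {c for c, members in cells.items() if len(members) >= t}
    # Step 4: breadth-first search over adjacent dense cells.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            cx, cy = queue.popleft()
            component.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(component)
    return clusters


# Hypothetical data: two dense regions and one isolated point.
pts = [(0.1, 0.2), (0.4, 0.8), (1.2, 0.5), (1.7, 0.9),
       (5.1, 5.2), (5.6, 5.5), (9.0, 1.0)]
found = grid_cluster(pts, cell_size=1.0, t=2)
print([len(c) for c in found])
```

The cost of step 4 depends on the number of dense cells rather than on the number of points, which is the main attraction of grid-based methods.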

References:

  1. Ignatiev N. A. Obobschennye ocenki i lokalnye metriki obyektov v intellektialnom analize dannyh // NUU, Tashkent, 2015.
  2. Vorontsov K. V. Matematicheskie metody obucheniya po precendentam // Wiki-source MachineLearning.ru
  3. Ignatyev N. A., Madrakhimov Sh.F., Saidov D. Y. Stability of object classes and selection of the latent features // International journal of engineering technology and sciences (IJETS), 2017, Malaysia, Vol. 7