SDOclust Evaluation Tests

Evaluation tests conducted for the paper: SDOclust: Clustering with Sparse Data Observers

Context and methodology
SDOclust is a clustering extension of the Sparse Data Observers (SDO) algorithm. SDOclust uses data observers as graph nodes and clusters them by considering connected components and local thresholding. The observers' labels are subsequently propagated to the data points.

In this repository, SDOclust is evaluated with 15 two-dimensional synthetic datasets, 138 multi-dimensional synthetic datasets, and 2 real-application datasets, and compared with the HDBSCAN and k-means-- algorithms. This repository is framed within research in the following domains: algorithm evaluation, clustering, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.

Technical details
Experiments are conducted in Python 3. The file and folder structure is as follows:

[data2d] contains 15 two-dimensional datasets as CSV files (last column is the label).
[dataMd] contains 138 multi-dimensional datasets as CSV files (last column is the label).
[dataReal] contains 2 real-application datasets as CSV files (last column is the label).
[plots] contains plots (.png, .pdf) with results generated by the test scripts.
[tables] contains tables (.csv, .tex) with results generated by the test scripts.
[cddiag] contains scripts for generating critical difference diagrams with Wilcoxon signed-rank tests and plots from the conducted tests.
"dependencies.py" installs the required Python packages.
"tests_2d.py" runs the 2D experiments.
"tests_Md.py" runs the multi-dimensional experiments.
"test_mawi.py" runs experiments with real network traffic data from MAWI captures.
"test_sirena.py" runs experiments with real electricity consumption data from the Sirena project.
"sdo.py" implements the SDOclust functions.
"pamse2d.py" runs a sensitivity analysis on the SDOclust parameters.
"update_test.py" shows an example of SDOclust working in update mode.
"gbc.py" contains functions for the graph-based clustering implementation (based on https://github.com/dayyass/graph-based-clustering).
"kmeansmm.py" is the k-means-- implementation (based on https://github.com/Strizzo/kmeans--).
"LICENSE" file.
"README.md" for further details, links to sources, and instructions for reproducibility.

License
The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GNU GPL license.
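The dataset folders listed under Technical details store the label in the last CSV column. A minimal sketch of how such a file could be split into features and labels, using only the Python standard library, is given below. The function name load_labeled_csv, the comma delimiter, the absence of a header row, and integer-valued labels are assumptions made for illustration; they are not confirmed by this record, and the repository's own test scripts may load the data differently.

```python
import csv
import io


def load_labeled_csv(handle, has_header=False):
    """Split CSV rows into feature vectors and labels (last column).

    Hypothetical helper: comma delimiter, numeric features and
    integer-valued labels are assumptions, not documented facts.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip a header row if the file has one
    features, labels = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        features.append([float(value) for value in row[:-1]])
        labels.append(int(float(row[-1])))  # accepts "1" as well as "1.0"
    return features, labels


# Tiny in-memory stand-in for a two-dimensional dataset file.
sample = io.StringIO("0.1,0.2,0\n0.9,0.8,1\n0.15,0.25,0\n")
X, y = load_labeled_csv(sample)
```

For a file on disk, the same function could be called on an opened file object, for example with open(path, newline="") as handle, where path points to one of the CSV files in the dataset folders.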

Identifier
DOI https://doi.org/10.48436/3q7jp-mg161
Related Identifier IsDerivedFrom https://doi.org/10.1109/ICDMW.2018.00140
Related Identifier IsVersionOf https://github.com/CN-TU/pysdoclust
Metadata Access https://researchdata.tuwien.ac.at/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:researchdata.tuwien.ac.at:3q7jp-mg161
Provenance
Creator Iglesias Vázquez, Félix (ORCID: 0000-0001-6081-969X)
Publisher TU Wien
Publication Year 2023
Rights Creative Commons Attribution 4.0 International; GNU General Public License v3.0 or later; https://creativecommons.org/licenses/by/4.0/legalcode; https://www.gnu.org/licenses/gpl-3.0-standalone.html
OpenAccess true
Contact tudata(at)tuwien.ac.at
Representation
Resource Type Software
Version 1.0.0
Discipline Other