This dataset contains Hardware Performance Counters (HwPCs) measurements collected from parallel code regions executing on heterogeneous High Performance Computing platforms. The dataset includes comprehensive HwPC data from CPU architectures (Intel Xeon E5645 and Intel Xeon E5-4620) and GPU architectures (NVIDIA GeForce GTX 680 Kepler, GTX 750 Maxwell, GTX 1070 Pascal, RTX 2080 Turing, RTX 3090 Ampere, and RTX 4080 Ada Lovelace). Data was collected from multiple benchmark suites including STREAM, PolyBench, NAS Parallel Benchmarks, and GPU kernels (Convolution, Coulomb Sum, N-body, Transposition, GEMM, Reduction, Biconjugate Gradient, and Hotspot). The dataset encompasses measurements across varying problem sizes, thread configurations, affinity policies, scheduling strategies, and chunk sizes for OpenMP regions, as well as tuning parameters for GPU kernels including work-group size, work-item coarsening, memory caching strategies, tile sizes, loop unrolling, and vectorization. This data supports machine learning-based optimization of parallel applications through automated selection of minimal Hardware Performance Counter sets for code region identification and tuning parameter optimization.
Python, 3.7
GCC compiler, 9.2.0
METHODOLOGICAL INFORMATION
- Description of methods used for collection/generation of data:
CPU Data Collection:
Hardware Performance Counter data was collected using the Performance Application Programming Interface (PAPI) on two Intel Xeon platforms:
- Intel Xeon E5645 (STREAM benchmark data for memory bandwidth and latency analysis)
- Intel Xeon E5-4620 (32 cores - comprehensive benchmark suite)
Code regions were extracted from:
- STREAM benchmark (Copy, Scale, Sum, Triad) - memory bandwidth characterization
- PolyBench suite (12 regions) - synthetic benchmarks for scientific/engineering applications
- NAS Parallel Benchmarks (7 regions from BT, CG, and LU benchmarks) - CPU-bound parallel workload performance
- Custom benchmarks (Collatz sequences, Friendly numbers) - varying computational loads
The systematic dataset construction methodology varied OpenMP parameters: number of threads (up to 32), thread affinity policies (close, spread), scheduling strategies (static, dynamic, guided), and chunk sizes. Problem sizes were scaled proportionally to the memory hierarchy levels (L1, L2, and L3 caches, and main memory) based on the number of physical cores and processors. For each configuration, PAPI preset event groups were measured across multiple executions, with compatible events grouped to work around the hardware limit on simultaneously monitored counters. Each unique combination of HwPC group, problem size, and configuration parameter was executed 100 times for statistical significance.
Total executions per platform were calculated as: E = S × T × P × OP × N, where S = HwPC sets (12), T = thread configurations (32), P = problem sizes (29), OP = OpenMP tuning parameter combinations (11), and N = repetitions (100), yielding 12,249,600 executions for the Xeon E5-4620 platform.
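The execution count above can be verified directly from the formula:

```python
# Total executions for the Xeon E5-4620 platform: E = S * T * P * OP * N
S = 12    # HwPC sets
T = 32    # thread configurations
P = 29    # problem sizes
OP = 11   # OpenMP tuning parameter combinations
N = 100   # repetitions per configuration

E = S * T * P * OP * N
print(E)  # 12249600
```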
GPU Data Collection:
Hardware Performance Counter data for GPU experiments was collected using the Kernel Tuning Toolkit (KTT) across six NVIDIA GPU architectures:
- GeForce GTX 680 (Kepler) - 5 base code regions
- GeForce GTX 750 (Maxwell) - 5 base code regions
- GeForce GTX 1070 (Pascal) - 6 code regions (adds Reduction)
- GeForce RTX 2080 (Turing) - 7 code regions (adds Biconjugate Gradient, Hotspot)
- GeForce RTX 3090 (Ampere) - 5 base code regions
- GeForce RTX 4080 (Ada Lovelace) - 6 code regions (adds Biconjugate Gradient)
Base code regions (present across all or most architectures): Convolution, Coulomb Sum, N-body, Transposition, GEMM
Additional regions: Reduction (Pascal), Biconjugate Gradient (Turing, Ada Lovelace), Hotspot (Turing)
All GPU code regions use CUDA as the parallelization model.
Tuning parameters varied across code regions and included: work-group size, work-item coarsening, local memory caching, private memory caching, tile size, loop unrolling, local memory padding, and explicit vectorization.
Compilation and Execution Environment:
- CPU benchmarks: GCC version 9.2.0 with -O2 optimization flag
- GPU benchmarks:
* GTX 680 and GTX 750: NVCC with CUDA [version to be specified]
* GTX 1070 and RTX 2080: NVCC with CUDA 10.1 (driver 418.67)
* RTX 3090: NVCC with CUDA 12.1 (driver 535.183)
* RTX 4080: NVCC with CUDA 12.3 (driver 545.23)
References:
- Browne, S., et al. (2000). A Portable Programming Interface for Performance Evaluation on Modern Processors. International Journal of High Performance Computing Applications.
- Petrovič, F., et al. (2020). Benchmarking auto-tuning search spaces for GPU kernels [original benchmark reference]
- Petrovič, F., et al. (2023). Kernel Tuning Toolkit: A flexible infrastructure for efficient auto-tuning.
- Methods for processing the data:
The collected raw Hardware Performance Counter measurements underwent systematic preprocessing:
- Normalization: HwPC values were normalized to ensure comparability across different execution scales and architectures.
- Data cleaning: Null values and zero-variance features were identified and removed to eliminate uninformative counters.
- Feature engineering: For CPU datasets, the performance index Pi(X) was calculated as Pi(X) = (X × T_t(X)²) / T_t(1), where X is the number of threads, T_t(X) is the execution time with X threads, and T_t(1) is the single-thread execution time; the index relates execution time to resource efficiency.
- Optimal configuration labeling: For each code region and tuning parameter combination, optimal configurations were identified as those minimizing the objective function (Pi(X) for the number of threads; execution time for affinity, scheduling, and chunk size). These served as ground-truth labels for supervised learning.
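The cleaning and feature-engineering steps can be sketched as follows. This is a minimal illustration: the counter values and timing numbers are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw HwPC table: rows are executions, columns are counters.
df = pd.DataFrame({
    "PAPI_TOT_INS": [1.2e9, 1.3e9, 1.25e9],
    "PAPI_L1_DCM":  [4.0e6, 4.1e6, 3.9e6],
    "PAPI_VEC_SP":  [0.0, 0.0, 0.0],      # zero-variance -> uninformative
})

# Data cleaning: drop all-null columns and zero-variance features.
df = df.dropna(axis=1, how="all")
df = df.loc[:, df.nunique() > 1]

def performance_index(threads, t_x, t_1):
    """Pi(X) = (X * T_t(X)^2) / T_t(1); lower is better."""
    return threads * t_x**2 / t_1

# Example: 8 threads finishing in 2.0 s vs. 10.0 s single-threaded.
print(performance_index(8, 2.0, 10.0))  # 3.2
```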
Machine Learning Pipeline:
The processed data was used to train separate ensemble models for: (1) code region identification, and (2) tuning parameter optimization for each identified region. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC AUC metrics.
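The pipeline's training and evaluation setup can be sketched with one of the listed models (Random Forest). The features here are synthetic stand-ins for normalized HwPC data; the 30% stratified holdout and fixed seed mirror the quality-assurance section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for normalized HwPC features and code-region labels.
X, y = make_classification(n_samples=600, n_features=12, n_classes=4,
                           n_informative=8, random_state=42)

# 30% holdout test set, stratified, with a fixed random seed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro"))
```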
- Instrument- or software-specific information needed to interpret the data:
Required Software:
- PAPI (Performance Application Programming Interface) - version compatible with target CPU architecture
Available at: https://icl.utk.edu/papi/
- Kernel Tuning Toolkit (KTT) - for GPU data collection
Available at: https://github.com/Fillo7/KTT
- Python 3.7+ with libraries:
- scikit-learn (Logistic Regression, Random Forest)
- xgboost
- pytorch-tabnet (TabNet implementation)
- numpy, pandas (data manipulation)
- matplotlib, seaborn (visualization)
- GCC compiler 9.2.0 (CPU benchmarks)
- NVIDIA CUDA Toolkit:
- CUDA 10.1 for GTX 1070 and RTX 2080
- CUDA 12.1 for RTX 3090
- CUDA 12.3 for RTX 4080
- OpenMP runtime library (for parallel CPU execution)
- Archive extraction tools:
- 7-Zip or p7zip for multi-part ZIP archives (.zip.001, .zip.002, etc.) and .7z files
- On Linux/macOS: cat command to concatenate multi-part archives before extraction
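The cat-style concatenation step can also be done portably in Python. A minimal sketch, using a small synthetic archive split into two parts (file names are illustrative):

```python
import glob
import os
import tempfile
import zipfile

def join_parts(pattern, output):
    """Concatenate multi-part archive files (.zip.001, .zip.002, ...)
    in lexicographic order into a single file, like `cat` would."""
    with open(output, "wb") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part, "rb") as f:
                out.write(f.read())

# Demonstration: build a tiny zip, split it in two, re-join, and reopen.
tmp = tempfile.mkdtemp()
whole = os.path.join(tmp, "data.zip")
with zipfile.ZipFile(whole, "w") as z:
    z.writestr("counters.csv", "PAPI_TOT_INS,PAPI_L1_DCM\n100,4\n")

raw = open(whole, "rb").read()
half = len(raw) // 2
open(os.path.join(tmp, "data.zip.001"), "wb").write(raw[:half])
open(os.path.join(tmp, "data.zip.002"), "wb").write(raw[half:])
os.remove(whole)

join_parts(os.path.join(tmp, "data.zip.*"), whole)
with zipfile.ZipFile(whole) as z:
    print(z.namelist())  # ['counters.csv']
```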
File Formats:
- HwPC measurements: CSV format with counter values as numerical data
- Compressed archives: Multi-part ZIP (.zip.001, .zip.002, etc.) and 7z format
- Model checkpoints: Pickle format
- Configuration files: JSON format
The dataset uses open, standard formats (CSV for tabular data) to ensure accessibility without proprietary software.
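All three formats can be read with the Python standard library alone. The field names and values below are illustrative, not the dataset's actual schema:

```python
import csv
import io
import json
import pickle

# CSV: HwPC measurements with counter values as numerical data.
csv_text = "region,PAPI_TOT_INS,PAPI_L1_DCM\nstream_triad,1.2e9,4.0e6\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["region"])            # stream_triad

# JSON: configuration files (keys here are hypothetical).
config = json.loads('{"threads": 16, "schedule": "guided", "chunk": 64}')
print(config["schedule"])           # guided

# Pickle: model checkpoints round-trip through bytes.
blob = pickle.dumps({"model": "random_forest", "n_estimators": 100})
print(pickle.loads(blob)["model"])  # random_forest
```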
- Instruments, calibration and standards information:
Hardware Platforms:
CPU Platforms:
- Processor 1: Intel Xeon E5645 (used for STREAM benchmark data)
Purpose: Memory bandwidth and latency characteristics analysis
- Processor 2: Intel Xeon E5-4620 (32 cores)
Memory hierarchy: L1, L2, L3 caches and main memory
PAPI preset events: 50 counters grouped by category (Branches: 6, Cache L1: 5, Cache L2: 14, Cache L3: 9, TLB: 2, Cycles: 3, Operations: 3, Instructions: 8)
Purpose: Comprehensive benchmark suite (PolyBench, NAS, custom benchmarks)
GPU Platforms:
- NVIDIA GeForce GTX 680 (Kepler): Available HwPCs [to be specified]
- NVIDIA GeForce GTX 750 (Maxwell): Available HwPCs [to be specified]
- NVIDIA GeForce GTX 1070 (Pascal): 32 available HwPCs
- NVIDIA GeForce RTX 2080 (Turing): 167 available HwPCs
- NVIDIA GeForce RTX 3090 (Ampere): 167 available HwPCs
- NVIDIA GeForce RTX 4080 (Ada Lovelace): 167 available HwPCs
Calibration and Measurement Standards:
- Each configuration was executed 100 times to ensure statistical significance and account for system noise
- Hardware counter measurements follow PAPI standardized preset event definitions for cross-platform consistency
- GPU measurements adhere to NVIDIA's CUPTI (CUDA Profiling Tools Interface) standards
- Measurement reliability considerations based on Weaver and McKee findings (coefficients of variation up to 1.07% under standard conditions)
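The coefficient-of-variation check referenced above is straightforward to compute over the 100 repetitions of a configuration. The readings below are hypothetical:

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical repeated counter readings for one configuration.
readings = [1.000e9, 1.004e9, 0.998e9, 1.002e9, 0.996e9]
print(round(coefficient_of_variation(readings), 3))  # 0.316
```

A CV well below the ~1.07% bound reported by Weaver and McKee suggests the repetitions are consistent enough to average.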
- Environmental or experimental conditions:
Controlled Execution Environment:
- Dedicated computing nodes with minimal background processes
- Consistent system configuration across all experimental runs
- No concurrent user applications during measurements
CPU Experimental Conditions:
- Platforms: Intel Xeon E5645 and Intel Xeon E5-4620 (32 physical cores)
- Operating system: Linux
- Thread configurations: 1 to 32 threads tested
- Affinity policies: close, spread
- Scheduling policies: static, dynamic, guided
- Problem sizes: Scaled proportionally to memory hierarchy (L1, L2, L3 cache sizes, main memory)
- Compiler optimization: GCC 9.2.0 with -O2 flag
GPU Experimental Conditions:
- CUDA driver versions maintained consistent with CUDA toolkit versions
- Problem sizes varied according to code region characteristics
- Kernel execution isolated with synchronization barriers
- Warm-up iterations performed before measurement collection
Systematic Parameter Variation:
- All combinations of tuning parameters explored exhaustively for training data
- Random sampling strategies avoided to ensure comprehensive coverage
- Hardware counter multiplexing employed to overcome simultaneous monitoring limitations
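Exhaustive exploration of a tuning space amounts to enumerating the Cartesian product of the parameter value sets. A sketch with a hypothetical GPU tuning space (the parameter names follow those listed earlier; the value sets are illustrative, not the actual search spaces):

```python
from itertools import product

# Hypothetical tuning space for one GPU code region.
tuning_space = {
    "work_group_size": [64, 128, 256],
    "tile_size": [8, 16, 32],
    "loop_unroll": [1, 2, 4],
    "vectorize": [False, True],
}

# Exhaustive exploration: every combination of tuning parameters.
configs = [dict(zip(tuning_space, values))
           for values in product(*tuning_space.values())]
print(len(configs))  # 54 = 3 * 3 * 3 * 2
```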
- Quality-assurance procedures performed on the data:
Statistical Validation:
- 100 repetitions per configuration to ensure measurement reliability
- Outlier detection and handling applied to identify anomalous measurements
- Statistical significance testing performed on measured performance differences
Data Integrity Checks:
- Null value identification and documentation
- Zero-variance feature detection and removal
- Range validation for HwPC values (ensuring physically plausible measurements)
- Cross-validation of results across different execution runs
Model Validation:
- Stratified 5-fold cross-validation to assess generalization
- 30% holdout test set for final model evaluation
- Comparison with baseline methods (OpenTuner) for performance validation
- Architecture-specific validation on unseen code regions (NAS Parallel Benchmarks)
Reproducibility Measures:
- Controlled experimental environment with documented system configurations
- Fixed random seeds for reproducible data splits
- Systematic documentation of all parameter combinations
- Version control of all software dependencies