This dataset contains Hardware Performance Counters (HwPCs) measurements collected from parallel code regions executing on heterogeneous High Performance Computing platforms. The dataset includes comprehensive HwPC data from CPU architectures (Intel Xeon E5645 and Intel Xeon E5-4620) and GPU architectures (NVIDIA GeForce GTX 680 Kepler, GTX 750 Maxwell, GTX 1070 Pascal, RTX 2080 Turing, RTX 3090 Ampere, and RTX 4080 Ada Lovelace). Data was collected from multiple benchmark suites including STREAM, PolyBench, NAS Parallel Benchmarks, and GPU kernels (Convolution, Coulomb Sum, N-body, Transposition, GEMM, Reduction, Biconjugate Gradient, and Hotspot). The dataset encompasses measurements across varying problem sizes, thread configurations, affinity policies, scheduling strategies, and chunk sizes for OpenMP regions, as well as tuning parameters for GPU kernels including work-group size, work-item coarsening, memory caching strategies, tile sizes, loop unrolling, and vectorization. This data supports machine learning-based optimization of parallel applications through automated selection of minimal Hardware Performance Counter sets for code region identification and tuning parameter optimization.
Python, 3.7
GCC compiler, 9.2.0
METHODOLOGICAL INFORMATION
- Description of methods used for collection/generation of data:
CPU Data Collection:
Hardware Performance Counter data was collected using the Performance Application Programming Interface (PAPI) on two Intel Xeon platforms:
- Intel Xeon E5645 (STREAM benchmark data for memory bandwidth and latency analysis)
- Intel Xeon E5-4620 (32 cores - comprehensive benchmark suite)
Code regions were extracted from:
- STREAM benchmark (Copy, Scale, Sum, Triad) - memory bandwidth characterization
- PolyBench suite (12 regions) - synthetic benchmarks for scientific/engineering applications
- NAS Parallel Benchmarks (7 regions from BT, CG, and LU benchmarks) - CPU-bound parallel workload performance
- Custom benchmarks (Collatz sequences, Friendly numbers) - varying computational loads
The systematic dataset construction methodology varied OpenMP parameters: number of threads (up to 32), thread affinity policies (close, spread), scheduling strategies (static, dynamic, guided), and chunk sizes. Problem sizes were scaled proportionally to the memory hierarchy levels (L1, L2, and L3 caches, and main memory) based on the number of physical cores and processors. For each configuration, PAPI preset event groups were measured across multiple executions, with compatible events grouped to work around the hardware limit on simultaneously monitored counters. Each unique combination of HwPC group, problem size, and configuration parameter was executed 100 times for statistical significance.
Total executions per platform were calculated as: E = S × T × P × OP × N, where S = HwPC sets (12), T = thread configurations (32), P = problem sizes (29), OP = OpenMP tuning parameter combinations (11), and N = repetitions (100), yielding 12,249,600 executions for the Xeon E5-4620 platform.
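The execution count above can be verified directly from the formula:

```python
# Total executions for the Xeon E5-4620 platform: E = S * T * P * OP * N
S = 12    # HwPC sets
T = 32    # thread configurations
P = 29    # problem sizes
OP = 11   # OpenMP tuning parameter combinations
N = 100   # repetitions per configuration

E = S * T * P * OP * N
print(E)  # 12249600
```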
GPU Data Collection:
Hardware Performance Counter data for GPU experiments was collected using the Kernel Tuning Toolkit (KTT) across six NVIDIA GPU architectures:
- GeForce GTX 680 (Kepler) - 5 base code regions
- GeForce GTX 750 (Maxwell) - 5 base code regions
- GeForce GTX 1070 (Pascal) - 6 code regions (adds Reduction)
- GeForce RTX 2080 (Turing) - 7 code regions (adds Biconjugate Gradient, Hotspot)
- GeForce RTX 3090 (Ampere) - 5 base code regions
- GeForce RTX 4080 (Ada Lovelace) - 6 code regions (adds Biconjugate Gradient)
Base code regions (present across all or most architectures): Convolution, Coulomb Sum, N-body, Transposition, GEMM
Additional regions: Reduction (Pascal), Biconjugate Gradient (Turing, Ada Lovelace), Hotspot (Turing)
All GPU code regions use CUDA as the parallelization model.
Tuning parameters varied across code regions and included: work-group size, work-item coarsening, local memory caching, private memory caching, tile size, loop unrolling, local memory padding, and explicit vectorization.
Compilation and Execution Environment:
- CPU benchmarks: GCC version 9.2.0 with -O2 optimization flag
- GPU benchmarks:
* GTX 680 and GTX 750: NVCC with CUDA [version to be specified]
* GTX 1070 and RTX 2080: NVCC with CUDA 10.1 (driver 418.67)
* RTX 3090: NVCC with CUDA 12.1 (driver 535.183)
* RTX 4080: NVCC with CUDA 12.3 (driver 545.23)
References:
- Browne, S., et al. (2000). A Portable Programming Interface for Performance Evaluation on Modern Processors. International Journal of High Performance Computing Applications.
- Petrovič, F., et al. (2020). Benchmarking auto-tuning search spaces for GPU kernels [original benchmark reference]
- Petrovič, F., et al. (2023). Kernel Tuning Toolkit: A flexible infrastructure for efficient auto-tuning.
- Methods for processing the data:
The collected raw Hardware Performance Counter measurements underwent systematic preprocessing:
- Normalization: HwPC values were normalized to ensure comparability across different execution scales and architectures.
- Data cleaning: Null values and zero-variance features were identified and removed to eliminate uninformative counters.
- Feature engineering: For CPU datasets, the performance index Pi(X) was calculated as Pi(X) = (X × T_t(X)²) / T_t(1), where X is the number of threads, T_t(X) is the execution time with X threads, and T_t(1) is the single-thread execution time; the index relates execution time to resource efficiency.
- Optimal configuration labeling: For each code region and tuning parameter combination, optimal configurations were identified as those minimizing the objective function (Pi(X) for the number of threads; execution time for affinity, scheduling, and chunk size). These served as ground-truth labels for supervised learning.
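The cleaning and feature-engineering steps can be sketched as follows. This is a minimal illustration: the counter values and timing numbers are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw HwPC table: rows are executions, columns are counters.
df = pd.DataFrame({
    "PAPI_TOT_INS": [1.2e9, 1.3e9, 1.25e9],
    "PAPI_L1_DCM":  [4.0e6, 4.1e6, 3.9e6],
    "PAPI_VEC_SP":  [0.0, 0.0, 0.0],      # zero-variance -> uninformative
})

# Data cleaning: drop all-null columns and zero-variance features.
df = df.dropna(axis=1, how="all")
df = df.loc[:, df.nunique() > 1]

def performance_index(threads, t_x, t_1):
    """Pi(X) = (X * T_t(X)^2) / T_t(1); lower is better."""
    return threads * t_x**2 / t_1

# Example: 8 threads finishing in 2.0 s vs. 10.0 s single-threaded.
print(performance_index(8, 2.0, 10.0))  # 3.2
```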
Machine Learning Pipeline:
The processed data was used to train separate ensemble models for: (1) code region identification, and (2) tuning parameter optimization for each identified region. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC AUC metrics.
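The pipeline's training and evaluation setup can be sketched with one of the listed models (Random Forest). The features here are synthetic stand-ins for normalized HwPC data; the 30% stratified holdout and fixed seed mirror the quality-assurance section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for normalized HwPC features and code-region labels.
X, y = make_classification(n_samples=600, n_features=12, n_classes=4,
                           n_informative=8, random_state=42)

# 30% holdout test set, stratified, with a fixed random seed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro"))
```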
- Instrument- or software-specific information needed to interpret the data:
Required Software:
- PAPI (Performance Application Programming Interface) - version compatible with target CPU architecture
Available at: https://icl.utk.edu/papi/
- Kernel Tuning Toolkit (KTT) - for GPU data collection
Available at: https://github.com/Fillo7/KTT
- Python 3.7+ with libraries:
- scikit-learn (Logistic Regression, Random Forest)
- xgboost
- pytorch-tabnet (TabNet implementation)
- numpy, pandas (data manipulation)
- matplotlib, seaborn (visualization)
- GCC compiler 9.2.0 (CPU benchmarks)
- NVIDIA CUDA Toolkit:
- CUDA 10.1 for GTX 1070 and RTX 2080
- CUDA 12.1 for RTX 3090
- CUDA 12.3 for RTX 4080
- OpenMP runtime library (for parallel CPU execution)
- Archive extraction tools:
- 7-Zip or p7zip for multi-part ZIP archives (.zip.001, .zip.002, etc.) and .7z files
- On Linux/macOS: cat command to concatenate multi-part archives before extraction
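The cat-style concatenation step can also be done portably in Python. A minimal sketch, using a small synthetic archive split into two parts (file names are illustrative):

```python
import glob
import os
import tempfile
import zipfile

def join_parts(pattern, output):
    """Concatenate multi-part archive files (.zip.001, .zip.002, ...)
    in lexicographic order into a single file, like `cat` would."""
    with open(output, "wb") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part, "rb") as f:
                out.write(f.read())

# Demonstration: build a tiny zip, split it in two, re-join, and reopen.
tmp = tempfile.mkdtemp()
whole = os.path.join(tmp, "data.zip")
with zipfile.ZipFile(whole, "w") as z:
    z.writestr("counters.csv", "PAPI_TOT_INS,PAPI_L1_DCM\n100,4\n")

raw = open(whole, "rb").read()
half = len(raw) // 2
open(os.path.join(tmp, "data.zip.001"), "wb").write(raw[:half])
open(os.path.join(tmp, "data.zip.002"), "wb").write(raw[half:])
os.remove(whole)

join_parts(os.path.join(tmp, "data.zip.*"), whole)
with zipfile.ZipFile(whole) as z:
    print(z.namelist())  # ['counters.csv']
```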
File Formats:
- HwPC measurements: CSV format with counter values as numerical data
- Compressed archives: Multi-part ZIP (.zip.001, .zip.002, etc.) and 7z format
- Model checkpoints: Pickle format
- Configuration files: JSON format
The dataset uses open, standard formats (CSV for tabular data) to ensure accessibility without proprietary software.
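All three formats can be read with the Python standard library alone. The field names and values below are illustrative, not the dataset's actual schema:

```python
import csv
import io
import json
import pickle

# CSV: HwPC measurements with counter values as numerical data.
csv_text = "region,PAPI_TOT_INS,PAPI_L1_DCM\nstream_triad,1.2e9,4.0e6\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["region"])            # stream_triad

# JSON: configuration files (keys here are hypothetical).
config = json.loads('{"threads": 16, "schedule": "guided", "chunk": 64}')
print(config["schedule"])           # guided

# Pickle: model checkpoints round-trip through bytes.
blob = pickle.dumps({"model": "random_forest", "n_estimators": 100})
print(pickle.loads(blob)["model"])  # random_forest
```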
- Instruments, calibration and standards information:
Hardware Platforms:
CPU Platforms:
- Processor 1: Intel Xeon E5645 (used for STREAM benchmark data)
Purpose: Memory bandwidth and latency characteristics analysis
- Processor 2: Intel Xeon E5-4620 (32 cores)
Memory hierarchy: L1, L2, L3 caches and main memory
PAPI preset events: 50 counters grouped by category (Branches: 6, Cache L1: 5, Cache L2: 14, Cache L3: 9, TLB: 2, Cycles: 3, Operations: 3, Instructions: 8)
Purpose: Comprehensive benchmark suite (PolyBench, NAS, custom benchmarks)
GPU Platforms:
- NVIDIA GeForce GTX 680 (Kepler): Available HwPCs [to be specified]
- NVIDIA GeForce GTX 750 (Maxwell): Available HwPCs [to be specified]
- NVIDIA GeForce GTX 1070 (Pascal): 32 available HwPCs
- NVIDIA GeForce RTX 2080 (Turing): 167 available HwPCs
- NVIDIA GeForce RTX 3090 (Ampere): 167 available HwPCs
- NVIDIA GeForce RTX 4080 (Ada Lovelace): 167 available HwPCs
Calibration and Measurement Standards:
- Each configuration was executed 100 times to ensure statistical significance and account for system noise
- Hardware counter measurements follow PAPI standardized preset event definitions for cross-platform consistency
- GPU measurements adhere to NVIDIA's CUPTI (CUDA Profiling Tools Interface) standards
- Measurement reliability considerations based on Weaver and McKee findings (coefficients of variation up to 1.07% under standard conditions)
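The coefficient-of-variation check referenced above is straightforward to compute over the 100 repetitions of a configuration. The readings below are hypothetical:

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical repeated counter readings for one configuration.
readings = [1.000e9, 1.004e9, 0.998e9, 1.002e9, 0.996e9]
print(round(coefficient_of_variation(readings), 3))  # 0.316
```

A CV well below the ~1.07% bound reported by Weaver and McKee suggests the repetitions are consistent enough to average.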
- Environmental or experimental conditions:
Controlled Execution Environment:
- Dedicated computing nodes with minimal background processes
- Consistent system configuration across all experimental runs
- No concurrent user applications during measurements
CPU Experimental Conditions:
- Platforms: Intel Xeon E5645 and Intel Xeon E5-4620 (32 physical cores)
- Operating system: Linux
- Thread configurations: 1 to 32 threads tested
- Affinity policies: close, spread
- Scheduling policies: static, dynamic, guided
- Problem sizes: Scaled proportionally to memory hierarchy (L1, L2, L3 cache sizes, main memory)
- Compiler optimization: GCC 9.2.0 with -O2 flag
GPU Experimental Conditions:
- CUDA driver versions maintained consistent with CUDA toolkit versions
- Problem sizes varied according to code region characteristics
- Kernel execution isolated with synchronization barriers
- Warm-up iterations performed before measurement collection
Systematic Parameter Variation:
- All combinations of tuning parameters explored exhaustively for training data
- Random sampling strategies avoided to ensure comprehensive coverage
- Hardware counter multiplexing employed to overcome simultaneous monitoring limitations
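Exhaustive exploration of a tuning space amounts to enumerating the Cartesian product of the parameter value sets. A sketch with a hypothetical GPU tuning space (the parameter names follow those listed earlier; the value sets are illustrative, not the actual search spaces):

```python
from itertools import product

# Hypothetical tuning space for one GPU code region.
tuning_space = {
    "work_group_size": [64, 128, 256],
    "tile_size": [8, 16, 32],
    "loop_unroll": [1, 2, 4],
    "vectorize": [False, True],
}

# Exhaustive exploration: every combination of tuning parameters.
configs = [dict(zip(tuning_space, values))
           for values in product(*tuning_space.values())]
print(len(configs))  # 54 = 3 * 3 * 3 * 2
```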
- Quality-assurance procedures performed on the data:
Statistical Validation:
- 100 repetitions per configuration to ensure measurement reliability
- Outlier detection and handling applied to identify anomalous measurements
- Statistical significance testing performed on measured performance differences
Data Integrity Checks:
- Null value identification and documentation
- Zero-variance feature detection and removal
- Range validation for HwPC values (ensuring physically plausible measurements)
- Cross-validation of results across different execution runs
Model Validation:
- Stratified 5-fold cross-validation to assess generalization
- 30% holdout test set for final model evaluation
- Comparison with baseline methods (OpenTuner) for performance validation
- Architecture-specific validation on unseen code regions (NAS Parallel Benchmarks)
Reproducibility Measures:
- Controlled experimental environment with documented system configurations
- Fixed random seeds for reproducible data splits
- Systematic documentation of all parameter combinations
- Version control of all software dependencies