How IoTSyn Works
Complete documentation of the mathematical models, probability distributions, and physical equations behind every dataset IoTSyn generates.
1. Design Philosophy
IoTSyn generates synthetic IoT data using fully documented, physics-based statistical models. Unlike approaches that rely on generative AI models (GANs, VAEs, diffusion models), IoTSyn produces data from explicit mathematical equations where every parameter can be inspected, adjusted, and justified.
This transparency is critical for research: when training a machine learning model on synthetic data, the researcher must understand exactly what statistical properties the training data exhibits — and why.
Key Principles
2. Statistical Engine
The core engine provides probability distributions and stochastic processes used by all domain generators.
2.1 Gaussian Distribution (Box-Muller Transform)
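Normal variates are generated using the Box-Muller transform [1]. Given U₁, U₂ drawn independently from Uniform(0,1):
Z₀ = √(−2·ln U₁)·cos(2π·U₂)
Z₁ = √(−2·ln U₁)·sin(2π·U₂)
Z₀ and Z₁ are independent standard normal variates.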
The spare value Z₁ is cached as an instance variable (not static) to prevent cross-contamination between generator instances.
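The caching behaviour can be illustrated with a minimal Python sketch. This is not IoTSyn's PHP code; the class name `GaussianSource` and its methods are hypothetical, and Python's `random.Random` stands in for the seeded uniform source.

```python
import math
import random


class GaussianSource:
    """Box-Muller normal generator with an instance-level spare value (illustrative)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)  # per-instance uniform source
        self._spare = None               # cached Z1, never shared between instances

    def standard_normal(self):
        # Return the cached spare first, if one is available.
        if self._spare is not None:
            z, self._spare = self._spare, None
            return z
        u1 = 1.0 - self._rng.random()    # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = self._rng.random()
        radius = math.sqrt(-2.0 * math.log(u1))
        angle = 2.0 * math.pi * u2
        self._spare = radius * math.sin(angle)  # Z1, returned on the next call
        return radius * math.cos(angle)         # Z0

    def normal(self, mean, std):
        # Scale and shift a standard normal variate.
        return mean + std * self.standard_normal()
```

Because the spare lives on the instance, two generators created with the same seed produce the same sequence regardless of how calls to other generators are interleaved.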
2.2 Poisson Distribution
For λ < 30, Knuth's multiplication algorithm [2] is used. For λ ≥ 30, the variate is approximated by a normal variate with mean λ and standard deviation √λ, which the Central Limit Theorem justifies for large λ.
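A minimal Python sketch of this two-branch sampler follows. The function name `poisson` is hypothetical, and rounding the normal draw and clamping it at zero are assumptions of the sketch, since the text does not say how the continuous value is converted to a count.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate (illustrative sketch)."""
    if lam < 30:
        # Knuth: multiply uniforms until the product drops to exp(-lam) or below.
        threshold = math.exp(-lam)
        count = 0
        product = rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        return count
    # Normal approximation with mean lam and standard deviation sqrt(lam).
    # Rounding and clamping at 0 are assumptions of this sketch.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))
```

The threshold of 30 keeps the loop in Knuth's method short, since its expected number of iterations grows linearly with λ.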
2.3 Additional Distributions
| Distribution | Method | Use Case |
|---|---|---|
| Exponential | X = −ln(U) / λ | Inter-arrival times, connection durations |
| Log-Normal | Y = exp(N(μ, σ)) | Packet sizes, response times |
| Gamma | Marsaglia & Tsang [3] | Aggregate traffic, waiting times |
| Weibull | X = λ·(−ln(U))^(1/k) | Equipment failure modeling |
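The closed-form rows of the table translate directly into code. The Python sketch below uses hypothetical function names and covers the Exponential, Log-Normal, and Weibull rows; the Gamma row is omitted because the Marsaglia & Tsang method is a rejection sampler rather than a one-line transform.

```python
import math
import random


def exponential(rate, rng):
    """X = -ln(U) / rate, with U in (0, 1]."""
    u = 1.0 - rng.random()  # avoid log(0)
    return -math.log(u) / rate


def log_normal(mu, sigma, rng):
    """Y = exp(N(mu, sigma)), where sigma is a standard deviation."""
    return math.exp(rng.gauss(mu, sigma))


def weibull(scale, shape, rng):
    """X = scale * (-ln(U))^(1/shape), with U in (0, 1]."""
    u = 1.0 - rng.random()  # avoid log(0)
    return scale * (-math.log(u)) ** (1.0 / shape)
```

With shape k = 1 the Weibull formula reduces to an Exponential with rate 1/λ, which is a convenient sanity check for both samplers.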
2.4 Bivariate Correlation (Cholesky Method)
Correlated pairs are generated using bivariate Cholesky decomposition [4]:
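For two standard normal variates, the 2×2 Cholesky factor of the correlation matrix gives X = Z₁ and Y = ρ·Z₁ + √(1−ρ²)·Z₂. The Python sketch below illustrates that standard construction; the function name is hypothetical, and any scaling to physical means and standard deviations that IoTSyn applies afterwards is not shown.

```python
import math
import random


def correlated_pair(rho, rng):
    """Return two standard normal variates with correlation rho (illustrative)."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = z1
    # Second row of the Cholesky factor of [[1, rho], [rho, 1]].
    y = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
    return x, y
```

Both outputs keep unit variance, so they can be rescaled independently without changing the correlation.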
2.5 Autoregressive Noise — AR(1)
Temporally correlated noise prevents unrealistic jumps between consecutive readings:
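An AR(1) process has the general form ε_t = φ·ε_{t−1} + η_t with Gaussian innovations η_t and |φ| < 1. The Python sketch below is one way to generate it; the function name, the choice to parameterise by the stationary standard deviation, and any values passed for φ are assumptions, not IoTSyn's documented settings.

```python
import math
import random


def ar1_noise(n, phi, sigma, rng):
    """Return n samples of AR(1) noise with lag-1 correlation phi
    and stationary standard deviation sigma (illustrative sketch)."""
    # Scale the innovations so the stationary variance equals sigma^2.
    innovation_std = sigma * math.sqrt(1.0 - phi * phi)
    value = rng.gauss(0.0, sigma)  # start from the stationary distribution
    series = []
    for _ in range(n):
        series.append(value)
        value = phi * value + rng.gauss(0.0, innovation_std)
    return series
```

Values of φ close to 1 give slowly drifting noise, while φ = 0 reduces to independent Gaussian noise.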
2.6 Markov Chain Transitions
Discrete states (occupancy, activity, attack phases) transition according to time-dependent probability matrices. Given the current state i, the next state j is drawn from row i of the matrix, P(X_{n+1} = j | X_n = i) = p_ij, using the inverse CDF method.
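Inverse CDF sampling of a discrete row amounts to drawing one uniform number and walking the cumulative sum of the row until it is exceeded. A minimal Python sketch, with a hypothetical function name and states represented as integer indices:

```python
import random


def next_state(current, matrix, rng):
    """Sample the next state index from row `current` of a transition matrix."""
    u = rng.random()
    cumulative = 0.0
    row = matrix[current]
    for state, probability in enumerate(row):
        cumulative += probability
        if u < cumulative:
            return state
    return len(row) - 1  # guard against floating-point round-off in the row sum
```

Time dependence can be handled by selecting `matrix` from a per-hour lookup before each call; the sampling step itself is unchanged.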
3. Smart Home Generator
Generates indoor environment sensor data: temperature, humidity, CO₂, light, occupancy, and HVAC status.
Temperature Model
Indoor temperature follows a multi-harmonic Fourier decomposition with HVAC feedback:
CO₂ Model (Mass-Balance ODE)
Concentration follows a discrete mass-balance ordinary differential equation [5, 6]:
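A single-zone mass balance is commonly written as V·dC/dt = N·G − Q·(C − C_out) and discretised with a forward Euler step. The Python sketch below follows that textbook form; the function name and every default value (room volume, ventilation rate, per-person generation rate, outdoor concentration) are illustrative placeholders, not IoTSyn's parameters.

```python
def co2_step(c_ppm, occupants, dt_s,
             volume_m3=50.0,      # assumed room volume
             vent_m3s=0.02,       # assumed ventilation flow
             gen_m3s=5e-6,        # assumed CO2 generation per occupant
             outdoor_ppm=420.0):  # assumed outdoor concentration
    """Advance indoor CO2 concentration by one forward Euler step."""
    source = occupants * gen_m3s * 1e6           # occupant emission, ppm * m^3/s
    removal = vent_m3s * (c_ppm - outdoor_ppm)   # dilution by ventilation
    return c_ppm + dt_s * (source - removal) / volume_m3
```

Under these placeholder values the steady state is C_out + N·G·10⁶/Q, so two occupants settle at 420 + 500 = 920 ppm. The step is stable as long as dt_s·Q/V stays well below 1.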
HVAC Model
Three-state deadband thermostat (OFF / HEATING / COOLING) with ±1.5°C hysteresis prevents rapid cycling [5].
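A deadband controller of this kind can be sketched as a small state machine in Python. The function name, the 21°C setpoint, and the rule that an active mode switches off once the setpoint is reached are assumptions of the sketch; only the three states and the ±1.5°C band come from the text.

```python
def hvac_next(state, temp_c, setpoint_c=21.0, band_c=1.5):
    """Return the next HVAC state for the current temperature (illustrative)."""
    if state == "HEATING":
        # Keep heating until the setpoint is reached (assumed switch-off rule).
        return "OFF" if temp_c >= setpoint_c else "HEATING"
    if state == "COOLING":
        # Keep cooling until the setpoint is reached (assumed switch-off rule).
        return "OFF" if temp_c <= setpoint_c else "COOLING"
    # From OFF, only act once the temperature leaves the deadband.
    if temp_c < setpoint_c - band_c:
        return "HEATING"
    if temp_c > setpoint_c + band_c:
        return "COOLING"
    return "OFF"
```

Because the switch-on and switch-off thresholds differ, small fluctuations around the setpoint do not toggle the system on every reading.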
Humidity
Humidity is Cholesky-correlated with temperature (ρ = −0.65), with room-specific modifiers: kitchen +8%, bathroom +15%.
Occupancy
Time-dependent Markov chain with hourly transition probabilities supporting 1–8 occupants.
4. IoT Network Security Generator
Generates network traffic with attack patterns for intrusion detection research.
Normal traffic: LogNormal packet sizes (μ_ln = 6.0, σ_ln = 0.7, median ≈ 403 bytes) and Exponential connection durations (λ = 0.5).
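The quoted median follows from the log-normal identity median = exp(μ_ln). The short Python check below reproduces it and also derives the mean packet size and mean connection duration implied by the same parameters; the variable names are illustrative.

```python
import math

mu_ln, sigma_ln = 6.0, 0.7   # log-normal parameters from the text
rate = 0.5                   # exponential rate from the text

median_bytes = math.exp(mu_ln)                      # about 403.4
mean_bytes = math.exp(mu_ln + sigma_ln ** 2 / 2.0)  # about 515, pulled up by the right tail
mean_duration = 1.0 / rate                          # 2.0, in the generator's time unit
```

The gap between the median and the mean reflects the right-skewed shape of the packet-size distribution.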
Attacks: bursts are driven by a Markov state machine with four phases (reconnaissance → escalation → peak → cooldown). Each of the five attack types (DoS, DDoS, Scan, Brute Force, Botnet) has a distinct traffic signature.
5. Predictive Maintenance Generator
Models industrial equipment health degradation for RUL (Remaining Useful Life) prediction.
Degradation Model
Derived Sensors
6. Medical IoT Generator
Patient vital signs with age-dependent baselines and circadian variation [9, 10].
7. IIoT Network Traffic Generator
Industrial OT protocol traffic (Modbus TCP, OPC UA, DNP3, BACnet, EtherNet/IP) with device roles (PLC, HMI, SCADA Server, RTU, Historian). Normal traffic uses protocol-specific packet sizes reflecting periodic polling patterns. Attack types include Man-in-the-Middle, Replay, False Data Injection, DoS, and Reconnaissance.
8. Connected Vehicle Generator
Vehicle telemetry using a driving state machine (stopped → accelerating → cruising → braking) modulated by time-dependent traffic density. Position via dead reckoning with GPS noise. Engine RPM follows a 5-gear model. Fuel consumption is a linear function of speed and acceleration.
9. Reproducibility
Every dataset includes a seed value stored in the CSV metadata header and the database. The same seed + domain + row count + parameters produce identical output.
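Implementation: PHP's Mersenne Twister (mt_rand) is seeded in the generator's constructor. All random state is instance-level, with no static variables, which prevents cross-contamination between generators. The seed is derived from the CRC32 of the input parameters plus a microsecond timestamp, so reproducing a dataset requires reusing the stored seed rather than deriving a new one.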
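The seeding scheme can be sketched in Python as follows. This is an analogy rather than IoTSyn's PHP code: the function names are hypothetical, the way parameters and timestamp are combined before hashing is an assumption, and Python's `random.Random` will not reproduce the sequence that PHP's mt_rand yields for the same seed.

```python
import json
import random
import time
import zlib


def derive_seed(params, timestamp_us=None):
    """Derive a 32-bit seed from parameters and a microsecond timestamp."""
    if timestamp_us is None:
        timestamp_us = time.time_ns() // 1000
    # Assumed combination: serialise the parameters and append the timestamp.
    payload = json.dumps(params, sort_keys=True) + str(timestamp_us)
    return zlib.crc32(payload.encode("utf-8"))  # unsigned 32-bit integer


def generate(seed, rows):
    """Produce `rows` values from an instance-level generator."""
    rng = random.Random(seed)  # state lives on this instance only
    return [rng.random() for _ in range(rows)]
```

Calling `generate` twice with the same stored seed and row count returns identical lists, whereas calling `derive_seed` again at a later time yields a different seed.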
Limitation: the sequence that PHP's mt_rand() produces for a given seed may vary across major PHP versions. IoTSyn v3.1 is validated on PHP 8.1+.
10. References
[1] Box, G.E.P. & Muller, M.E. (1958). A Note on the Generation of Random Normal Deviates. Annals of Mathematical Statistics, 29(2), 610–611.
[2] Knuth, D.E. (1997). The Art of Computer Programming, Vol 2: Seminumerical Algorithms, 3rd ed. Addison-Wesley.
[3] Marsaglia, G. & Tsang, W.W. (2000). A simple method for generating gamma variables. ACM TOMS, 26(3), 363–372.
[4] Gentle, J.E. (2009). Computational Statistics. Springer, Chapter 4.
[5] ASHRAE (2019). ASHRAE Handbook: HVAC Applications, Chapter 63.
[6] Persily, A.K. & de Jonge, L. (2017). Carbon dioxide generation rates for building occupants. Indoor Air, 27(5), 868–879.
[7] ISO 10816-1:1995. Mechanical vibration — Evaluation of machine vibration by measurements on non-rotating parts.
[8] Royal College of Physicians (2017). National Early Warning Score (NEWS) 2. RCP London.
[9] Refinetti, R. & Menaker, M. (1992). The circadian rhythm of body temperature. Physiology & Behavior, 51(3), 613–637.
[10] Venditti, F.J. et al. (2005). Circadian variation of heart rate variability. J Cardiovasc Electrophysiol, 16(1), 27–31.