How IoTSyn Works

Complete documentation of the mathematical models, probability distributions, and physical equations behind every dataset IoTSyn generates.

Technical Report v3.1 · 10 Academic References

1. Design Philosophy

IoTSyn generates synthetic IoT data using fully documented, physics-based statistical models. Unlike approaches that rely on generative AI models (GANs, VAEs, diffusion models), IoTSyn produces data from explicit mathematical equations where every parameter can be inspected, adjusted, and justified.

This transparency is critical for research: when training a machine learning model on synthetic data, the researcher must understand exactly what statistical properties the training data exhibits — and why.

Key Principles

Transparent: Every distribution, correlation, and temporal pattern is derived from a documented equation with explicit parameters.
Reproducible: Same seed + parameters = identical output. All random state is instance-level.
Grounded: Models are based on physics (thermodynamics, fluid dynamics), physiology (circadian rhythms), and network theory.

2. Statistical Engine

The core engine provides probability distributions and stochastic processes used by all domain generators.

2.1 Gaussian Distribution (Box-Muller Transform)

Normal variates are generated using the Box-Muller transform [1]. Given U₁, U₂ drawn independently from Uniform(0,1):

Z₀ = √(−2 ln U₁) · cos(2π U₂)
Z₁ = √(−2 ln U₁) · sin(2π U₂)
X = μ + σ · Z

The spare value Z₁ is cached as an instance variable (not static) to prevent cross-contamination between generator instances.
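A minimal Python sketch of the transform with a per-instance spare value follows. It is illustrative only and not IoTSyn's code: the class name `GaussianSource` is hypothetical, and Python's `random.Random` stands in for the seeded uniform source.

```python
import math
import random


class GaussianSource:
    """Box-Muller sampler that caches the spare variate per instance."""

    def __init__(self, seed):
        self._rng = random.Random(seed)   # instance-level uniform source
        self._spare = None                # cached Z1, never shared across instances

    def standard_normal(self):
        if self._spare is not None:
            z, self._spare = self._spare, None
            return z
        u1 = 1.0 - self._rng.random()     # map [0, 1) to (0, 1] so log(u1) is finite
        u2 = self._rng.random()
        radius = math.sqrt(-2.0 * math.log(u1))
        self._spare = radius * math.sin(2.0 * math.pi * u2)   # Z1, kept for the next call
        return radius * math.cos(2.0 * math.pi * u2)          # Z0

    def normal(self, mu, sigma):
        return mu + sigma * self.standard_normal()
```

Because the spare lives on the instance, two sources built with the same seed return the same sequence regardless of how calls to them are interleaved.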

2.2 Poisson Distribution

For λ < 30, Knuth's algorithm [2] is used. For λ ≥ 30, the normal approximation N(λ, √λ), with mean λ and standard deviation √λ, is used; the Central Limit Theorem justifies it for large λ.
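The two regimes can be sketched in Python as below. The function name `poisson` is hypothetical, and rounding the normal draw to the nearest integer and clamping it at zero are assumptions, since the text does not say how the continuous approximation is converted to a count.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate using the two regimes described above."""
    if lam < 30:
        # Knuth: multiply uniforms until the running product drops to exp(-lam) or below.
        limit = math.exp(-lam)
        count = 0
        product = rng.random()
        while product > limit:
            count += 1
            product *= rng.random()
        return count
    # Normal approximation with mean lam and standard deviation sqrt(lam).
    # Rounding and the clamp at zero are assumptions of this sketch.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))
```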

2.3 Additional Distributions

Distribution | Method                 | Use Case
Exponential  | X = −ln(U) / λ         | Inter-arrival times, connection durations
Log-Normal   | Y = exp(N(μ, σ))       | Packet sizes, response times
Gamma        | Marsaglia & Tsang [3]  | Aggregate traffic, waiting times
Weibull      | X = λ·(−ln(U))^(1/k)   | Equipment failure modeling
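The closed-form rows of the table translate directly into inverse-transform samplers. The Python sketch below uses hypothetical function names and omits the Gamma sampler, whose rejection method is longer; it is an illustration, not IoTSyn's code.

```python
import math
import random


def exponential(rate, rng):
    """X = -ln(U) / rate."""
    u = 1.0 - rng.random()            # u in (0, 1], avoids log(0)
    return -math.log(u) / rate


def weibull(scale, shape, rng):
    """X = scale * (-ln(U))^(1/shape)."""
    u = 1.0 - rng.random()
    return scale * (-math.log(u)) ** (1.0 / shape)


def log_normal(mu, sigma, rng):
    """Y = exp(N(mu, sigma)), with sigma the standard deviation of ln Y."""
    return math.exp(rng.gauss(mu, sigma))
```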

2.4 Bivariate Correlation (Cholesky Method)

Correlated pairs are generated using bivariate Cholesky decomposition [4]:

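Z_x = (X − μ_x) / σ_x
Z_y = ρ · Z_x + √(1 − ρ²) · Z_ind
Y = μ_y + σ_y · Z_y

where Z_ind ~ N(0,1) is drawn independently of X. The construction gives Corr(X, Y) = ρ exactly for the underlying distributions; the sample correlation of a finite dataset fluctuates around ρ.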
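The three equations can be checked numerically with a short Python sketch. The function names and the example means and standard deviations in the usage note are hypothetical; only ρ = −0.65 is taken from elsewhere in this document.

```python
import math
import random


def correlated_partner(x, mu_x, sigma_x, mu_y, sigma_y, rho, rng):
    """Return a Y value correlated with the given X draw."""
    z_x = (x - mu_x) / sigma_x
    z_ind = rng.gauss(0.0, 1.0)                       # independent N(0, 1)
    z_y = rho * z_x + math.sqrt(1.0 - rho * rho) * z_ind
    return mu_y + sigma_y * z_y


def sample_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)
```

Drawing, say, 20,000 values of X from N(22, 1.5) and pairing each with `correlated_partner(x, 22, 1.5, 45, 8, -0.65, rng)` should give a sample correlation close to −0.65, with the gap shrinking as the sample grows.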

2.5 Autoregressive Noise — AR(1)

Temporally correlated noise prevents unrealistic jumps between consecutive readings:

ε(t) = φ · ε(t−1) + η(t)

where η(t) ~ N(0, σ²(1 − φ²)), the second argument being the variance
Marginal variance: σ²
Autocorrelation at lag k: φ^k
Typical values: φ = 0.85 (temperature), φ = 0.9 (vital signs)
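A minimal Python sketch of the recursion shows why the innovation variance is scaled by (1 − φ²): the scaling keeps the marginal variance at σ². The function name `ar1_series` is hypothetical, and starting the series from a draw of the stationary distribution is an assumption of this sketch.

```python
import math
import random


def ar1_series(n, phi, sigma, rng):
    """Generate n AR(1) noise values with marginal standard deviation sigma."""
    innovation_sd = sigma * math.sqrt(1.0 - phi * phi)   # sd of eta(t)
    eps = rng.gauss(0.0, sigma)    # assumed start: a draw from the stationary distribution
    series = []
    for _ in range(n):
        eps = phi * eps + rng.gauss(0.0, innovation_sd)
        series.append(eps)
    return series
```

With φ = 0.85 and σ = 0.3, the sample variance of a long series approaches 0.09 and the lag-1 autocorrelation approaches 0.85.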

2.6 Markov Chain Transitions

Discrete states (occupancy, activity, attack phases) transition according to time-dependent probability matrices. Given the current state i, the next state j is drawn with probability P(X_{n+1} = j | X_n = i) = p_ij by applying the inverse CDF method to row i of the matrix.
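Inverse-CDF sampling of a transition row can be sketched in Python as follows. The function names and the indexing of matrices by hour of day are assumptions made for illustration; the actual matrices are not given in this document.

```python
import random


def next_state(current, matrix, rng):
    """Sample the next state from row `current` by walking the cumulative sum."""
    u = rng.random()
    cumulative = 0.0
    for state, probability in enumerate(matrix[current]):
        cumulative += probability
        if u < cumulative:
            return state
    return len(matrix[current]) - 1   # guard against floating-point round-off


def simulate(hours, matrices_by_hour, start, rng):
    """Walk the chain, picking the transition matrix for each hour of the day."""
    state = start
    path = []
    for hour in hours:
        state = next_state(state, matrices_by_hour[hour % 24], rng)
        path.append(state)
    return path
```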

3. Smart Home Generator

Generates indoor environment sensor data: temperature, humidity, CO₂, light, occupancy, and HVAC status.

Temperature Model

Indoor temperature follows a multi-harmonic Fourier decomposition with HVAC feedback:

T(t) = T_set + A₁·sin(ω₁t + φ₁) + A₂·sin(2ω₁t + φ₂) + T_season(d) + ε(t)

A₁ = 2.5°C (primary daily cycle, peak at 15:00)
A₂ = 0.75°C (heating/cooling asymmetry)
T_season = 2.0°C amplitude over 365-day cycle
ε(t) = AR(1) noise (φ = 0.85, σ = 0.3°C)
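The deterministic part of the model can be sketched in Python under several assumptions not stated above: ω₁ = 2π/24 rad per hour, φ₁ chosen so the primary harmonic peaks at 15:00, φ₂ = 0, a cosine seasonal term peaking on a hypothetical day 200, and a hypothetical setpoint of 21°C. HVAC feedback is left out, and the AR(1) term is passed in as `noise`.

```python
import math

OMEGA_1 = 2.0 * math.pi / 24.0             # assumed: one cycle per 24 hours
PHI_1 = math.pi / 2.0 - OMEGA_1 * 15.0     # places the primary peak at 15:00


def indoor_temperature(hour, day, t_set=21.0, phi_2=0.0,
                       season_peak_day=200, noise=0.0):
    """Evaluate T(t) without HVAC feedback; defaults are illustrative assumptions."""
    daily = 2.5 * math.sin(OMEGA_1 * hour + PHI_1)               # A1 term
    asymmetry = 0.75 * math.sin(2.0 * OMEGA_1 * hour + phi_2)    # A2 term
    season = 2.0 * math.cos(2.0 * math.pi * (day - season_peak_day) / 365.0)
    return t_set + daily + asymmetry + season + noise
```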

CO₂ Model (Mass-Balance ODE)

Concentration follows a discretized mass-balance ordinary differential equation [5, 6]:

dC/dt = (n · G − Q · (C − C_out)) / V

n = number of occupants
G = 5.0 ppm·m³/s per person (generation rate)
Q = ventilation rate (room-dependent, m³/s)
V = room volume (m³)
C_out = 420 ppm (outdoor baseline)
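One way to discretize the equation is a forward-Euler step, sketched below in Python. The scheme is an assumption (the text does not name the discretization), and the function names and the room values used in the usage note are hypothetical.

```python
G_PPM_M3_PER_S = 5.0     # per-person generation rate from the text
C_OUT_PPM = 420.0        # outdoor baseline from the text


def co2_step(c_ppm, occupants, q_m3s, volume_m3, dt_s):
    """Advance the concentration by one forward-Euler step of dt_s seconds."""
    dc_dt = (occupants * G_PPM_M3_PER_S - q_m3s * (c_ppm - C_OUT_PPM)) / volume_m3
    return c_ppm + dt_s * dc_dt


def co2_steady_state(occupants, q_m3s):
    """Concentration at which dC/dt = 0, i.e. C_out + n*G/Q."""
    return C_OUT_PPM + occupants * G_PPM_M3_PER_S / q_m3s
```

For a hypothetical 50 m³ room with Q = 0.02 m³/s and two occupants, the steady state is 420 + 10/0.02 = 920 ppm, approached with time constant V/Q = 2500 s. The Euler step is stable only for time steps shorter than 2V/Q.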

HVAC Model

Three-state deadband thermostat (OFF / HEATING / COOLING) with ±1.5°C hysteresis prevents rapid cycling [5].
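A deadband thermostat of this kind can be sketched in Python as a small state function. The switching rule here is an assumption: the unit turns on once the temperature leaves the ±1.5°C band around the setpoint and turns off once the temperature returns to the setpoint. The text does not specify the exact thresholds, and the name `hvac_step` is hypothetical.

```python
def hvac_step(state, temp, setpoint, band=1.5):
    """Return the next HVAC state: 'OFF', 'HEATING', or 'COOLING'."""
    if state == "OFF":
        if temp < setpoint - band:
            return "HEATING"
        if temp > setpoint + band:
            return "COOLING"
        return "OFF"
    if state == "HEATING":
        # Keep heating until the setpoint is reached (assumed switch-off rule).
        return "OFF" if temp >= setpoint else "HEATING"
    # Remaining case: state == "COOLING"
    return "OFF" if temp <= setpoint else "COOLING"
```

Because the on and off thresholds differ, a temperature hovering near one edge of the band does not toggle the unit on every reading.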

Humidity

Humidity is Cholesky-correlated with temperature (ρ = −0.65), with room-specific modifiers: kitchen +8%, bathroom +15%.

Occupancy

Occupancy follows a time-dependent Markov chain with hourly transition probabilities and supports 1–8 occupants.

4. IoT Network Security Generator

Generates network traffic with attack patterns for intrusion detection research.

Normal traffic: Log-Normal packet sizes (μ_ln = 6.0, σ_ln = 0.7, median ≈ 403 bytes) and Exponential connection durations (λ = 0.5).
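As a quick check of the stated parameters, the standard log-normal and exponential moment formulas give the values below. The symbols S (packet size) and D (connection duration) are introduced here for illustration, and the time unit of D is not stated in the text.

```latex
\begin{align*}
\operatorname{median}(S) &= e^{\mu_{\ln}} = e^{6.0} \approx 403.4 \text{ bytes} \\
\mathbb{E}[S] &= e^{\mu_{\ln} + \sigma_{\ln}^{2}/2} = e^{6.245} \approx 515.4 \text{ bytes} \\
\mathbb{E}[D] &= 1/\lambda = 1/0.5 = 2
\end{align*}
```

The mean packet size exceeds the median because the log-normal distribution is right-skewed.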

Attacks: Arrive in bursts via a Markov state machine with four phases: reconnaissance → escalation → peak → cooldown. Five attack types (DoS, DDoS, Scan, Brute Force, Botnet) have distinct traffic signatures.

5. Predictive Maintenance Generator

Models industrial equipment health degradation for RUL (Remaining Useful Life) prediction.

Degradation Model

D(t) = 1 − exp(−(t / L)^β)

L = design life (hours, machine-type dependent)
β = Weibull shape parameter (1.8–3.0)
β > 1 models wear-out failures [7]
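The degradation curve is the Weibull cumulative distribution function, so it can be evaluated directly. In the Python sketch below the function name is hypothetical, and the design life used in the usage note is an example value, not an IoTSyn default.

```python
import math


def degradation(t_hours, design_life_hours, beta):
    """D(t) = 1 - exp(-(t / L)^beta), ranging from 0 (new) toward 1 (failed)."""
    return 1.0 - math.exp(-((t_hours / design_life_hours) ** beta))
```

At t = L the result is 1 − e⁻¹ ≈ 0.632 for any β. A larger β keeps D near zero for longer and then steepens the rise around the design life, for example with `degradation(5000, 10000, 1.8)` compared with `degradation(5000, 10000, 3.0)`.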

Derived Sensors

Temperature: T = T_ambient + T_load·(RPM/RPM_rated)² + 30·D² + AR(1)
Vibration: V = V_base · (1 + 8·D²) + imbalance [ISO 10816]
Current: I = I_rated · (Load/100) · (1 + 0.3·D)
Pressure: P = P_design · (1 − 0.3·D) (seal wear model)
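The deterministic parts of the four sensor equations can be sketched in Python as below. The function name and all argument values in the usage note are hypothetical, and the AR(1) noise and imbalance terms are omitted because their parameters are not given here.

```python
def derived_sensors(d, rpm, rpm_rated, load_pct,
                    t_ambient, t_load, v_base, i_rated, p_design):
    """Map degradation d in [0, 1] to noise-free sensor readings."""
    return {
        "temperature": t_ambient + t_load * (rpm / rpm_rated) ** 2 + 30.0 * d ** 2,
        "vibration": v_base * (1.0 + 8.0 * d ** 2),       # imbalance term omitted
        "current": i_rated * (load_pct / 100.0) * (1.0 + 0.3 * d),
        "pressure": p_design * (1.0 - 0.3 * d),
    }
```

At full rated speed and load, moving from d = 0 to d = 1 adds 30°C to temperature, multiplies vibration by 9, raises current by 30%, and lowers pressure by 30%.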

6. Medical IoT Generator

Generates patient vital signs with age-dependent baselines and circadian variation [9, 10].

Heart Rate: HR(t) = HR_base(age) + HR_circadian(h) + HR_activity + ε(t)
  HR_circadian = 5 bpm amplitude, nadir at 3–4 AM
Blood Pressure: Correlated with HR (ρ ≈ 0.4 systolic)
  Includes morning surge model
SpO₂: Sleep apnea events for elderly (15% probability, 1–5 AM)
Glucose: Fasting baseline + Gaussian meal spikes at 8:00, 13:00, 19:00
  Diabetic variant: higher baseline, larger spikes
Health Status: Computed via NEWS2-inspired scoring [8]
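Two of these components can be sketched in Python under explicit assumptions. The cosine shape of the circadian term and the nadir at 03:30 are one way to realize "5 bpm amplitude, nadir at 3–4 AM". The glucose baseline, spike height, and spike width are hypothetical values; only the meal times come from the text.

```python
import math


def hr_circadian(hour, amplitude=5.0, nadir_hour=3.5):
    """Circadian heart-rate offset in bpm; lowest at nadir_hour (assumed cosine shape)."""
    return -amplitude * math.cos(2.0 * math.pi * (hour - nadir_hour) / 24.0)


def glucose(hour, baseline=90.0, spike=40.0, width_h=0.75,
            meals=(8.0, 13.0, 19.0)):
    """Fasting baseline plus one Gaussian bump per meal; numeric defaults are illustrative."""
    bumps = sum(spike * math.exp(-0.5 * ((hour - m) / width_h) ** 2) for m in meals)
    return baseline + bumps
```

A diabetic variant in this sketch would pass a higher `baseline` and a larger `spike`.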

7. IIoT Network Traffic Generator

Generates industrial OT protocol traffic (Modbus TCP, OPC UA, DNP3, BACnet, EtherNet/IP) with device roles (PLC, HMI, SCADA Server, RTU, Historian). Normal traffic uses protocol-specific packet sizes reflecting periodic polling patterns. Attack types include Man-in-the-Middle, Replay, False Data Injection, DoS, and Reconnaissance.

8. Connected Vehicle Generator

Generates vehicle telemetry using a driving state machine (stopped → accelerating → cruising → braking) modulated by time-dependent traffic density. Position is computed by dead reckoning with added GPS noise. Engine RPM follows a 5-gear model. Fuel consumption is a linear function of speed and acceleration.
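Dead reckoning with GPS noise can be sketched in Python as below. This is illustrative only: the flat local x/y frame in metres, the heading measured in radians from the x-axis, and the 3 m noise standard deviation are all assumptions, as are the function names.

```python
import math
import random


def dead_reckon(x_m, y_m, speed_mps, heading_rad, dt_s):
    """Advance the true position by speed * dt along the current heading."""
    return (x_m + speed_mps * math.cos(heading_rad) * dt_s,
            y_m + speed_mps * math.sin(heading_rad) * dt_s)


def gps_fix(x_m, y_m, rng, sigma_m=3.0):
    """Return a noisy observation of the true position (assumed Gaussian error)."""
    return x_m + rng.gauss(0.0, sigma_m), y_m + rng.gauss(0.0, sigma_m)
```

The noise is applied only to the reported fix, not to the integrated position, so observation error does not accumulate along the track.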

9. Reproducibility

Every dataset includes a seed value stored in the CSV metadata header and the database. The same seed + domain + row count + parameters produce identical output.

Implementation: PHP's Mersenne Twister (mt_rand) is seeded via the constructor. All random state is instance-level, with no static variables, which prevents cross-contamination between generators. The seed is derived from the CRC32 of the input parameters plus a microsecond timestamp, so it differs between runs; reproducing a dataset requires reusing the stored seed.

Limitation: The sequence that PHP's mt_rand() produces for a given seed may differ across major PHP versions. IoTSyn v3.1 is validated on PHP 8.1+.
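The reproducibility contract can be illustrated with a short Python sketch. It is an analogue, not IoTSyn's PHP code: `derive_seed` and `Generator` are hypothetical names, `zlib.crc32` stands in for PHP's CRC32, and the way the parameters and timestamp are serialized before hashing is an assumption.

```python
import random
import time
import zlib


def derive_seed(params):
    """CRC32 of the parameters plus a microsecond timestamp (differs on every call)."""
    payload = repr(sorted(params.items())) + str(time.time_ns() // 1000)
    return zlib.crc32(payload.encode("utf-8"))


class Generator:
    """Holds its own random state so instances cannot affect each other."""

    def __init__(self, seed):
        self.seed = seed                   # stored so the dataset can be regenerated
        self._rng = random.Random(seed)    # instance-level state, nothing static

    def rows(self, n):
        return [self._rng.random() for _ in range(n)]
```

Constructing a second `Generator` with the stored `seed` reproduces the same rows, even if other generator instances draw values in between.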

10. References

  1. Box, G.E.P. & Muller, M.E. (1958). A Note on the Generation of Random Normal Deviates. Annals of Mathematical Statistics, 29(2), 610–611.
  2. Knuth, D.E. (1997). The Art of Computer Programming, Vol 2: Seminumerical Algorithms, 3rd ed. Addison-Wesley.
  3. Marsaglia, G. & Tsang, W.W. (2000). A simple method for generating gamma variables. ACM TOMS, 26(3), 363–372.
  4. Gentle, J.E. (2009). Computational Statistics. Springer, Chapter 4.
  5. ASHRAE (2019). ASHRAE Handbook: HVAC Applications, Chapter 63.
  6. Persily, A.K. & de Jonge, L. (2017). Carbon dioxide generation rates for building occupants. Indoor Air, 27(5), 868–879.
  7. ISO 10816-1:1995. Mechanical vibration — Evaluation of machine vibration by measurements on non-rotating parts.
  8. Royal College of Physicians (2017). National Early Warning Score (NEWS) 2. RCP London.
  9. Refinetti, R. & Menaker, M. (1992). The circadian rhythm of body temperature. Physiology & Behavior, 51(3), 613–637.
  10. Venditti, F.J. et al. (2005). Circadian variation of heart rate variability. J Cardiovasc Electrophysiol, 16(1), 27–31.
