Modular Probe Engine (Plug-in Architecture)

Problem (v1):
Ping tests are hard-coded into the service (Layer-7 Application)

Improvement:
Introduce a probe interface, then implement:

  • HTTPProbe
  • gRPCProbe
  • TCPConnectProbe
  • DatabaseProbe (ClickHouse / Redis ping), etc.

Benefit:

  • Easily extendable
  • Other teams can contribute probes
  • Keeps Insight maintainable long-term

Baseline Normalization & Noise Reduction

Problem (v1):

Raw network metrics are noisy (daily variation, cross-region jitter).

Improvement:

Add:

  • Baseline calculation per hour-of-day
  • Deviation-based alerts instead of static thresholds
  • Jitter smoothing (EWMA)
  • Automatic outlier filtering

Benefit:

  • Far fewer false alerts
  • Better signal quality
  • Helps identify true performance regressions

Probe Scheduling Improvements

Problem (v1):

Probes run at fixed intervals; not load aware.

Improvement:

  • Adaptive frequency (increase probing when degradation detected)
  • Randomized intervals to avoid thundering herd
  • Sliding window metrics

Benefit:
Lower overhead + more accurate detection.


Historical Trend Analysis

Problem (v1):

Only real-time Prometheus metrics, no long-term analytics.

Improvement:

  • Export metrics to Thanos for long-term retention
  • Add trend dashboards (weekly/monthly latency patterns)
  • Correlate network latency with deployment events

Benefit:

  • Understand long-term baseline
  • Detect slow regressions
  • Improves incident root-cause analysis