Cloud Monitoring

Last updated: June 10, 2026

Tier: Grow, Enterprise
Deployment: Cloud

Sisense Cloud uses a comprehensive monitoring and observability framework to keep our platform highly available, performant, and reliable. Our Cloud Operations team, which includes a dedicated Site Reliability Engineering (SRE) team, proactively monitors and tunes system performance to provide a seamless experience for our customers.

Monitoring & Observability

Sisense Cloud gathers Metrics, Events, Logs, and Traces (MELT) into a Single Pane of Glass (SPOG) platform, providing end-to-end visibility across all deployments.

Metrics Collection: Every deployment includes Prometheus, which collects key metrics that are displayed in local Grafana dashboards and forwarded to SPOG.
Logging: Fluentd collects logs locally and ships them to SPOG for centralized analysis.
Application Performance Monitoring (APM): We are integrating OpenTelemetry to improve visibility into application-level performance.
Key Monitored Metrics:
- Infrastructure: CPU, memory, network, and disk usage.
- Kubernetes Cluster Health: Node- and pod-level status and resource utilization.
- Application-Level Metrics: Coverage is in progress and expanding continuously.
Alerting & Automated Remediation:
- Alerts are predefined for critical node- and pod-level metrics.
- Automated remediation techniques are in place to minimize disruptions.

Proactive Incident Response

Sisense Cloud takes a proactive approach to incident detection and resolution:

Incident Detection & Escalation:
- SPOG is used to manage business-critical alerts and ensure rapid response.
- Automated monitoring is designed to detect application and performance issues before they affect users.
Automated Remediation:
- Self-healing mechanisms and automated scripts help resolve common failures.
- Proactive scaling keeps resource allocation matched to demand.
Service Level Agreements (SLAs):
- Our SLAs are published on the Sisense Support Types & Response Times page.
- Service Level Objectives (SLOs) will be defined once APM is fully rolled out.

Site Reliability Engineering (SRE)

The Sisense Cloud Operations team is responsible for the reliability, scalability, and efficiency of Sisense Cloud. Within it, the SRE team continuously improves platform performance and stability by applying engineering practices to day-to-day operations.

SRE Responsibilities:

Incident Prevention & Response:
- Implementing best practices for monitoring, alerting, and incident management.
- Resolving incidents quickly and conducting postmortem analyses to drive continuous improvement.
Scalability & Reliability Enhancements:
- Proactively optimizing system performance and infrastructure capacity.
- Adopting cloud-native reliability engineering practices.
Continuous Improvement:
- Automating manual operational tasks to reduce toil.
- Enhancing observability through APM, logs, and telemetry data.

Sisense is committed to delivering a reliable, high-performing cloud platform by continuously evolving its monitoring and SRE capabilities.

For more details on self-service monitoring, see Monitoring Sisense on Linux.