Cloud Monitoring
Sisense Cloud employs a comprehensive monitoring and observability framework to ensure the high availability, performance, and reliability of our platform. Our Cloud Operations team, which includes a dedicated Site Reliability Engineering (SRE) team, proactively monitors and optimizes system performance to provide a seamless experience for our customers.
Monitoring & Observability
Sisense Cloud gathers Metrics, Events, Logs, and Traces (MELT) through a Single Pane of Glass (SPOG) platform, ensuring end-to-end visibility across all deployments.
- Metrics Collection: Every deployment includes Prometheus, which ships key metrics to local Grafana dashboards and SPOG.
- Logging: Fluentd collects logs locally and ships them to SPOG for centralized analysis.
- Application Performance Monitoring (APM): We are actively integrating OpenTelemetry to enhance visibility into application-level performance.
- Key Monitored Metrics:
  - Infrastructure: CPU, memory, network, and disk usage.
  - Kubernetes Cluster Health: Node- and pod-level status and resource utilization.
  - Application-Level Metrics: In progress, with continuous expansion.
- Alerting & Automated Remediation:
  - Alerts are predefined for critical node and pod-level metrics.
  - Automated remediation techniques are in place to minimize disruptions.
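To make the idea of predefined alerts paired with automated remediation concrete, the following is a minimal sketch rather than a description of the actual Sisense Cloud rule set. The rule names, metric names, thresholds, and remediation labels are all hypothetical and chosen only for illustration; the source does not document the real values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    """A predefined threshold on one metric. All field values below are hypothetical."""
    name: str
    metric: str
    threshold: float
    remediation: str  # label of the automated action to try first


# Illustrative rules only; real thresholds and actions are assumptions.
RULES: List[AlertRule] = [
    AlertRule("NodeCpuHigh", "node_cpu_utilization", 0.90, "scale_out_node_pool"),
    AlertRule("NodeDiskAlmostFull", "node_disk_utilization", 0.85, "expand_volume"),
    AlertRule("PodRestartLoop", "pod_restarts_last_hour", 5, "reschedule_pod"),
]


def evaluate(samples: Dict[str, float], rules: List[AlertRule] = RULES) -> List[AlertRule]:
    """Return the rules whose metric is present and strictly above its threshold."""
    return [r for r in rules if samples.get(r.metric, float("-inf")) > r.threshold]


def plan_remediation(fired: List[AlertRule]) -> List[str]:
    """Map each fired alert to the automated action that would be attempted."""
    return [f"{r.name} -> {r.remediation}" for r in fired]


if __name__ == "__main__":
    # Hypothetical snapshot of node and pod-level metrics.
    snapshot = {
        "node_cpu_utilization": 0.95,
        "node_disk_utilization": 0.40,
        "pod_restarts_last_hour": 7,
    }
    for line in plan_remediation(evaluate(snapshot)):
        print(line)
```

In a real deployment this kind of evaluation would typically live in the monitoring stack's own alerting rules rather than in application code; the sketch only shows the threshold-to-action mapping.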
Proactive Incident Response
Sisense Cloud prioritizes a proactive approach to incident detection and resolution:
- Incident Detection & Escalation:
  - SPOG is used to manage business-critical alerts and ensure rapid response.
  - Automated monitoring detects application and performance issues before they impact users.
- Automated Remediation:
  - Self-healing mechanisms and automated scripts help resolve common failures.
  - Proactive scaling ensures optimal resource allocation.
- Service Level Agreements (SLAs):
  - Our SLAs are publicly available at Sisense Support Types & Response Times.
  - Service Level Objectives (SLOs) will be defined once the APM rollout is complete.
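Because SLOs have not yet been defined, the following is only a sketch of the arithmetic that an availability SLO usually implies: the error budget, meaning the downtime a target permits over a window. The 99.9% target and 30-day window are illustrative assumptions, not Sisense commitments, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (an assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative once it is exhausted."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of that budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

Tracking the remaining budget rather than raw uptime gives a single number that can drive alerting and release decisions once targets exist.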
Site Reliability Engineering (SRE)
The Sisense Cloud Operations team is responsible for ensuring the reliability, scalability, and efficiency of Sisense Cloud. The SRE team plays a crucial role in continuously improving platform performance and stability through engineering-driven operational excellence.
SRE Responsibilities:
- Incident Prevention & Response:
  - Implementing best practices for monitoring, alerting, and incident management.
  - Ensuring rapid incident resolution and postmortem analysis for continuous improvement.
- Scalability & Reliability Enhancements:
  - Proactively optimizing system performance and infrastructure capacity.
  - Adopting cloud-native reliability engineering practices.
- Continuous Improvement:
  - Automating manual operational tasks to reduce toil.
  - Enhancing observability through APM, logs, and telemetry data.
Sisense Cloud is committed to delivering a reliable, high-performing platform by continuously evolving our monitoring and SRE capabilities.
For more details on self-service monitoring, see Monitoring Sisense on Linux.