Cloud Monitoring

Sisense Cloud uses a monitoring and observability framework to maintain the availability, performance, and reliability of our platform. Our Cloud Operations team, which includes a dedicated Site Reliability Engineering (SRE) team, proactively monitors and tunes system performance so that customers experience minimal disruption.

Monitoring & Observability

Sisense Cloud gathers Metrics, Events, Logs, and Traces (MELT) through a Single Pane of Glass (SPOG) platform, ensuring end-to-end visibility across all deployments.

  • Metrics Collection: Every deployment includes Prometheus, which collects key metrics. These metrics are displayed in local Grafana dashboards and forwarded to SPOG.

  • Logging: Fluentd collects logs locally and ships them to SPOG for centralized analysis.

  • Application Performance Monitoring (APM): We are integrating OpenTelemetry to improve visibility into application-level performance.

  • Key Monitored Metrics:

    • Infrastructure: CPU, memory, network, and disk usage.

    • Kubernetes Cluster Health: Node- and pod-level status and resource utilization.

    • Application-Level Metrics: Coverage is in progress and continues to expand.

  • Alerting & Automated Remediation:

    • Alerts are predefined for critical node- and pod-level metrics.

    • Automated remediation techniques are in place to minimize disruptions.
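
As an illustration of the alerting described above, the following minimal Python sketch shows how a threshold alert on a node- or pod-level metric might be evaluated. It is not the Sisense Cloud implementation: the AlertRule type, the NodeCpuHigh rule name, the 90% threshold, and the 120-second hold period are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlertRule:
    """Hypothetical alert definition, for illustration only."""
    name: str           # assumed rule name, e.g. "NodeCpuHigh"
    threshold: float    # the alert condition is "value > threshold"
    for_seconds: int    # how long the condition must hold before firing


def is_firing(rule: AlertRule, samples: List[Tuple[int, float]]) -> bool:
    """Decide whether the rule should fire.

    samples holds (unix_timestamp, value) pairs sorted oldest first.
    The rule fires only if every sample in the most recent unbroken run
    exceeds the threshold and that run spans at least rule.for_seconds.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = timestamp   # a new breach run begins here
        else:
            breach_start = None            # condition cleared, reset the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= rule.for_seconds


# Assumed example: CPU utilization as a fraction, sampled every 60 seconds.
rule = AlertRule(name="NodeCpuHigh", threshold=0.9, for_seconds=120)
sustained = [(0, 0.50), (60, 0.95), (120, 0.97), (180, 0.96)]
brief_spike = [(0, 0.95), (60, 0.50), (120, 0.97), (180, 0.96)]
```

Requiring the condition to hold for a period before firing is a common way to avoid alerting on short spikes. In this sketch, the sustained series fires the rule, while the brief_spike series does not because its latest breach has lasted only 60 seconds.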

Proactive Incident Response

Sisense Cloud prioritizes a proactive approach to incident detection and resolution:

  • Incident Detection & Escalation:

    • SPOG is used to manage business-critical alerts and ensure rapid response.

    • Automated monitoring detects application and performance issues before they impact users.

  • Automated Remediation:

    • Self-healing mechanisms and automated scripts help resolve common failures.

    • Proactive scaling ensures optimal resource allocation.

  • Service Level Agreements (SLAs):
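
The self-healing and proactive scaling listed under Automated Remediation above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Sisense Cloud implementation: the remediate and desired_replicas functions, their parameters, and the replica limits are hypothetical. The scaling calculation follows the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)), used here only as a familiar example.

```python
import math
from typing import Callable


def remediate(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_attempts: int = 3) -> str:
    """Try a bounded number of restarts before handing off to a human.

    Returns "healthy" if no action was needed, "recovered" if a restart
    fixed the problem, or "escalate" if all attempts failed.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_attempts):
        restart()
        if is_healthy():
            return "recovered"
    return "escalate"


def desired_replicas(current_replicas: int,
                     current_utilization_pct: int,
                     target_utilization_pct: int,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale replicas in proportion to load, clamped to assumed limits."""
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    raw = math.ceil(
        current_replicas * current_utilization_pct / target_utilization_pct
    )
    return max(min_replicas, min(max_replicas, raw))


# Assumed example: a service that becomes healthy after its second restart.
state = {"up": False, "restarts": 0}


def fake_restart() -> None:
    state["restarts"] += 1
    state["up"] = state["restarts"] >= 2


outcome = remediate(lambda: state["up"], fake_restart)
```

Bounding the number of restart attempts keeps automation from looping indefinitely on a failure it cannot fix. The "escalate" result marks the point at which an alert would be routed to an on-call engineer.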

Site Reliability Engineering (SRE)

The Sisense Cloud Operations team is responsible for the reliability, scalability, and efficiency of Sisense Cloud. Within it, the SRE team continuously improves platform performance and stability by applying engineering practices to operations.

SRE Responsibilities:

  • Incident Prevention & Response:

    • Implementing best practices for monitoring, alerting, and incident management.

    • Ensuring rapid incident resolution and postmortem analysis for continuous improvement.

  • Scalability & Reliability Enhancements:

    • Proactively optimizing system performance and infrastructure capacity.

    • Adopting cloud-native reliability engineering practices.

  • Continuous Improvement:

    • Automating manual operational tasks to reduce toil.

    • Enhancing observability through APM, logs, and telemetry data.

Sisense Cloud is committed to delivering a reliable, high-performing platform by continuously evolving our monitoring and SRE capabilities.

For more details on self-service monitoring, see Monitoring Sisense on Linux.