Tenant Health

6 hours ago


San José, San José, Costa Rica Armis Full-time

The Tenant Health & Monitoring Engineer is a hands-on technical role responsible for proactively monitoring customer tenants, identifying anomalies, and resolving issues across performance, data quality, collector stability, and integration health. This role plays a critical part in Armis' shift from reactive issue handling to proactive detection, reducing Customer Team & Support escalations, improving the customer experience, and enhancing the overall health of the Armis platform.

The Engineer will work closely with other teams to contribute to early detection and automation efforts.

This is a technical, operational role ideal for candidates with a background in a Network Operations Center (NOC), Site Reliability (troubleshooting, monitoring, and incident response), Technical Support, or a monitoring team.

Responsibilities:

  • Monitor tenant performance using Grafana, dashboards, reports, and internal logs.
  • Detect early signs of tenant degradation, including CPU/RAM spikes, container resets, throttling, and abnormal behavior.
  • Track and verify collector stability (online status, high CPU/RAM, disk usage, integration presence).
  • Monitor integration health (password/credential issues, connectivity failures, SPAN flow issues, network mapper gaps).
  • Identify data quality anomalies, duplication issues, and missing metadata trends.
  • Perform log analysis, SSH-based troubleshooting, and telemetry validation to determine root causes.
  • Resolve issues directly when possible and escalate to engineering through the appropriate path when required.
  • Follow established workflows, runbooks, SLAs, and escalation paths.
  • Provide updates and documentation in internal ticketing systems with high accuracy and clarity.
  • Support the team in refining monitoring processes, identifying gaps, and improving playbooks.
  • Generate daily or weekly summaries of monitored tenants, key findings, emerging patterns, and risks, and report as necessary.
  • Collaborate with other teams to resolve complex issues.
  • Assist in documenting troubleshooting procedures, new workflows, and technical runbooks.

Key Skills:

  • Proficiency with monitoring/observability tools (Grafana, Mode, or similar).
  • Hands-on troubleshooting experience with Linux systems, logs, and network fundamentals.
  • Ability to interpret telemetry signals and correlate them with tenant behavior and system impact.
  • Familiarity with SPAN traffic, integrations, and collector-based architectures (preferred).
  • Excellent attention to detail in monitoring, documentation, and analysis.
  • Capable of prioritizing and managing multiple issues in a high-volume environment.
  • Strong sense of ownership and urgency, with a drive to identify problems before they impact customers.
  • Ability to explain technical issues in a structured, logical manner.

Minimum requirements:

  • 3+ years in a NOC, Technical Operations, SRE, Support Engineering, MSP Monitoring, or similar role.
  • Hands-on troubleshooting experience with Linux, logs, network paths, and system telemetry.
  • Experience responding to monitoring alerts, dashboards, or operational incidents.
  • Familiarity with observability tools (Grafana, Prometheus, Splunk, Elastic, etc.).
  • Ability to interpret complex telemetry (CPU, RAM, DB performance, throughput, packet loss, retries, API failure patterns).
  • Strong analytical abilities; capable of building dashboards, defining KPIs, and identifying systemic issues across tenants.
  • Experience supporting enterprise environments and working with cross-functional technical teams.
  • A continuous-improvement mindset: believes that what was good enough yesterday must be improved today, and consistently pushes for higher standards, optimization, and refinement.

Preferred:

  • Direct NOC experience in monitoring network, cloud, or infrastructure environments.
  • Familiarity with Armis components such as collectors, related integrations, SPAN traffic, or similar architectures.
  • Understanding of cybersecurity fundamentals and enterprise network topologies.