Site Reliability Engineer II
1 week ago
Costa Rica
Company Overview
At Zuora, we power the world's shift to Modern Business. We're helping people and companies subscribe to a better way of doing business—one that's built on recurring relationships instead of one-time transactions, creating more value for customers, companies, and the planet.
As pioneers of the Subscription Economy, our platform and expertise help the world's most innovative organizations—from disruptive startups to global enterprises—monetize new business models, nurture long-term subscriber relationships, and optimize their digital experiences.
Join us as we transform industries and shape the future of how businesses grow.
Our Tech Stack: Linux Administration, Python, Docker, Kubernetes, MySQL, Kafka, ActiveMQ, Tomcat App & Web, Oracle, Load Balancers, Redis Cache, Debezium, AWS, WAF, Jenkins, GitOps, Terraform, Ansible, Puppet, Prometheus, Grafana, OpenTelemetry
This is a hybrid position, so you'll work both remotely and in the office.
The Team & Role
Join Zuora's high-impact Operations team and help power the backbone of our industry-leading SaaS platform.
In this role, you'll be at the center of maintaining and enhancing the reliability, scalability, and performance of Zuora's core systems — ensuring our customers around the world enjoy a seamless experience every time.
We're looking for an engineer who thrives on solving complex operational challenges, loves building automation-first solutions, and is passionate about driving innovation through AI and modern infrastructure practices.
- Make a measurable impact: Your work directly affects system uptime, performance, and customer satisfaction across Zuora's global platform.
- Build the future of operations: Shape how we leverage AI/ML for predictive monitoring, self-healing systems, and intelligent automation.
- Collaborate across disciplines: Partner with Product Engineering, Customer Support, Deal Desk, Global Services, and Sales to deliver a world-class, customer-centric operational model.
- Work with cutting-edge tech: From Kubernetes to Kafka and Terraform to OpenTelemetry, you'll use (and improve) the tools that define modern cloud infrastructure.
What you'll do
- Design and implement intelligent automation for infrastructure lifecycle management — including self-healing, anomaly detection, and automated remediation using IaC and AI-driven tooling.
- Apply AI/ML techniques for predictive monitoring and proactive performance optimization to prevent outages before they happen.
- Lead complex incident response and root cause analysis (RCA) efforts, embedding automation and learning into postmortems.
- Identify and remove reliability bottlenecks using dynamic scaling, telemetry instrumentation, and automated tuning.
- Continuously enhance runbooks and playbooks by integrating machine learning insights and automating manual tasks.
- Stay on the cutting edge of AIOps, distributed systems, and cloud-native reliability practices — and bring those learnings to influence strategic engineering decisions.
Your experience
- Strong hands-on experience in Linux Administration and Python Development.
- Experience working with Agentic AI or multi-agent frameworks to amplify operational capabilities.
- Deep expertise with Docker and Kubernetes, managing scalable, high-availability environments.
- Familiarity with Kafka, ActiveMQ, MySQL, Oracle, Redis, and modern caching/messaging systems.
- Understanding of AI/ML-based anomaly detection and predictive operations.
- Proven ability in incident management, RCA, and building systems that prevent recurrence.
- Experience designing and maintaining CI/CD pipelines, with strong observability and reliability focus.
- Proficiency with Prometheus, Grafana, and OpenTelemetry for real-time monitoring and anomaly detection.
- A continuous learning mindset and a passion for automation, innovation, and operational excellence.
- 1+ years of experience in a SaaS or cloud-native environment.
Nice to have
- Experience with Jenkins, Terraform, and advanced infrastructure-as-code practices.
Certifications:
- Red Hat Certified System Administrator (RHCSA)
- AWS / Azure / GCP Certifications
- Python Institute PCAP (Certified Associate in Python Programming)
- Docker Certified Associate (DCA) or Certified Kubernetes Administrator (CKA)
- SRE or advanced operations certifications
#ZEOLife at Zuora
At Zuora, we're constantly learning, innovating, and growing. Our people—known as ZEOs—are empowered to take ownership, challenge the status quo, and make a lasting impact.
We collaborate deeply, think boldly, and support one another to make what's next possible—for our customers, our communities, and each other.
We offer:
- Competitive compensation, bonus opportunities, and retirement programs
- Comprehensive medical, dental, and vision coverage
- Generous, flexible time off
- Paid holidays, wellness days, and a company-wide year-end break
- 6 months of fully paid parental leave
- Learning & development stipend
- Opportunities to give back, including volunteer time and donation matching
- Mental wellbeing resources and support
(Benefits may vary by location; details will be shared during the interview process.)
Location & Work Arrangements
Zuora teams are empowered to design flexible, intentional ways of working. Whether remote, hybrid, or in-office, we balance flexibility with accountability—to each other, our customers, and our mission.
For most roles, you'll have the freedom to work where you're most productive while staying connected to your team and the broader ZEO community.
Our Commitment to an Inclusive Workplace
Think, be and do you. At Zuora, different perspectives, experiences and contributions matter. Everyone counts. Zuora is proud to be an Equal Opportunity Employer committed to creating an inclusive environment for all.
Zuora does not discriminate on the basis of, and considers individuals seeking employment with Zuora without regard to, race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics.
We encourage candidates from all backgrounds to apply. Applicants in need of special assistance or accommodation during the interview process or in accessing our website may contact us by sending an email to