Site Reliability Engineering Lead

3 weeks ago


Kuala Lumpur, Kuala Lumpur, Malaysia JAC Recruitment Full time

COMPANY OVERVIEW
A well-established client of ours in Kuala Lumpur is seeking a Site Reliability Engineering Lead.

JOB RESPONSIBILITIES

  1. Team Leadership:
    Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement.
    Define clear goals, performance metrics, and development plans for the team.
  2. System Reliability & Performance:
    Design and implement strategies to improve system reliability, scalability, and performance.
    Conduct root cause analysis of production incidents and develop preventive solutions.
  3. Infrastructure Management:
    Oversee the deployment, monitoring, and management of production environments.
    Collaborate with development teams to design cloud-native infrastructure and architecture.
  4. Automation & CI/CD:
    Drive automation of operational processes, reducing manual intervention and response times.
    Optimize CI/CD pipelines to ensure smooth and rapid deployments.
  5. Incident Management:
    Establish incident response protocols and lead efforts during major incidents.
    Ensure robust monitoring and alerting systems are in place to proactively detect issues.
  6. Collaboration & Communication:
    Act as a liaison between engineering, operations, and other teams to align objectives.
    Share insights and best practices with internal stakeholders to enhance overall system resilience.

JOB REQUIREMENTS

  1. Technical Expertise:
    Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.).
    Proficiency in programming/scripting languages (Python, Go, Shell, etc.).
    Deep knowledge of Kubernetes, containerization, and distributed systems.
  2. Leadership Skills:
    Proven track record of leading SRE or DevOps teams and managing large-scale production environments.
    Strong decision-making, prioritization, and problem-solving capabilities.
  3. Monitoring & Metrics:
    Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems.
    Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
  4. Soft Skills:
    Excellent communication and collaboration skills to work across cross-functional teams.
    Ability to mentor and upskill team members, fostering a learning-oriented culture.
  5. Experience:
    At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.

  • Kuala Lumpur, Kuala Lumpur, Malaysia Tetrasparks Sdn Bhd Full time

    1 week ago. We are seeking an experienced and dynamic Site Reliability Engineering Lead to drive the reliability, scalability, and performance of our production systems in the I-Gaming industry. In this role, you will lead a team...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Businesslist Full time

    A successful Site Reliability Engineer will possess: Hands-on experience with Azure to manage and optimize cloud infrastructure. Exceptional problem-solving abilities, with the resilience to perform well under pressure. Proficiency in Infrastructure as Code (e.g., Terraform) to automate and streamline processes. Expertise in containerization technologies like...


  • Kuala Lumpur, Kuala Lumpur, Malaysia SWIFT Full time

    About the Position: We are seeking a highly skilled Senior Site Reliability Engineer to lead our team responsible for ensuring the reliability, uptime, and performance of our mission-critical systems. As a Senior SRE Manager, you will be responsible for providing technical leadership, mentorship, and guidance to your team members, as well as collaborating with...


  • Kuala Lumpur, Kuala Lumpur, Malaysia beBee Careers Full time

    Job Summary: We are seeking a highly skilled and experienced Site Reliability Engineering Lead to join our team. As a key member of our organization, you will be responsible for leading a team of engineers in designing, implementing, and maintaining high-quality cloud infrastructure solutions. Main Responsibilities: Lead a team of engineers in the design and...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Businesslist Full time

    Ensure high availability and performance of AWS infrastructure, applications, and services. Design, architect, and implement auto-scaling, load balancing, and failover strategies in AWS environments. Automate repetitive tasks, workflows, and deployments using scripting or various technologies (e.g., PowerShell, Python, Terraform, Ansible, etc.). Deploy...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Unison Consulting Full time

    Site Reliability Engineer (Mandarin Speaker). 5 days ago. Overview: As a Site Reliability Engineer (SRE), you will play a key role in maintaining the reliability and performance of critical services. Your expertise will help bridge the gap between development and operations, ensuring robust, scalable, and responsive...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Randstad (Schweiz) AG Full time

    Site Reliability Engineer / SRE (Hybrid) | KL. Overview: As a Site Reliability Engineer (SRE), you will play a key role in maintaining the reliability and performance of critical services. Your expertise will help bridge the gap between development and operations, ensuring robust, scalable, and responsive infrastructure. This role emphasizes strong system...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Businesslist Full time

    A highly skilled Site Reliability Engineer will play a pivotal role in ensuring the smooth operation of our cloud infrastructure. They will be responsible for managing and optimizing our Azure environment to meet the evolving needs of our business. This is an exciting opportunity for a technical expert to join our team at Businesslist, where we are dedicated...


  • Kuala Lumpur, Kuala Lumpur, Malaysia Hunters International Sdn Bhd Full time

    Overview: As a Site Reliability Engineer (SRE), you will play a key role in maintaining the reliability and performance of critical services. Your expertise will help bridge the gap between development and operations, ensuring robust, scalable, and responsive infrastructure. This role emphasizes strong system architecture and design principles, focusing on...


  • Kuala Lumpur, Kuala Lumpur, Malaysia ServeDeck Innovation Sdn Bhd Full time

    Responsibilities: Design, manage, and optimize Continuous Integration and Continuous Delivery. Monitor application and infrastructure (cloud) to ensure system security and availability using the real-time dashboard and active alerting...