Site Reliability Engineer
3 days ago
We are a technology company at the forefront of high-performance computing (HPC) with a strong foundation in applied physics. Our innovative hardware and software solutions for the global technology and resource sectors enable our clients to leverage large and complex data sets.
We operate three world-class green supercomputer clusters, running a large suite of scientific applications. Our research and development team of mathematicians, geophysicists, and software engineers is responsible for creating and maintaining this suite of signal-processing and subsurface imaging tools.
We run a large distributed application, written in a combination of Python, C/C++ and Java, on thousands of servers. We are looking for an experienced software engineer to improve the reliability and fault tolerance of the system in cooperation with our platform and IT teams. Part of the role will be maintaining and extending the graph-based task scheduler that underlies our distributed application.
When submitting your application, show that you have attention to detail by including 'Shibboleth' in your cover letter.
Responsibilities:
- Identifying gaps in our application and system monitoring, focusing on the needs of our distributed cluster applications, and implementing solutions in collaboration with the platform and IT teams.
- Further optimising and developing our graph-based task scheduler, improving the resiliency and efficiency of its scheduling decisions.
- Contributing to the design and implementation of the distributed architecture, with a focus on reliability and fault tolerance.
- Inspection and maintenance of software written by other members of the team.
- Providing and receiving regular, constructive feedback to and from your peers.
- Acting as a mentor for an exceptional intern or junior developer.
Requirements:
- Demonstrable expert-level skills as a software developer.
- Experience with at least one of our programming languages: Python, Java, C or C++.
- A good understanding of the different components of a Linux-based HPC cluster (network, filesystem, …) or, alternatively, experience with SRE.
- Excellent written and spoken business and technical English.
- Verification of your right to work in the respective location.
- Provision of applicable and relevant qualifications.
- Nationally approved criminal history check.
-
Site Reliability Engineer
6 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Chinasoft International (CSI), Full time
Role Description:
This is a full-time on-site role for a Site Reliability Engineer based in WP Kuala Lumpur. The Site Reliability Engineer will be responsible for maintaining system reliability and availability. Daily tasks will include troubleshooting issues, ensuring proper infrastructure setup, and...
-
Site Reliability Engineer
3 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
MindPec Solutions, Full time
Job Description:
We are seeking a highly skilled Site Reliability Engineer to join our team at MindPec Solutions. As a key member of our infrastructure team, you will be responsible for developing, deploying, and maintaining reliable, high-performance site infrastructure.
Your primary focus will be on implementing and managing comprehensive monitoring solutions...
-
Senior Site Reliability Engineer
3 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
MindPec Solutions, Full time
Role: Senior Linux Site Reliability Engineer
Client: IT Services
Working Mode: On site (Monday to Friday, 9 AM to 6 PM)
Job Type: Permanent
Experience: More than 3 years of experience as a Site Reliability Engineer or DevOps Engineer.
Applicants: Open to local candidates who can speak English and Chinese.
Job Description:
Develop, deploy, and maintain reliable,...
-
Site Reliability Engineer
6 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
PERSOLKELLY Malaysia, Full time
This job is for a Site Reliability Engineer focusing on network systems. You might like this job because you'll automate tasks, ensure services run smoothly with minimal downtime, and work with coding languages like Java or Python in a dynamic tech environment.
Job Description:
Ability to debug scripts and automate routine tasks in OS, network, database or...
-
Site Reliability Engineer
5 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Glints, Full time
Ready to elevate your career with a globally recognized professional services firm? We are seeking a skilled DevOps / SRE Specialist to join our team. You'll be at the forefront of transforming business challenges into cutting-edge technology solutions, working alongside diverse...
-
Site Reliability Engineering Director
3 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
SWIFT, Full time
About the Position:
We are seeking a highly skilled Senior Site Reliability Engineer to lead our team responsible for ensuring the reliability, uptime, and performance of our mission-critical systems. As a Senior SRE Manager, you will be responsible for providing technical leadership, mentorship, and guidance to your team members, as well as collaborating with...
-
Senior Site Reliability Engineer
4 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Swift Software, Full time
At Swift Software, we are dedicated to creating a culture that values diversity, intellectual curiosity, problem-solving, and openness. We encourage collaboration, big thinking, and risk-taking in a supportive, blame-free environment.
About the Role:
- Set annual objectives for team members.
- Conduct performance appraisals and provide constructive feedback to...
-
Site Reliability Engineering Leader
4 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Swift Software, Full time
As a Senior Site Reliability Engineering Manager at Swift Software, you will lead a team responsible for providing the platform for mission-critical systems to maintain constant uptime, scale seamlessly, and enable new applications and services to flourish.
About the Role:
- Recruit and retain engineers with diverse perspectives.
- Provide coaching, mentorship, and...
-
Site Reliability Expert
7 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
ServeDeck Innovation Sdn Bhd, Full time
Job Description:
We are seeking a highly skilled Site Reliability Engineer to join our team at ServeDeck Innovation Sdn Bhd. As a key member of our engineering team, you will be responsible for designing, managing, and optimizing our continuous integration and continuous delivery pipelines.
You will work closely with our development and support teams to ensure...
-
Site Reliability Engineer
5 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
PERSOLKELLY Malaysia, Full time
Achieve service excellence as a Site Reliability Engineer specializing in network systems! You will join a team that aims to deliver high-quality services through automation, reliability and efficiency.
Key Job Duties:
- Debug scripts and automate routine tasks in OS, network, database or application servers.
- Coding experience beyond simple scripts is required.
Our...
-
Site Reliability Systems Manager
6 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
WCC, Full time
Job Description:
- Sysadmin, IT Operations, and DevOps Expertise
- Distributed Production Load Management
- Continuous Integration, Continuous Deployment, and Continuous Improvement
We are seeking a skilled Site Reliability Engineer to support our product owners and DevOps team in determining which new features can be launched and when, using service-level agreements...
-
Site Reliability Specialist
5 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Glints, Full time
We are seeking a skilled DevOps/SRE specialist to join our team at Glints. This role will be at the forefront of transforming business challenges into cutting-edge technology solutions.
Key Responsibilities:
- Conduct research and analysis to develop innovative, technology-enabled business solutions.
- Support the design and delivery of digital solution...
-
Reliability Engineer
7 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Businesslist, Full time
Responsibilities:
The successful Reliability Engineer will be responsible for designing and implementing reliability solutions to improve product quality and reduce costs at Businesslist.
Key responsibilities include developing and implementing reliability engineering strategies, conducting failure mode and effects analysis (FMEA), and identifying...
-
System Reliability Engineer
3 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
InsiderSecurity, Full time
InsiderSecurity: A Cybersecurity Leader
We are a leading provider of cybersecurity solutions, dedicated to helping organizations protect themselves against ever-evolving threats. As a DevOps/Site Reliability Engineer, you will play a critical role in ensuring the reliability, quality, and time-to-market of our products.
About the Position:
Design, develop, and...
-
Reliability Engineering Expert
4 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Businesslist, Full time
We're looking for a seasoned Site Reliability Engineer to join our team at Businesslist. The ideal candidate will be skilled in managing and optimizing cloud infrastructure using Azure.
About the Role:
You'll be responsible for ensuring the smooth operation of our systems, with a focus on scalability and reliability.
Proficiency in Infrastructure as Code (e.g.,...
-
Reliability Engineer
3 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Exxon Mobil, Full time
At ExxonMobil, our vision is to lead in energy innovations that advance modern living and a net-zero future. As one of the world's largest publicly traded energy and chemical companies, we are powered by a unique and diverse workforce fueled by the pride in what we do and what we stand for.
The success of our Upstream, Product Solutions and Low Carbon...
-
Reliability Engineering Specialist
7 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Elsa Energy, Full time
Elsa Energy is seeking a skilled Reliability Engineering Specialist to enhance the reliability and efficiency of our physical asset management systems and processes.
The ideal candidate will possess a strong background in Six Sigma methodologies and a working understanding of artificial intelligence (AI) applications. Key...
-
Software Reliability Specialist
5 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Chinasoft International (CSI), Full time
We're looking for a skilled Software Reliability Specialist to join our team at Chinasoft International (CSI). This is a full-time on-site role based in Kuala Lumpur, where you'll play a crucial role in maintaining system reliability and availability.
Your daily tasks will involve troubleshooting issues, setting up infrastructure, and performing system...
-
System Reliability Engineer
6 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
Net2Source Inc., Full time
About the Position:
We are seeking a System Reliability Engineer to join our IT team in Kuala Lumpur, Malaysia. In this role, you will be responsible for ensuring the reliability and performance of our systems and applications.
Main Responsibilities:
- Designing and implementing system reliability solutions.
- Monitoring and analyzing system performance, identifying...
-
Site Engineering Manager
4 days ago
Kuala Lumpur, Kuala Lumpur, Malaysia
SPECIFIC DIMENSION SDN BHD, Full time
Job Overview:
We are seeking a highly skilled Site Engineer to join our team at SPECIFIC DIMENSION SDN BHD. As a Site Engineer, you will be responsible for overseeing the execution of construction projects, ensuring timely completion and quality delivery.
Responsibilities:
Project Planning and Execution: Develop and implement project plans, schedules, and budgets...