
Director, Reliability Engineering

Microsoft
United States, Washington, Redmond
Dec 20, 2024
Overview
Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding Cloud Infrastructure and is responsible for powering Microsoft's "Intelligent Cloud" mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's over 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive and the Microsoft Azure platform, globally with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.
As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability, and in improving the planning process, manufacturing, quality, delivery at scale, serviceability and sustainability. We are looking for a System Reliability Engineering Leader with a strong passion for customer-focused solutions, and with the insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.
We are looking for an experienced System Reliability Director who will be responsible for driving reliability performance across architecture, design, component and material selection, manufacturing and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation and operational aspects, along with the telemetry, diagnostics and SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.
Responsibilities
As a Director, Reliability Engineering, you will be responsible for the following:
Leading the Cloud System and Components Reliability Engineering organization with an ability to operate in a fast-paced environment, transforming ambiguity into clarity. Leading strategic innovations and developing processes which integrate industry practices to ensure scalability and efficiency to achieve high reliability and quality performance. Leading by example and coaching to inspire team members to grow and develop in the field of System and Components Reliability Engineering. Leading retrospectives and deep dives to drive root cause and corrective actions to prevent future escapes.
Combine technical and process expertise with in-depth understanding of cloud operations to optimize reliability solutions for future server and storage products.
Define, facilitate and manage integration of architecture, design, manufacturing, operation, troubleshooting and diagnostic methods to optimize cloud infrastructure reliability.
Participate in, and approve, mechanical, thermal, electrical, telemetry & diagnostic design reviews to ensure system reliability requirements are properly implemented.
Drive System Reliability Readiness of new cloud platforms landing in Microsoft Datacenters.
Support Hardware Systems Group development, deployment and sustaining teams from system concept to decommission. Work with cross-functional strategic teams on process optimizations and inter-related strategic initiatives.
Develop key metrics to evaluate the system reliability program's performance and build implementation plans to confirm our performance and compliance against program metrics and internal company requirements.