Join our mission
Intuit is a global technology platform that helps our customers and communities overcome their most important financial challenges. We help give over 50 million consumer, small business, and self-employed customers around the world the opportunity to prosper.
Come join the Site Reliability Engineering (SRE) team at Intuit's Consumer Group as a Staff Engineer, and be part of our mission to power prosperity around the world!
Site Reliability Engineers leverage software engineering and systems engineering skills to design and build large-scale consumer experiences that are reliable, operable, secure, highly available, disaster-ready, and performant. Join a world-class engineering team and use your programming and operations talents to apply the latest patterns in continuous deployment, progressive delivery, cloud operations, containerization, and serverless technology, helping us build next-generation platform capabilities and delight millions of customers.
We consider our people our most important asset, and we take the growth of our engineers seriously.
The ideal candidate is a Senior Tech Lead with extensive experience leading transformations in the SRE space, from operations to building platform capabilities, with a focus on technology and people. This includes building solutions for managing cascading failures, self-healing and high availability architectures, autoscaling, system resiliency, chaos testing, system performance as a whole, throttling, observability, AWS architecture, CI/CD, and canary deployments for consumer-facing products. Extensive experience leading peak war room activities and season readiness initiatives is mandatory. The ideal candidate is conversant not only with overall system design but also with microservice development and best practices. The engineer will be hands-on and will lead initiatives in the design, development, testing, maintenance, and documentation of system patterns and capabilities that can be leveraged across services at Intuit, using industry best practices. The engineer also mentors colleagues from a technical and growth perspective and continuously develops a strong talent pipeline.
What you'll bring
- BS/MS in Computer Science or equivalent
- 7+ years of solid experience with production operations managing large-scale and highly available systems.
- 3+ years of solid hands-on experience working on AWS (EC2, ALB, VPC, Route53, DynamoDB, RDS, IAM, etc.).
- 2+ years of solid hands-on experience building highly available systems on Kubernetes (Docker, ArgoCD, Prometheus, CNCF tools, AWS EKS).
- 2+ years of solid hands-on DevOps experience, including server provisioning automation with testing automation using AWS CFN or a CMS (Chef - preferred, Terraform - preferred, Ansible, or Salt).
- Experience with web and application servers such as Apache, NGINX - preferred, Envoy, HAProxy, Tomcat - preferred, or JBoss.
- Experience with setup and configuration of logging and monitoring tools (Prometheus - preferred, Splunk - preferred, Micrometer, AppDynamics, Wavefront, PagerDuty).
- Solid understanding of networking, including the TCP/IP stack and basic switching/routing concepts.
- Experience with CI/CD tools (ArgoCD - preferred, Jenkins - preferred, Spinnaker, or CodePipeline).
- Experience with big data stores, NoSQL, and RDBMS (DynamoDB - preferred, EMR - preferred, Vertica - preferred, Cassandra - preferred, RDS MySQL, HBase, or MongoDB).
- Experience with High Performance Computing (HPC) and Distributed File Systems (DFS) a plus.
- Must possess strong management, analytical, and organizational skills, along with excellent verbal and written communication skills, including a demonstrated ability to explain complex technical issues to both technical and non-technical audiences.
- Open-source contributions would be a plus.
How you will lead
- Contribute to the SRE roadmap, break down quarterly milestones, and drive deliverables with the right balance of run-the-business effort and new platform capabilities.
- Lead technology and skills transformation in the SRE domain by identifying trends/patterns in operational issues and building scalable automated solutions.
- Lead medium to large scrum teams to design and implement system-wide capabilities that increase the scalability, resiliency, and observability of massive-scale platforms.
- Drive season readiness and peak support activities efficiently for multiple products.
- Apply high availability and self-healing principles to architect software systems that align well with the product and the Intuit ecosystem.
- Conduct performance testing for the platform, focusing on responsiveness and optimal resource usage.
- Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
- Enable progressive rollout for platform changes via canary deployment and auto-rollbacks based on platform health.
- Contribute to cost and capacity management for various platform components, uncovering cost-saving opportunities and building automation to enforce them.
- Be hands-on with troubleshooting and root-cause analysis of incidents in both PROD and pre-PROD. Drive and own Root Cause Analysis (RCA) for specific applications.
- Work cross-functionally and collaborate with various Intuit teams, including product management, engineering teams, various product lines, and/or business units, to drive results.
- Act as the technical subject matter expert: mentor fellow engineers, demonstrate technical expertise, and lead a small team solving challenging programming and design problems.
- Build tools to enable platform consumers to troubleshoot and triage issues in a self-serve mode.
- Troubleshoot complex issues and manage stakeholder expectations during incidents.
- Participate in 12/7 on-call rotations along with the dev team.
- Support and coach other engineers through pair programming and peer code review, helping to ensure that all engineers are growing and part of a community.
- Exercise independent judgment in the selection of methods and techniques used to deliver operational solutions. Consider build, buy, and partnering alternatives in the selection process.
- Create formal internal and external networks outside of your own area of expertise to leverage and adopt ideas, technologies, and best practices that help the organization move fast.
- May influence organizational goals beyond a specific project.