THIS IS A W2 CONTRACT ONLY
Job ID#: 36061
Title: Site Reliability Engineer (SRE)
Duration: 3-6 months
Start Date: ASAP
Location: Charlotte, NC OR Oakland, CA
Onsite/Hybrid/Remote: Hybrid
Job Description:
We are seeking a highly motivated temporary worker with experience in Site Reliability Engineering to join our SiteOps Engineering team. As an SRE, you will report to the Senior Manager of Engineering and will be responsible for ensuring the reliability of our ecosystem (both Native and Web), contributing to overall change, incident, and problem management, and partnering with cross-functional teams to drive continuous improvement. This is an excellent opportunity for a technical contributor to evolve our technology through automation and reliable architecture, and to help increase velocity by collaborating across engineering to facilitate adoption of best practices.
Responsibilities:
- Contribute to overall change, incident, and problem management in our environment, with a focus on troubleshooting, fast restoration of our essential services, and prevention of future outages.
- Participate in a once-a-month 24x7 on-call rotation and take leadership of severe incidents to help minimize impact.
- Assist engineering teams by conducting truly blameless postmortems with focused action items to drive continuous improvement.
- Provide insights on trends in issues affecting reliability and partner on cross-functional projects to provide scalable solutions.
- Review and advise on high-risk platform changes to minimize impact to the site and maximize success for stakeholders.
- Work within a large distributed system based on Cloud Native services.
- Maintain an automation-centric vision and incorporate SRE methodologies to increase reliability and decrease toil.
- Create operating standards to help drive reliability.
Requirements:
- 5+ years of experience with Site Reliability Engineering with a focus on Infrastructure, Platform, and Application (Cloud, Containerization, Container Orchestration, Network, Application Reliability, Database Architecture) and an understanding of full-stack and SDLC (Software Development Life Cycle) practices in a DevOps or continuous-release environment.
- Experience in running critical incidents in a global or company-wide context, engaging with executives and senior leadership, and leading root cause analysis sessions.
- Experience running and monitoring applications at scale, using metrics and tracing tools such as New Relic, Datadog, Stackdriver, Zipkin, Prometheus, etc.
- Professional experience with Python, Go, or similar programming languages.
- Familiarity with SRE methodologies; passionate about solving operational challenges by using automation and software.
- Ability to communicate effectively, both in writing and verbally, vertically and horizontally within the organization.
- Ability to drive troubleshooting through a pragmatic and collaborative approach.
- Ability to construct clear and concise insights from data to promote and champion measurable improvements.
- Experience working with Cloud Native services in a Public Cloud, e.g. Google Cloud Platform, AWS, Azure.
DeWinter Group and Maris Consulting are equal opportunity employers, and all qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. We post pay scales, which are based on our client pay ranges. DeWinter, Maris, and our clients have the right to modify the requirements of the role, which can impact the pay ranges posted.