Site Reliability Engineer Remote Jobs

Lead Site Reliability Engineer

Workable - Marousi, Attica, Greece, Remote Hybrid

Design ● kubernetes

Workable is hiring a Remote Lead Site Reliability Engineer

For over 31,000 growing businesses and HR teams seeking a comprehensive, all-in-one HR suite, Workable emerges as the premier solution. We uniquely combine the world’s most widely adopted Applicant Tracking System (Workable Recruiting) with a full-spectrum employee management system (Workable HR). At Workable, we empower companies to focus on what truly matters: hiring the right people and fostering their growth.

While we take HR seriously, we maintain a lighthearted and collaborative culture. At Workable, you’ll find smart people who have fun, learn, innovate, and help others do the same. We respect everyone, we hire the best, and make sure every experience is special.

We’re growing fast, and we want to make sure that we scale from thousands to hundreds of thousands, so we’re looking for a Lead Site Reliability Engineer to join our SRE team.

Our product is built with a microservices architecture deployed on the Kubernetes platform. Our SRE team is responsible for deploying, monitoring, optimizing, and securing our cloud infrastructure and company software, both of which are rapidly expanding. Automation is at the core of what we do. If you love working with new technologies, open-source software, and solving complex problems on highly distributed systems, then this is the job for you! You will be part of a talented team of engineers who demonstrate superb technical competency, delivering mission-critical infrastructure and ensuring the highest levels of availability, performance, and security.

Responsibilities

As a Lead Site Reliability Engineer in this team with a focus on tools and automations, you will be responsible for the following:

Develop tools and automations to make operations and deployments simpler and more robust.
Operate, deploy, and monitor cloud services from development to production.
Work in a highly cross-functional team with Developers to design, release, and troubleshoot production systems.
Be responsible for the availability, scalability, and performance of our systems.
Troubleshoot issues, do capacity planning, and analyze system performance.
Guide the team to build, design, and deliver high-quality solutions in line with the best practices of the department
Lead projects across teams in the company to design and implement high-impact solutions that will improve development efficiency and optimize our infrastructure.

BS/MS degree in Computer Science, Engineering (or a proven strong background)
Excellent communication skills in English, particularly written communication.
Analytical and troubleshooting skills on large-scale distributed systems
Work independently and be able to deliver projects on time.
Experience in leading small teams and delivering large cross-functional projects
Passionate about building automations, DevOps practices, and exploring new technologies
Experience with at least one programming language (Go preferred)
10+ years of relevant work experience, including programming experience
Experience with Kubernetes platform and technology stack
Experience with a major cloud provider (GCP and AWS preferred)
Experience with configuration management and orchestration tools (e.g., Ansible, Terraform)
Experience with centralized logging, monitoring systems, and tooling frameworks
Familiarity with Relational and NoSQL (MongoDB, Redis, Elastic, etc.) databases
Deep knowledge of Linux systems
Oh, and if you're into DevOps technologies and the CNCF ecosystem, but have experience with other frameworks, please do apply. We value quality engineers, not the tools they've used.

Preferred qualifications:

Bonus: Networking skills, especially TCP/IP, HTTP, DNS and load balancers
Bonus: Good programming skills in Go
Bonus: Experience with remote working

Our employees enjoy benefits that make them more productive and contribute directly to the development of their professional skills. We want to be able to attract the best of the best and make sure they keep getting better. On top of an exciting, vibrant and intellectually challenging environment, we are offering:

An attractive salary and a bonus plan
Health insurance plan including dependents
Mobile data plan
Apple gear and access to the best productivity tools

Workable is most decidedly an equal opportunity employer. We want applicants of diverse background and hire without regard to colour, gender, religion, national origin, citizenship, disability, age, sexual orientation, or any other characteristic protected by law.

See more jobs at Workable

12d

SGS - Winnipeg | Calgary | Toronto, Canada, Remote

Bachelor degree ● Design ● ansible ● azure ● .net ● angular

SGS is hiring a Remote Site Reliability Engineer

Job Description

The Site Reliability Engineer will play a critical part in ensuring the reliability, supportability, scalability, and performance of our .NET stack applications built with ASP.NET MVC, Angular, and Web API.

Partner with developers and product operations teams to understand application requirements and translate them into operational practices.
Design, implement, and maintain infrastructure automation tools using Infrastructure as Code (IaC) methodologies.
Monitor application health and performance metrics, proactively identifying and resolving potential issues.
Implement incident response procedures to ensure timely resolution of outages and service disruptions.
Establish and improve best practices for product solution design / architecture, and development.
Participate in peer and team code reviews by developing comprehensive coding standards and guidelines to ensure consistency, maintainability, and quality in software development. By establishing clear protocols for code formatting, naming conventions, error handling, testing, and documentation, we can enhance code readability, reduce defects, and facilitate knowledge sharing among team members.
Collaborate with engineers to develop and implement disaster recovery plans.
Continuously improve monitoring and alerting processes to ensure efficient problem identification and resolution.
Stay up-to-date on the latest advancements in .NET infrastructure and SRE best practices.

Qualifications

Bachelor's degree required
Minimum 3 years of experience in a related technical role (e.g., Systems Administrator, Network Engineer) required
Experience with configuration management tools like Ansible, Puppet, or Chef preferred
Azure experience required
Familiarity with monitoring and alerting tools (.NET performance counters, Azure App Insight, Prometheus, Grafana) preferred
Ability to manage and coordinate multiple projects in a fast paced, highly professional environment.
While coding proficiency is not required, a strong understanding of the .NET ecosystem and a desire to delve into infrastructure and automation will be essential for success.
Strong understanding of system administration principles, including operating systems (Windows Server preferred) and networking concepts.
Ability to work independently and as part of a team

See more jobs at SGS

Senior Site Reliability Engineer - US/Canada

14d

DataVisor - Research Triangle Park, North Carolina, United States, Remote

mobile

DataVisor is hiring a Remote Senior Site Reliability Engineer - US/Canada

DataVisor is a next generation security company that utilizes industry leading unsupervised machine learning to detect fraudulent activity for financial transactions, mobile user acquisition, social networks, commerce and money laundering. Our solution is used by some of the largest internet properties in the world, including Pinterest, FedEx, AirAsia, Synchrony Financial, Zomato and Ping An, to protect them from the ever-increasing risk of fraud. Our award-winning software is powered by a team of world-class experts in big data, security, and scalable infrastructure. Our culture is open, positive, collaborative, and results driven. Come join us!

We are seeking a Senior Site Reliability Engineer (SRE) to join our growing team. The ideal candidate will have a passion for building reliable systems, experience with automation, and a solid understanding of large-scale distributed systems. You will work closely with the engineering team to improve reliability, scalability, and performance across our infrastructure.

You will report directly to the CTO and work with a team of seasoned engineers to automate, increase the reliability of, and enhance the security of our production environment. Projects include scaling our global, multi-cloud footprint, optimizing our large real-time decision platform, and improving the reliability of our global cloud footprint.

5+ years of experience with production environment running Linux

3+ years of experience working with cloud solutions such as AWS, Azure or Aliyun

Familiar with big data technology such as Spark and/or Flink

Love to automate tasks through coding and scripting

Experience with algorithms, data structures, complexity analysis and software design

Code well in Python, Java and Bash

Key Responsibilities:

Design, implement, and maintain release automation pipelines to streamline the deployment process.
Develop systems for proactive monitoring, auto-diagnosis, and incident resolution in production environments.
Work with big data platforms such as Apache Spark or Apache Flink, optimizing and scaling our data processing pipelines.
Perform maintenance and troubleshooting for databases, with preference for experience in Yugabyte, ClickHouse, and MySQL.
Ensure the reliability of cloud infrastructure using Kubernetes on AWS or GCP.
Participate in on-call rotation to ensure system reliability, with a focus on automation to minimize manual intervention.
Collaborate with engineering teams to improve system performance and manage capacity planning.

PREFERRED EXPERIENCE

Familiar with container technology such as Docker, Kubernetes
Experience with database system best practices on Yugabyte, ClickHouse, MySQL, etc.
Strong understanding of security best practices
Having completed a SOC 2/PCI certification in the past is a big plus

Health insurance
PTO and sick days
401K Plan

See more jobs at DataVisor

Lead Site Reliability Engineer

15d

Plum Fintech - Athens, Attica, Greece, Remote Hybrid

DevOPS ● redis ● terraform ● RabbitMQ ● postgresql ● kubernetes ● python

Plum Fintech is hiring a Remote Lead Site Reliability Engineer

At Plum, we're on a mission to maximise wealth for all. We’re making saving money effortless and turning investing into something everyone can do. Our journey began back in 2017, when we became one of the first to use artificial intelligence and automation to simplify personal finance. Fast forward to today, and we've already helped people save £2 billion across 10 European markets.

Named the UK's fastest-growing fintech in the Deloitte Technology Fast 50, our success is down to the passion and dedication of our diverse team. Based in our London, Athens and Nicosia offices, 170 talented people work together to empower people to do more with their money. And now, the team is growing!

The Role

You will be joining our Infrastructure squad as a Lead Site Reliability Engineer to ensure that Plum’s systems are resilient, secure, scalable, observable and fully capable of supporting our growth. You will support our Engineering function to use our infrastructure in the most efficient way. You’ll proactively identify areas for improvement and propose initiatives to streamline the SRE function and reduce its overhead.

What you will be doing:

Lead the SRE team in their daily work, provide mentoring, and grow their skills and careers
Identify initiatives to improve efficiency, raise the bar of the SRE function, prioritise the team’s work, and define a strategic vision aligned with the company’s goals
Be an advocate of cost management (FinOps) and be able to propose solutions to optimise our infrastructure
Be hands-on in daily work and contribute to initiatives owned by the team
Operate and scale our infrastructure (GCP, Kubernetes, PostgreSQL, RabbitMQ, Redis). We have terabytes of data that need to be blazing fast
Automate aspects of systems using infrastructure management tools of the trade (we use Terraform). Code once, deploy everywhere mindset
Ensure our metrics give an accurate picture of how the system is performing (we use Prometheus). Leverage observability in your day-to-day processes
Build and maintain SLIs and SLOs for our infrastructure; provide a platform for squads to build their SLIs and SLOs on top of collected metrics
Lead incident response and troubleshoot issues, correcting and improving systems to prevent incidents and grow at scale. Take point in handling service degradation
Collaborate with our Engineering function to deliver their craft into Plum infrastructure
Collaborate with the Principal Engineer to improve the Engineering function’s DevOps posture

For this role, we'd like to see:

Working experience of 5+ years as a Site Reliability Engineer, DevOps engineer, or in a similar position
Working experience of 2+ years leading an SRE squad to success
Proficiency in managing cloud infrastructure as code (IaC) with tools like Terraform
Ability to maintain the IaC codebase in an optimal and efficient way (clear codebase structure, Terraform modules, etc.)
Strong expertise in system architecture, networking, database management, administration of Kubernetes clusters
Strong expertise in observability (Logging, Monitoring, Tracing)
Analytical skills, troubleshooting attitude
Proactive approach to problems, able to identify them and propose solutions
Passion for continuous improvement and challenging the status quo
Excellent communication skills in English (verbal and written)

Good to have

Familiarity with RDBMS databases management and migration procedures with zero downtime
Having built an SRE team from scratch focusing on efficiency
Proven stakeholder management skills and the ability to negotiate priorities with internal teams
Experience in Python, ability to navigate large codebases

Plum's Perks

We're all in this together! Own part of the company through stock options
Annual training budget
Private Health & Life Insurance
Free Plum Premium subscription (normally £9.99 a month).
Free parking slots
25 days holiday a year, excluding public holidays
Employee referral scheme up to €4000
Flexible approach to remote working, though we encourage at least 2-3 days a week in our beautiful office in central Athens for optimal collaboration.
45 days work from anywhere
Team breakfast on Tuesdays and team lunch on Thursdays in the office, as well as a plentiful supply of fruit, snacks and coffee.
1 day paid leave for volunteering, supporting you giving back to society.
2 weeks paid sabbatical after four years of service.
Team trip to secret destinations once a year ✈️
Great office location in the heart of Athens (Syntagma square), with an amazing view!

If you think this sounds like a bit of you then don’t hesitate to get in touch!

Thanks,

Plum Team

* Plum is an Equal Opportunity Employer. Plum does not discriminate on the basis of age, race, religion, sex, gender identity, sexual orientation, non-disqualifying physical or mental disability, national origin or any other basis covered by appropriate law. All employment is decided on the basis of qualifications, merit and business need.

See more jobs at Plum Fintech

18d

Flywire - Boston, MA, Remote

S3 ● SQS ● EC2 ● Lambda ● TDD ● kotlin ● terraform ● Design ● ansible ● ruby ● java ● docker ● elasticsearch ● linux ● jenkins ● python ● AWS ● PHP

Flywire is hiring a Remote Senior Site Reliability Engineer

Job Description

The Opportunity:

We, at Flywire, are looking for an experienced engineer to join the Site Reliability Engineering team in North America to help drive reliability, automation and performance in our cloud-based infrastructure.

At Flywire the SRE team is responsible for the lifecycle of production systems and works across multiple levels. SREs are embedded in the development teams, enabling and empowering them to achieve full speed on shipping reliable and operable systems. They also work at a global scale, driving initiatives to achieve production excellence.

Design, build and maintain core infrastructure pieces with a focus on availability, latency, performance, and capacity. Reduce toil and improve deployment timelines (40%)
Be embedded on a development team, support them with daily tasks, help drive towards production excellence and advocate for best practices. (40%)
Be part of an on-call rotation. Debug production issues across services and levels of the stack and practice incident response and blameless postmortems. (15%)
Engage and collaborate with other disciplines in the design, deployment, operation, and optimization of services. (5%)

What’s next in our team?:

Pushing SRE practice to the next level
Keep fast release pace whilst keeping the environment secure
Migrate legacy custom deployments to managed cloud provider offerings (eg: RDS)
Keep pursuing full DevSecOps

Qualifications

Here’s What We’re Looking For:

5+ years of experience as an SRE or similar role. Experience as a Software Engineer or Systems Engineer is also valuable.
We are aiming to build a multidisciplinary and balanced team based on "t-shaped" individuals. As such we are looking for people comfortable with the idea of being or becoming a generalizing specialist.
Software engineering is an important part of our work; we actively use and support many different platforms and languages. Experience with at least one programming language is needed, and experience with testing techniques such as TDD or BDD will be highly valued.
Being familiar with the container ecosystem, cloud infrastructure, build systems and CI/CD tools is key to being successful in this role.
You will need to be comfortable taking ownership of complex systems challenges and help uncover opportunities for improvement.
At SRE we are enablers, we empower and encourage our fellow colleagues so you will need to have strong communication and collaboration skills, and most importantly, empathy.
Strong preference for candidates located near our geo-clusters and hubs in the following locations: Boston, New York City, Portland, Charlotte, Chicago, Austin, Dallas, Minneapolis, Kansas City, FL, PA, RI, & TN

Bonus:

Experience with PCI compliant infrastructure (particularly in a modern CICD environment)
Expertise with Infrastructure as Code tooling.

Technologies We Use:

These are some of the technologies that we use, but we are always learning, experimenting and open to change:
Ruby, Bash/Shell, Java, Kotlin, Go, Node, Python, PHP
AWS: EC2, ECS, Lambda, Cloudwatch, SQS, RDS, Kinesis, S3, ElasticSearch, DocumentDB
Linux, Docker, Terraform, Make, Chef, Ansible
Gitlab, Jenkins (CI/CD)
Sentry, Sumologic, New Relic, Grafana, OTEL (OpenTelemetry)

See more jobs at Flywire

27d

iRhythm - Remote

DevOPS ● Bachelor's degree ● terraform ● Design ● azure ● api ● qa ● java ● c++ ● docker ● kubernetes ● linux ● AWS

iRhythm is hiring a Remote Senior Site Reliability Engineer

Boldly innovating to create trusted solutions that detect, predict, and prevent disease.

Discover your power to innovate while making a difference in patients' lives. iRhythm is advancing cardiac care…Join Us Now!

At iRhythm, we are dedicated, self-motivated, and driven to do the right thing for our patients, clinicians, and coworkers. Our leadership is focused and committed to iRhythm’s employees and the mission of the company. We are better together, embrace change and help one another. We are Thinking Bigger and Moving Faster.

Position: Senior Site Reliability Engineer

Responsibilities include:

iRhythm Technologies, Inc. is seeking a Senior Site Reliability Engineer to provide services for Engineers and SQA to build, provision, deploy and manage their application stacks in AWS and other Cloud Platforms. Write, build and deploy services globally and at scale. Write tools to automate the entire development lifecycle. Work with toolsets including AWS API, Java, Go, configuration management tools and more. Drive standards, tooling and services that are used by internal teams for all aspects of running an application. Work closely with Developers, QA and Operations staff to design and build automated processes for application and database migration, ensuring scalability, reproducibility, auditability and traceability. Provide automation and support throughout the processes required to get a product built and released into test and production environments. Troubleshoot build and deploy issues, and facilitate resolution. Maintain and enhance the automated continuous integration and continuous delivery environment. Evaluate and recommend new tools to improve build and release processes, with a goal of zero downtime releases. Communicate status frequently to all stakeholders. Document any new or changed processes. Telecommuting permitted. 20% domestic and 20% international travel required.

SALARY: $155,958 to $166,200 per year.

JOB REQUIREMENTS:

Requires a Bachelor’s degree in Electronic Engineering, Communications Engineering, Computer Sciences, or a related field, and 4 years of networking and devops experience. Must have experience with: AWS or Azure; Linux system administration skills such as OS deployments, patching and management; Configuration management tools such as Terraform; Container management technologies such as Docker and Kubernetes for automated deployments; Networking concepts and protocols including 5G networks, DNS, TCP/IP, IPSEC SSL VPN and firewalls; and Incident, Change and Operations Management. Telecommuting permitted. 20% domestic and 20% international travel required.

THIS POSITION IS ELIGIBLE FOR THE EMPLOYEE REFERRAL PROGRAM

Actual compensation may vary depending on job-related factors including knowledge, skills, experience, and work location.

Estimated Pay Range

$155,958—$166,200 USD

As a part of our core values, we ensure a diverse and inclusive workforce. We welcome and celebrate people of all backgrounds, experiences, skills, and perspectives. iRhythm Technologies, Inc. is an Equal Opportunity Employer. We will consider for employment all qualified applicants with arrest and conviction records in accordance with all applicable laws.

iRhythm provides reasonable accommodations for qualified individuals with disabilities in job application procedures, including those who may have any difficulty using our online system. If you need such an accommodation, you may contact us at taops@irhythmtech.com

About iRhythm Technologies
iRhythm is a leading digital healthcare company that creates trusted solutions that detect, predict, and prevent disease. Combining wearable biosensors and cloud-based data analytics with powerful proprietary algorithms, iRhythm distills data from millions of heartbeats into clinically actionable information. Through a relentless focus on patient care, iRhythm’s vision is to deliver better data, better insights, and better health for all.

Make iRhythm your path forward. Zio, the heart monitor that changed the game.

See more jobs at iRhythm

28d

Invoca - Remote

salesforce ● c++ ● docker ● kubernetes ● linux

Invoca is hiring a Remote Senior Site Reliability Engineer

About Invoca:

Invoca is the industry leader and innovator in AI and machine learning-powered Conversation Intelligence. With over 300 employees, 2,000+ customers, and $100M in revenue, there are tremendous opportunities to continue growing the business. We are building a world-class SaaS company and have raised over $184M from leading venture capitalists including Upfront Ventures, Accel, Silver Lake Waterman, H.I.G. Growth Partners, and Salesforce Ventures.

About the team

Reliability Engineering is Invoca's foundation. We provide the infrastructure, tools, and observability for Invoca to build whatever is needed. We ensure stability today and enable growth for tomorrow.

We’re organized around three major needs:
- Consulting with development teams.
- Core service ownership and building the future of Invoca infrastructure.
- Research & development to keep our skills sharp and stay ahead of the industry.

The SRE Group is responsible for production uptime, observability, and platform reliability. Invoca takes a highly balanced approach to engineer on-call requirements and believes strongly in service ownership, allowing engineering teams to have autonomy and accountability for the amazing things they build.

The position’s reporting structure is:
Engineer -> Senior SRE Manager -> Director, SRE -> CTO -> CEO

About the Role

Our engineers are thoughtful, hard-working, friendly, and curious. We recognize that problem-solvers are everywhere and encourage you to apply if you:

Are curious, thoughtful, and seek to understand first
Understand and apply systems thinking in your day-to-day work
Operate with a customer-focused approach
Enjoy building trust & relationships with your team, your peers, and your colleagues throughout the organization
Understand reliability engineering principles and can advocate for better practices
Want to show up and solve problems

What you will do:

Provide observability for infrastructure and services across the Invoca platform including tools like Prometheus, Grafana, and Kibana
Provide Kubernetes as a service to development teams
Find new and better ways to scale our infrastructure in response to customer (internal and external) needs
Help enable multi-region and international presence to meet developer expectations
Participate in a one-week on-call rotation for services owned by your team
Solve challenging problems presented by the team and the business
Use metrics and your team’s collective experience to drive development decisions

Qualifications

3 years experience in an SRE (or equivalent e.g. sysadmin, software engineer) role
A background in Linux, Docker, and/or Kubernetes
Solid experience with configuration management and infrastructure as code
Critical thinking and problem solving
Exceptional communication skills
A strong sense of accountability

Salary, Benefits & Perks:

Teammates are eligible to begin receiving benefits on the first day of the month following or coinciding with one month of continuous employment. Below are some of our offerings:

Paid Time Off - Invoca encourages a work-life balance for our employees. We have an outstanding PTO policy, starting at 20 days off, for all full-time employees. We also offer 16 paid holidays, 10 days Compassionate Leave, 3 days volunteer time and more.
Healthcare - Invoca offers a health care program that includes medical, dental and vision coverage. There are multiple plan options to choose from so you can make the best choice for yourself, partner and family.
Retirement - Invoca offers a 401(k) plan through Fidelity with a company match of up to 4%.
Stock options - All employees are invited to ownership in Invoca through stock options.
Employee Assistance Program - Invoca offers well-being support on issues ranging from personal matters to everyday life topics through the WorkLifeMatters program.
Paid Family Leave - Invoca offers up to six weeks 100% paid leave for baby bonding, adoption, and caring for family members
Paid Medical Leave - Invoca offers up to twelve weeks 100% paid leave for childbirth and medical need
Sabbatical - We thank our long-term team members with an additional week of PTO along with a bonus after 7 years of service.
Wellness Subsidy - In further support of your well-being, Invoca provides a wellness subsidy that can be applied to a gym membership, fitness classes and more.
Position Base Range - $127,000.00 - $150,000.00/year, plus bonus potential

Recently, we’ve noticed a rise in phishing attempts targeting individuals who are applying to our job postings. These fraudulent emails, posing as official communications from Invoca, aim to deceive individuals into sharing sensitive information. These attacks have attempted to use our name and logo, and have tried to impersonate individuals from our HR team by claiming to represent Invoca.

We will never ask you to send financial information or other sensitive information via email.

DEI Statement

We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender, gender identity or expression, or veteran status. We are proud to be an equal-opportunity workplace.


See more jobs at Invoca

30d

Egnyte - Remote, India

Full Time ● DevOPS ● golang ● terraform ● Design ● ansible ● api ● docker ● kubernetes ● linux ● python

Egnyte is hiring a Remote Site Reliability Engineer

Description

Site Reliability Engineer

Mumbai, India

EGNYTE YOUR CAREER. SPARK YOUR PASSION.

Egnyte is a place where we spark opportunities for amazing people. We believe that every role has meaning, and every Egnyter should be respected. With 22,000+ customers worldwide and growing, you can make an impact by protecting their valuable data. When joining Egnyte, you’re not just landing a new career, you become part of a team of Egnyters that are doers, thinkers, and collaborators who embrace and live by our values:

Invested Relationships

Fiscal Prudence

Candid Conversations

ABOUT EGNYTE

Egnyte is the secure multi-cloud platform for content security and governance that enables organizations to better protect and collaborate on their most valuable content. Established in 2008, Egnyte has democratized cloud content security for more than 22,000 organizations, helping customers improve data security, maintain compliance, prevent and detect ransomware threats, and boost employee productivity on any app, any cloud, anywhere. For more information, visit www.egnyte.com.

Right now, we are looking for a Site Reliability Engineer. You will be ensuring reliability for large-scale software - we’re talking 22k+ customers, over 6000 instances across geo-distributed Data Centers and Cloud providers, as well as an average of 2k API requests per second as per New Relic. People who own their work from start to finish are integral to Egnyte’s success. Our engineers are part of the whole process: from design through coding and testing to the deployment and back again for further iterations. We are looking for a mid-level engineer eager to apply software development approaches to operations. You can, and will, touch every infrastructure level depending on the day and the project you are working on.

WHAT YOU’LL DO:

Maintain and monitor our environments in a 24/7 rotation system, partial night shift coverage
Improve our monitoring systems, identify repetitive tasks
Cooperate with international teams
Identify performance challenges
Document and communicate progress on resolving issues

YOUR QUALIFICATIONS:

Experience in an SRE/SysAdmin/DevOps or equivalent role - at least 4 years
Practical experience in managing Linux Operating Systems on the administrative level
Solid Monitoring & DevOps skills
Practical knowledge of container orchestration (Kubernetes, Docker)
Familiarity with at least one of the monitoring tools (e.g. Icinga, New Relic, Prometheus, Grafana, OpenTSDB)
Experience with public cloud services (GCP/AWS/Azure)
Coding skills in Python or Golang
Ability to work effectively in a globally distributed team structure
Drive to grow as a Site Reliability Engineer (we value open-mindedness and a can-do attitude)
Troubleshooting skills to hunt down the root causes of issues and persistence in preventing them from happening again
Experience handling large numbers of diverse systems with configuration management systems like Puppet, Ansible, Terraform
Solid English skills to effectively communicate with other team members (B2 level)

BONUS SKILLS:

Practical Experience using CI/CD tools like Jenkins.
Incident management experience
Experience with Linux HA solutions such as HAProxy

BENEFITS:

Competitive salaries
Company equity depending on role and level
Medical insurance and healthcare benefits for you and your family
Fully paid premiums for life insurance
Flexible hours and PTO
Mental wellness platform subscription
Gym reimbursement
Childcare reimbursement
Group term life insurance

COMMITMENT TO DIVERSITY, EQUITY, AND INCLUSION:

At Egnyte, we celebrate our differences and thrive on our diversity for our employees, our products, our customers, our investors, and our communities. Egnyters are encouraged to bring their whole selves to work and to appreciate the many differences that collectively make Egnyte a higher-performing company and a great place to be.

See more jobs at Egnyte

Site Reliability Engineer II

+30d

Oscar HealthRemote

Design ● c++ ● kubernetes

Oscar Health is hiring a Remote Site Reliability Engineer II

Hi, we're Oscar. We're hiring an Engineer II to join our SRE team.

Oscar is the first health insurance company built around a full stack technology platform and a focus on serving our members. We started Oscar in 2012 to create the kind of health insurance company we would want for ourselves—one that behaves like a doctor in the family.

About the role

As a core member of the Site Reliability Engineering team, you will build reliable and maintainable applications, infrastructure, and interfaces that make interacting with the health care system easier for members and providers. The goal of the team is to create scalable and highly reliable software systems with a focus on building and automating systems that are resilient, fault-tolerant, and self-healing, aiming to bridge the gap between development and operations teams.

You will report to the SRE Staff Engineer.

Work Location:

Oscar is a blended work culture where everyone, regardless of work type or location, feels connected to their teammates, our culture and our mission.

If you live within commutable distance to our New York City office (in Hudson Square), our Tempe office (off the 101 at University Dr), or our Los Angeles office (in Marina Del Rey), you will be expected to come into the office at least two days each week. Otherwise, this is a remote / work-from-home role.

You must reside in one of the following states: Alabama, Arizona, California, Colorado, Connecticut, Florida, Georgia, Illinois, Iowa, Kansas, Kentucky, Maine, Maryland, Massachusetts, Michigan, Minnesota, Missouri, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, Ohio, Oregon, Pennsylvania, Rhode Island, South Carolina, Tennessee, Texas, Utah, Vermont, Virginia, Washington, or Washington, D.C. Note, this list of states is subject to change.

Pay Transparency:

The base pay for this role is $144,000 - $189,000. You are also eligible for employee benefits, participation in Oscar’s unlimited vacation program, company equity grants and annual performance bonuses.

Responsibilities

Consistently write stable, correct, and maintainable code with little oversight; write modular, adaptable code with guidance.
Demonstrate mastery over common uses of available tools, frameworks, libraries, and infrastructure; strong knowledge of available libraries; make judicious choices over what code to write versus what code to import.
Manage Kubernetes clusters and their core components.
Work with partners, product managers, and designers to solve challenging problems.
Collaborate with other engineers on the team to improve technology and apply best practices.
Implement step-wise technical migrations of our existing services and applications.
Own small to medium features or infrastructure projects from technical design through completion with little required guidance.
Act as an independent contributor to the team. Work effectively across the codebase with appropriate guidance from code owners.
Make steady, well-paced progress without requiring frequent significant feedback from more senior engineers.
Compliance with all applicable laws and regulations
Other duties as assigned

Qualifications

2+ years of professional software engineering experience, working with a variety of technologies, with increasingly impactful accomplishments
Experience proposing, experimenting, and iterating, whether it be a new shiny technology or an arcane, ill-conceived data structure; our company may be new, but the health industry isn’t!
Experience making technical contributions and improving the quality of what you create, and excitement about building fault-tolerant, scalable software systems.
Demonstrates solid understanding of the practical application of CS concepts within their team.

Bonus Points

Experience with Hashicorp Vault.
Experience managing production Kubernetes clusters
Designing a Kubernetes based Platform as a Service

This is an authentic Oscar Health job opportunity. Learn more about how you can safeguard yourself from recruitment fraud here.

At Oscar, being an Equal Opportunity Employer means more than upholding discrimination-free hiring practices. It means that we cultivate an environment where people can be their most authentic selves and find both belonging and support. We're on a mission to change health care -- an experience made whole by our unique backgrounds and perspectives.

Pay Transparency: Final offer amounts, within the base pay set forth above, are determined by factors including your relevant skills, education, and experience. Full-time employees are eligible for benefits including: medical, dental, and vision benefits, 11 paid holidays, paid sick time, paid parental leave, 401(k) plan participation, life and disability insurance, and paid wellness time and reimbursements.

Reasonable Accommodation: Oscar applicants are considered solely based on their qualifications, without regard to applicant’s disability or need for accommodation. Any Oscar applicant who requires reasonable accommodations during the application process should contact the Oscar Benefits Team (accommodations@hioscar.com) to make the need for an accommodation known.

California Residents: For information about our collection, use, and disclosure of applicants’ personal information as well as applicants’ rights over their personal information, please see our Notice to Job Applicants.

See more jobs at Oscar Health

+30d

iManageRemote

Full Time ● agile ● terraform ● sql ● Design ● azure ● ruby ● c++ ● c# ● .net ● docker ● kubernetes ● linux ● python

iManage is hiring a Remote Site Reliability Engineer

Writing and designing automation, monitoring, diagnosing, and debugging tooling.

See more jobs at iManage

+30d

AcquiaRemote - Costa Rica

DevOPS ● 9 years of experience ● 6 years of experience ● 3 years of experience ● terraform ● drupal ● Design ● ansible ● azure ● ruby ● java ● kubernetes ● jenkins ● python ● AWS ● PHP

Acquia is hiring a Remote Senior Site Reliability Engineer

Acquia empowers the world’s most ambitious brands to create digital customer experiences that matter. With open source Drupal at its core, the Acquia Digital Experience Platform (DXP) enables marketers, developers, and IT operations teams at thousands of global organizations to rapidly compose and deploy digital products and services that engage customers, enhance conversions, and help businesses stand out.

Headquartered in the U.S., Acquia is positioned as a market leader by the analyst community and is listed as one of the world’s top software companies by The Software Report. We are Acquia. We are a global company with employees located in more than 30 countries, and we’re building for the future. We want you to be a part of it!

About the role:

As a Senior Site Reliability Engineer, you will be a key player in designing, implementing, and maintaining our CI/CD pipelines, cloud infrastructure, and monitoring solutions. Your expertise in tools like ArgoCD, Kubernetes, and cloud-native architecture will help us achieve operational excellence at scale. You will work closely with engineering teams to ensure they have the right infrastructure in place to deploy rapidly, safely, and reliably.

This is a hands-on role for someone who thrives in an environment where automation is the goal, reliability is the baseline, and scalability is second nature. You won’t just be maintaining systems—you’ll be innovating, designing new ways to make our infrastructure smarter and our development faster.

Job Responsibilities:

CI/CD Pipeline Mastery: Design, build, and optimize continuous integration and continuous deployment (CI/CD) pipelines using ArgoCD, Jenkins, or similar tools. Ensure zero-downtime, fully automated deployment pipelines.
Infrastructure as Code (IaC): Build and manage scalable, reliable infrastructure using Terraform, Kubernetes, and other IaC tools. Ensure everything is automated—from deployments to monitoring—so that infrastructure becomes a self-service platform.
Cloud Expertise: Architect and manage cloud environments (AWS, GCP, or Azure), focusing on cost optimization, scalability, and performance. Implement disaster recovery, fault tolerance, and high availability strategies.
Monitoring and Alerting: Implement comprehensive monitoring solutions using Prometheus, Grafana, ELK, and Datadog to detect and resolve performance bottlenecks before they impact customers. Design and implement automated alerts for proactive system health monitoring.
DevOps Advocacy: Champion the culture of DevOps across teams—promote best practices, encourage adoption of new technologies, and drive a continuous learning mindset within the engineering teams. Be the go-to person for CI/CD, infrastructure scaling, and deployment automation.
SRE Mindset: Focus on building systems that are resilient by design, automating processes that improve reliability, and implementing Service Level Objectives (SLOs) to align engineering efforts with operational goals.
Security-First Approach: Collaborate with security teams to implement robust security practices, from container security to infrastructure hardening. Automate security checks within the pipeline for compliance and vulnerability management.
Collaboration with Engineering Teams: Work hand-in-hand with product development teams to understand their needs, integrate CI/CD practices into their workflows, and provide a fast, reliable, and secure path from code to production.

Skills:

BS in Computer Science or a comparable field of study, or equivalent practical experience.
Experience working with one or more of: Go, Python, Ruby, PHP, Java or JavaScript.
Experience with Unix/Linux systems administration using the CLI.
Fundamental understanding of TCP/UDP networking concepts
Solid oral and written communications skills.
CI/CD Expertise: Extensive hands-on experience with CI/CD tools such as ArgoCD, Jenkins, CircleCI, or GitLab CI. Ability to design and implement pipelines that ensure rapid, reliable deployments.
Kubernetes Guru: Strong understanding and experience with Kubernetes, Helm, and container orchestration. Ability to scale and manage microservices in production.
Cloud Mastery: Proficient in at least one major cloud provider—AWS, GCP, or Azure. Experience with multi-cloud or hybrid-cloud architecture is a plus.
IaC Champion: Proficiency in Terraform, Ansible, or CloudFormation to manage infrastructure as code. Familiarity with GitOps workflows and version-controlled infrastructure.
Monitoring & Observability: Strong experience with monitoring tools like Prometheus, Grafana, Datadog, ELK, or New Relic. Ability to build custom dashboards and alerting systems.
Security-Focused: Deep understanding of security best practices in DevOps, including container security, CI/CD pipeline security, and cloud infrastructure hardening.
Problem Solver: Excellent troubleshooting skills with the ability to diagnose issues across a variety of environments, from code to infrastructure.
Collaboration Skills: Ability to work effectively in cross-functional teams, influencing peers and driving adoption of best practices across the organization.

Preferred Qualifications:

5-9 years of hands-on experience as a DevOps Engineer, SRE, or related role in a cloud-native environment.
Proven experience mentoring junior team-members.
Deep knowledge of CI/CD pipelines, especially using ArgoCD or similar tools.
Proven expertise in cloud platforms (AWS, GCP, Azure), with experience building and managing scalable, reliable infrastructure.
Strong coding skills in Python, Go, or Ruby.
Experience with service mesh architectures like Istio or Linkerd is a plus.
SRE Certification (or equivalent experience) is a bonus.
Certified Kubernetes Administrator (CKA) is preferred.
A passion for automation, observability, and reliability.

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.

See more jobs at Acquia

Staff Site Reliability Engineer

+30d

AcquiaRemote - Costa Rica

Acquia is hiring a Remote Staff Site Reliability Engineer

About the role:

As a Staff Site Reliability Engineer, you will be a key player in designing, implementing, and maintaining our CI/CD pipelines, cloud infrastructure, and monitoring solutions. Your expertise in tools like ArgoCD, Kubernetes, and cloud-native architecture will help us achieve operational excellence at scale. You will work closely with engineering teams to ensure they have the right infrastructure in place to deploy rapidly, safely, and reliably.

Job Responsibilities:

CI/CD Pipeline Mastery: Design, build, and optimize continuous integration and continuous deployment (CI/CD) pipelines using ArgoCD, Jenkins, or similar tools. Ensure zero-downtime, fully automated deployment pipelines.
Infrastructure as Code (IaC): Build and manage scalable, reliable infrastructure using Terraform, Kubernetes, and other IaC tools. Ensure everything is automated—from deployments to monitoring—so that infrastructure becomes a self-service platform.
Cloud Expertise: Architect and manage cloud environments (AWS, GCP, or Azure), focusing on cost optimization, scalability, and performance. Implement disaster recovery, fault tolerance, and high availability strategies.
Monitoring and Alerting: Implement comprehensive monitoring solutions using Prometheus, Grafana, ELK, and Datadog to detect and resolve performance bottlenecks before they impact customers. Design and implement automated alerts for proactive system health monitoring.
DevOps Advocacy: Champion the culture of DevOps across teams—promote best practices, encourage adoption of new technologies, and drive a continuous learning mindset within the engineering teams. Be the go-to person for CI/CD, infrastructure scaling, and deployment automation.
SRE Mindset: Focus on building systems that are resilient by design, automating processes that improve reliability, and implementing Service Level Objectives (SLOs) to align engineering efforts with operational goals.
Security-First Approach: Collaborate with security teams to implement robust security practices, from container security to infrastructure hardening. Automate security checks within the pipeline for compliance and vulnerability management.
Collaboration with Engineering Teams: Work hand-in-hand with product development teams to understand their needs, integrate CI/CD practices into their workflows, and provide a fast, reliable, and secure path from code to production.

Skills:

BS in Computer Science or a comparable field of study, or equivalent practical experience.
Experience working with one or more of: Go, Python, Ruby, PHP, Java or JavaScript.
Experience with Unix/Linux systems administration using the CLI.
Fundamental understanding of TCP/UDP networking concepts
Solid oral and written communications skills.
CI/CD Expertise: Extensive hands-on experience with CI/CD tools such as ArgoCD, Jenkins, CircleCI, or GitLab CI. Ability to design and implement pipelines that ensure rapid, reliable deployments.
Kubernetes Guru: Strong understanding and experience with Kubernetes, Helm, and container orchestration. Ability to scale and manage microservices in production.
Cloud Mastery: Proficient in at least one major cloud provider—AWS, GCP, or Azure. Experience with multi-cloud or hybrid-cloud architecture is a plus.
IaC Champion: Proficiency in Terraform, Ansible, or CloudFormation to manage infrastructure as code. Familiarity with GitOps workflows and version-controlled infrastructure.
Monitoring & Observability: Strong experience with monitoring tools like Prometheus, Grafana, Datadog, ELK, or New Relic. Ability to build custom dashboards and alerting systems.
Security-Focused: Deep understanding of security best practices in DevOps, including container security, CI/CD pipeline security, and cloud infrastructure hardening.
Problem Solver: Excellent troubleshooting skills with the ability to diagnose issues across a variety of environments, from code to infrastructure.
Collaboration Skills: Ability to work effectively in cross-functional teams, influencing peers and driving adoption of best practices across the organization.

Preferred Qualifications:

8-13 years of hands-on experience as a DevOps Engineer, SRE, or related role in a cloud-native environment.
Proven experience mentoring junior team-members.
Deep knowledge of CI/CD pipelines, especially using ArgoCD or similar tools.
Proven expertise in cloud platforms (AWS, GCP, Azure), with experience building and managing scalable, reliable infrastructure.
Strong coding skills in Python, Go, or Ruby.
Experience with service mesh architectures like Istio or Linkerd is a plus.
SRE Certification (or equivalent experience) is a bonus.
Certified Kubernetes Administrator (CKA) is preferred.
A passion for automation, observability, and reliability.

See more jobs at Acquia

Senior Site Reliability Engineer (Bridge) HUN, Budapest, Remote

+30d

LTGBudapest, HU - Remote

Lambda ● jira ● terraform ● slack ● ruby ● typescript ● kubernetes ● AWS

LTG is hiring a Remote Senior Site Reliability Engineer (Bridge) HUN, Budapest, Remote

People Matter Most!

We are a global team of Engineers, Product Managers, Designers, and Program Managers across Hungary, the US, and many other countries. We help our customers create work cultures people love.

About the Product

GetBridge was founded to define, develop, and deploy world-class, easy-to-use software; and that’s what we do and will keep on doing. We make better, more usable tools for teaching, learning and career management, stuff people will actually use. Are you interested?

So here are our questions to you:

Do you have a “Challenge Accepted” attitude?

You belong with us, if you are:

A problem solver who asks questions to get at the core issue the team is grappling with before deciding on a solution and a pragmatist who knows how to make trade-offs to solve challenges while building an architecture that scales for the future.
An owner who is capable of leading and delivering complex projects involving multiple teams while also caring about cloud operations for dozens of services across multiple regions, environments, and language stacks.
A builder who loves implementing automation to reduce toil and enable healthy systems by default and building tools and resources for upskilling other engineering teams to make service creation and maintenance self-service.
A watcher who likes configuring observability systems to identify incidents before they happen, respond to incidents, and contribute to a continuous improvement culture with occasional participation in 24/7 on-call rotations.
A learner who loves to learn new things and for whom self-improvement is encoded in their DNA.
A mentor who supports the development and growth of their colleagues.

Knowledge is power; are you armored?

Here’s our tech stack - what you will learn:

At least one modern programming language (Java/Kotlin, Ruby, React & TypeScript)
Cloud-based providers (AWS, Kubernetes, Aurora, EKS, Lambda, Pulsar and Apigee)
Cloud networking configuration (VPCs, security groups, load balancers, DNS, etc).
Configuration as code (Terraform)
System observability (Datadog, Sentry)
CI/CD: GitHub, Spinnaker
CMO: SAFe, JIRA, Confluence, Slack, GSuite

Do you like things to be in balance?

Our offer focuses on your:

Healthy work-life balance: We have a great office in Allee Corner where you are welcome, but there is no mandate to come in on a regular basis. Our employees enjoy the freedom to manage their working hours.
Personal growth: We want to bring out the best in you through learning days, quarterly hack weeks, LinkedIn Learning, mentorship, a career development plan, and training opportunities from the first day.
Financial stability: We offer you a competitive salary package (1.7-2.1m gross / month depending on your seniority), a bonus (based on the performance of the company), a comprehensive healthcare package provided by Medicover, a SZÉP card, and other fringe benefits.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, colour, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class.

See more jobs at LTG

+30d

Hack TheAlimos,Attica,Greece, Remote Hybrid

terraform ● Design ● mobile ● docker ● kubernetes ● python ● AWS

Hack The is hiring a Remote Site Reliability Engineer

Ready to embark on the quest of joining Hack The Box?

At the end of this thrilling journey, you'll become a proud member of Hack The Box, with the ultimate mission to help cybersecurity professionals and organizations enhance their cyber-attack readiness. Get ready for an exciting adventure into the world of cybersecurity!

The Core Mission of the Site Reliability Engineer (SRE):
As a Site Reliability Engineer at Hack The Box, your paramount mission is to assist with the seamless migration to AWS, strategically positioning our infrastructure to scale effectively with the company. Over the next 6 months, you will participate in enhancing our capabilities for expansion, setting the stage for the addition of new systems such as Kubernetes clusters, Services, and Databases. Additionally, your focus will shift towards establishing key performance indicators, service level objectives, and incident response metrics to drive a culture of reliability and continuous improvement.

The Fellowship You'll Be Joining:
You’ll join a team of 4 SREs, while collaborating closely with engineers, data scientists, and security experts. Finally, you will report directly to the SRE Lead and will have open communications with infrastructure department management and other high-caliber technical people across the organization.

Technology Tools & Weapons You'll Be Using:

Infrastructure as Code (Terraform): Automate the provisioning of AWS resources.
Containerization and Orchestration (Kubernetes, Flux CD): Ensure seamless deployment and scaling of applications.
Monitoring and Logging (Prometheus, Mimir, Grafana, Loki): Expand monitoring capabilities for new systems.
Automation and Scripting (Go, Python, etc): Scripting for efficient and automated processes.
Cloud Platforms (AWS): Execute the migration plan with a focus on AWS.

The Adventures That Await You After Becoming a Site Reliability Engineer at Hack The Box:

Heavily contribute to the AWS Migration for Scalability: Spearhead the migration from the current cloud provider towards AWS, strategically positioning our infrastructure for scalable growth across regions.
Expand Monitoring Stack: Integrate new systems into the Monitoring Stack, enhancing visibility and alerting capabilities for a globally distributed architecture.
Architectural Design for Reliability: Contribute to the design and implementation of reliable AWS infrastructure, focusing on fault tolerance and high availability.
Establish Metrics Framework: Implement and manage Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and improve system reliability.
Incident Response Enhancement: Develop and enhance incident response processes, leveraging metrics to continually improve response times and effectiveness.
Mentorship: Mentor and guide junior SREs in adapting to the AWS environment and implementing reliability best practices.
Collaborative Planning: Work closely with cross-functional teams to plan and implement new systems effectively, ensuring alignment with reliability goals.
Team Expansion: Play a key role in the team's expansion, contributing to the mentoring of junior members.
Best Practices Advocacy: Champion best practices in AWS architecture and SRE methodologies, fostering a culture of reliability and continuous improvement.

Skills, Knowledge, and Experience Points Required to Unlock the Role of SRE at Hack The Box:

Hands-on Experience: Minimum 2 years of hands-on experience in site reliability engineering or a related field.
Automation Skills: Proficient in scripting and automation using languages such as Go, Python or Bash.
Cloud Expertise: In-depth knowledge of cloud platforms, particularly AWS.
Containerization: Experience with containerization technologies (Docker) and orchestration (Kubernetes).
Monitoring Mastery: Strong expertise in implementing and managing monitoring and logging solutions.
Metrics Framework: Proven experience establishing and managing SLAs, SLOs, and SLIs.
Problem Solving: Proven ability to troubleshoot complex system issues and implement effective solutions.
Collaborative Mindset: Excellent collaboration and communication skills, with a strong ability to work cross-functionally and mentor junior team members.

What your Hack The Box adventure will have in store:

You'll have the exhilarating opportunity to contribute to a product that is highly appreciated by users and the cybersecurity community at large.
You'll experience a highly supportive and caring environment, fostering growth, flexibility, and autonomy.
You'll embark on an exciting journey of continuous learning and problem-solving, leveling up as our organization grows.
Most importantly, you'll have a blast at HTB because fun is an essential ingredient in our recipe for success! Just wait until you see our global meet-ups!

The gems you’ll be enjoying as a Site Reliability Engineer:

Private insurance
25 annual leave days
Dedicated budget for training and professional development, participation in conferences
State-of-the-art equipment (Macbook, iPhone, and mobile plan)
Free lunch & snacks at the office
Full access to the Hack The Box lab offerings; so you can learn how to hack
Flexible/Hybrid working

The Quest of Becoming Hack The Box’s Site Reliability Engineer:

Level 1: To complete level one’s objective, submit your application.
Level 2: Meet the Talent Acquisition team. Level’s objective: highlight your past achievements, ambitions, and values.
Level 3: Meet the hiring team. Level’s objective: connect with the hiring team and share with them your achievements.
Level 4: Complete an assignment that aligns with day-to-day job-related tasks and responsibilities. Part of the assignment is discussing it with the hiring team in a debriefing session, in order to walk the team through your thinking process.
Level 5: Congratulations! Not many reach this level. Level’s objective: have a constructive, final conversation with senior leadership to explore the role and your future at HTB.
Level 6: You've officially received an offer from HTB! To complete the last level and the Quest, all you need to do is accept the offer.
Quest complete. Congratulations, you’re officially one of us. Your next quest: complete the onboarding.

Hack Your Career, Today. Join us in this epic adventure of cybersecurity at Hack The Box!

At Hack The Box, we are on a quest to find the most exceptional and enthusiastic talent to join our team. Whether or not you consider yourself a gamer, we value what makes you unique and want to know more about you. This job post provides just a glimpse of the incredible gamified experience our business and consumer customers enjoy through our platforms. So, if you're ready to embark on a journey of growth and adventure, we can't wait to meet you!

ABOUT HACK THE BOX

Hack The Box is the Cyber Performance Center with the mission to provide a human-first platform to create and maintain high-performing cybersecurity individuals and organizations. Hack The Box is the only platform that unites upskilling, workforce development, and the human focus in the cybersecurity industry, and it’s trusted by organizations worldwide for driving their teams to peak performance. Offering an all-in-one environment for continuous growth, assessment, and recruitment, Hack The Box provides solutions for all cybersecurity domains.

Launched in 2017, Hack The Box brings together the largest global cybersecurity community of more than 2.6 million platform members. Rapidly growing its international footprint and reach, Hack The Box is headquartered in the UK, with additional offices in the US, Australia, and Greece.

Exciting News:

We are super proud to share that all three HTB entities across the UK, US, and Greece have been Certified as a Great Place to Work (Oct 2023-Oct 2024).
Furthermore, HTB's Greek entity has been listed by the Great Place to Work Institute as the #4 Best Workplace in Greece and #7 in Europe for 2023, among more than 3,300 companies.
Get more insights about our HTB culture and employee experience by visiting our career site and Glassdoor.

At Hack The Box, we are committed to fostering a diverse, inclusive, and equitable workplace. We believe that diversity enriches our performance, services, and the communities we serve. As such, we ensure that all job applications are considered solely based on merit, skills, and qualifications. We do not discriminate on grounds of race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. We are dedicated to providing a fair and respectful work environment that reflects our values.

See more jobs at Hack The

Senior Site Reliability Engineer (Turkey)

+30d

SezzleTürkiye, Remote

Sales ● DevOPS ● Bachelor's degree ● terraform ● sql ● Design ● c++ ● docker ● kubernetes ● linux ● python ● AWS

Sezzle is hiring a Remote Senior Site Reliability Engineer (Turkey)

The salary range for this role is $5,000 - $9,200 per month (Gross in USD)

About Sezzle:

With a mission to financially empower the next generation, Sezzle is revolutionizing the shopping experience beyond payments, blending cutting-edge tech with seamless, interest-free installment plans that make shopping smarter and more accessible. We’re not just transforming payments; we’re redefining how people discover, interact with, and purchase the things they love while driving real impact on merchant sales through increased conversions and higher order values. As we continue to shape the future of fintech and retail, we’re building an innovative, dynamic team passionate about creating more than just a transaction but a truly unique shopping journey. If you’re excited about pushing boundaries in tech and delivering a game-changing experience for consumers and merchants alike, come join us at Sezzle and help create the future of shopping!

About the Role:

We are seeking a talented and motivated Senior Site Reliability Engineer who is best in class with a high IQ plus a high EQ. This role presents an exciting opportunity to thrive in a dynamic, fast-paced environment within a rapidly growing team, with abundant prospects for career advancement. In this role you will work on our core Infrastructure and Security team to help design, build, run, improve and scale the infrastructure that engineering and data teams use to power their services. Your duties will include the development, testing, and maintenance of our serving and data platforms, using a combination of cloud products, open source tools and internal applications. Your duties will blend software development and operations in order to continuously automate our environments. You should be able to build high-quality, scalable solutions for a variety of problems.

Compensation

Sezzle is a remote U.S.-based company listed on NASDAQ. Our salary ranges are as follows:

Senior: $7,000 - $9,200 USD per month

Responsibilities:

Design, build and maintain scalable infrastructure for running our systems, based on Kubernetes, Redshift and additional AWS services and products.
Help the product teams quickly build out MVP products to test new solutions on the market.
Maintain and develop monitoring and alerting solutions to improve the on-call experience.
Assist product developers in debugging and triaging production issues.
Be the first line of defense for our operational environments, triaging and resolving problems as they occur. You will be on an on-call rotation.
Design and scale platform and data architectures to sustain rapid user growth.
Level up the teams through pairing, code review, and mentoring.
Bring and share with our team extensive experience with industry best practices in software development.

Minimum Requirements:

Bachelor's in computer science (preferred) or equivalent related experience
At least 5 years of overall software, data, deployment and platform infrastructure experience.

Ideal Skills & Experience:

Experience with building and/or serving REST APIs using Go or a similar language.
Experience with Relational Databases, SQL and ORM technologies.
Strong overall Linux knowledge.
DevOps experience with CI/CD pipelines, Docker and Kubernetes, and cloud computing platforms like AWS.
Experience with deployment/provisioning tools like Terraform, Helm, Ansible.
Experience with implementing and maintaining observability and monitoring tools - Prometheus, Datadog, NewRelic, Grafana, Loki or similar.
Experience in ETL/ELT pipelines using Python and Open-source tools such as DBT.
Proficiency in building and maintaining large-scale data warehousing technologies such as Redshift.

Sezzle’s Technology Stack:

Languages: Golang, Typescript, Python
Frontend: Typescript - React and React Native
Backend: Golang
Database: MySQL, Postgres, Elasticsearch
DevOps & Cloud: AWS, Kubernetes
Version Control: Git
CI/CD: Gitlab
Testing: Developer-driven, focus on automated unit, integration, and end-to-end tests
Sezzle is focused on using open source, and we build what we can before buying!

About You:

You have relentlessly high standards - many people may think your standards are unreasonably high. You are continually raising the bar and driving those around you to deliver great results. You make sure that defects do not get sent down the line and that problems are fixed so they stay fixed.
You’re not bound by convention - your success, and much of the fun, lies in developing new ways to do things.
You need action - speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk-taking.
You earn trust - you listen attentively, speak candidly, and treat others respectfully.
You have backbone; disagree, then commit - you can respectfully challenge decisions when you disagree, even when doing so is uncomfortable or exhausting. You have conviction and are tenacious. You do not compromise for the sake of social cohesion. Once a decision is determined, you commit wholly.
You deliver results - you focus on the key inputs and deliver them with the right quality and in a timely fashion. Despite setbacks, you rise to the occasion and never settle.

What Makes Working at Sezzle Awesome:

At Sezzle, we are more than just brilliant engineers, passionate data enthusiasts, out-of-the-box thinkers, and determined innovators. We believe in surrounding ourselves with only the best and the brightest individuals. Our culture is not defined by a certain set of perks designed to give the illusion of the traditional startup culture, but rather, it is the visible example living in every employee that we hire.

See more jobs at Sezzle

Principal Site Reliability Engineer

+30d

ScienceLogicReston, VA or Remote

DevOPS ● agile ● remote-first ● terraform ● Design ● mobile ● linux ● python ● AWS

ScienceLogic is hiring a Remote Principal Site Reliability Engineer

*This position can be remote within the United States*

Who we are...

In a world of constant change, we're leading the charge towards truly autonomous enterprises. Our cutting-edge platform harnesses the power of automation and generative AI to revolutionize how businesses manage and optimize their IT operations.

We're not just adapting to digital transformation—we're accelerating it. Our solutions bring business and operations leaders together, unlocking new levels of innovation, efficiency, and scalability. We empower organizations to deliver superior customer experiences and drive revenue growth in an always-on, always-mobile world.

At ScienceLogic, we're building the foundation for Autonomic IT—a future where IT operations are self-healing, self-optimizing, and aligned perfectly with business objectives. Our team of visionaries is reshaping the $18+ billion IT operations market, creating cost-optimized, efficient, and next-level capabilities for enterprises worldwide.

ScienceLogic is going through a product transformation and the Site Reliability Engineering (SRE) team is at the forefront of it. We are responsible for the design, deployment, and maintenance of the Cloud Infrastructure used for running the company’s revenue-generating, go-forward SaaS product line. Overall, we’re passionate about automation and solving complex business and technology challenges. Our team combines SRE, DevOps, Software Development and Information Security knowledge to help make Cloud operations agile and elastic within the boundaries of the security and governance framework.

What we’re looking for…

We are looking for a Principal Site Reliability Engineer who is well versed in building cloud technologies in a secure manner, has an automation mindset and is an ardent follower of the SRE discipline. If this sounds like you, then our team will benefit from your skillset!

What you’ll be doing…

Enhance the company’s SaaS infrastructure security protocols.
Collaborate across the organization to design, build and operationalize SaaS services conforming to various security standards like FedRAMP, SOC2, ISO etc.
Participate in architecture, security, and operations reviews.
Lead design reviews and buildout of secure systems for delivering various SaaS services with 99.99% uptime.
Design, automate, test, and monitor the use of cloud native technologies as a foundation for a service platform.
Investigate and resolve customer and operational issues with the mentality of fixing and not just mitigating issues.
Identify and automate measurement of operations SLAs and SLOs.
Triage incident response, document SOPs and runbooks, and train NOC team members.
Write automation that can be easily supported and extended by others.
Work on special projects as assigned.

Qualities you possess…

Here at Site Reliability, we believe that if you are hungry for learning, passionate about technology and like building tools, then you are a good fit. Experience with the following skills is an added plus:

Must be a U.S. Citizen.
7-10 years of site reliability engineering or cloud operations experience or equivalent experience.
Proven track record of operating production SaaS environments within security standards like FedRAMP, SOC2, ISO, PCI.
Bachelor's or Master's degree in Computer Science, Information Systems or similar field.
Skilled at problem solving, algorithms, and data structures conforming to the modern SaaS security requirements.
Building tools and scripting frameworks from scratch.
Working with Cloud Automation tools like CloudFormation, Terraform, CDK, aws-cli.
Scripting languages like Python, Groovy, PowerShell, Bash, Perl etc.
Exposure to Windows and Linux administration skills.
Familiarity with basic networking, security and cloud engineering concepts.
Highly collaborative with effective written and verbal communication skills.
Ability to work against tight deadlines and occasionally after hours as part of on-call scheduling.
Occasionally work during off-hours and participate in a weekly on-call schedule.
Take full responsibility for the availability and performance of the platform.

Benefits & Perks

A remote-first culture - work from home or come into the office, it's totally up to you.
Comprehensive medical, dental and vision plans.
401(k) plan with employer match.
Flexible Paid Time Off (FTO) so that you can take the time that you need to re-energize.
Volunteer Time Off (VTO) - take two days off per calendar year to volunteer with your preferred charitable organization.
5-year Service Milestone Sabbatical.
Paid parental leave.
Generous employee referral bonus program.
Pet insurance.
HQ Office centrally located in Reston Town Center featuring a well-stocked kitchen with rotating snacks and beverages, and catered lunch on Thursdays.
Regular virtual company-wide events, including cooking classes, yoga, meditation and more.
The opportunity to learn and develop from some of the best and brightest minds in the industry!

Don’t meet every single requirement? Studies have shown that women and people of color are less likely to apply to jobs unless they meet every single qualification. At ScienceLogic, we are dedicated to building a diverse, inclusive and authentic workplace, so if you’re excited about this role but your past experience doesn’t align perfectly with every qualification in the job description, we encourage you to apply anyway. You may be just the right candidate for this or other roles.

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which you are applying.

About ScienceLogic

ScienceLogic empowers intelligent, automated IT operations, freeing up time and resources, and driving business outcomes with actionable insights. ScienceLogic’s AIOps platform sees broadly across clouds and on-premises, enabling business service visibility with relationship mapping, and workflow automation to eliminate manual tasks. Trusted by thousands of organizations across the globe, ScienceLogic’s technology has been proven for scale by the world’s largest service providers, enterprises and government agencies.

www.sciencelogic.com

All ScienceLogic employees have the responsibility to protect information assets, adhere to access controls, report suspicious activity, and comply with security and privacy policies.

See more jobs at ScienceLogic

Site Reliability Engineer (Bridge) HUN, Budapest, Remote

+30d

LTGBudapest, HU - Remote

Lambda ● jira ● terraform ● slack ● ruby ● typescript ● kubernetes ● AWS

LTG is hiring a Remote Site Reliability Engineer (Bridge) HUN, Budapest, Remote

People Matter Most!

We are a global team of Engineers, Product Managers, Designers, and Program Managers across Hungary, the US, and many other countries. We help our customers create work cultures people love.

About the Product

So here are our questions to you:

Do you have a “Challenge Accepted” attitude?

You belong with us, if you are:

A problem solver who asks questions to get at the core issue the team is grappling with before deciding on a solution and a pragmatist who knows how to make trade-offs to solve challenges while building an architecture that scales for the future.
An owner who is capable of leading and delivering complex projects involving multiple teams while also caring about cloud operations for dozens of services across multiple regions, environments, and language stacks.
A builder who loves implementing automation to reduce toil and enable healthy systems by default and building tools and resources for upskilling other engineering teams to make service creation and maintenance self-service.
A watcher who likes configuring observability systems to identify incidents before they happen, respond to incidents, and contribute to a continuous improvement culture with occasional participation in 24/7 on-call rotations.
A learner who loves to learn new things and for whom self-improvement is encoded in their DNA.
A mentor who supports the development and growth of their colleagues.

Knowledge is power; are you armored?

Here’s our tech stack - what you will learn:

At least one modern programming language (Java/Kotlin, Ruby, React & Typescript)
Cloud-based providers (AWS, Kubernetes, Aurora, EKS, Lambda, Pulsar and Apigee)
Cloud networking configuration (VPCs, security groups, load balancers, DNS, etc).
Configuration-as-code (Terraform)
System observability (Datadog, Sentry)
CI/CD: GitHub, Spinnaker
CMO: SAFe, JIRA, Confluence, Slack, GSuite

Do you like things to be in balance?

Our offer focuses on your:

Healthy work-life balance: We have a great office at MOM Park where you are welcome, but there is no mandate to come in on a regular basis. Our employees enjoy the freedom to manage their working hours.
Personal growth: We want to bring out the best in you through learning days, quarterly hack weeks, LinkedIn Learning, mentorship, a career development plan and training opportunities from the first day.
Financial stability: We offer you a competitive salary package (1.4 - 1.9M HUF gross / month depending on your seniority), a bonus (based on the performance of the company), a comprehensive healthcare package provided by Medicover, a SZÉP card, and other fringe benefits.

See more jobs at LTG

Site Reliability Engineer II

+30d

Signify HealthDallas, TX, Remote

Design ● mobile ● azure ● c++ ● kubernetes ● python ● AWS

Signify Health is hiring a Remote Site Reliability Engineer II

How will this role have an Impact?

Join Signify Health's vibrant Site Reliability Engineering team as a Site Reliability Engineer. We’re seeking passionate individuals from diverse technical backgrounds. Reporting to the Manager of Site Reliability Engineering, we offer a collaborative environment that values each team member's unique contribution and fosters an inclusive culture.

Your Role:

Develop strategies to improve the stability, scalability, and availability of our products.
Maintain and deploy observability solutions to optimize system performance.
Collaborate with cross-functional teams to enhance operational processes and service management.
Design, build, and maintain application stacks for product teams.
Create sustainable systems and services through automation.

Skills We’re Seeking:

An eagerness to grow and collaborate in the field of Site Reliability Engineering.
Strong familiarity with cloud environments (Azure, AWS, or GCP) and a desire to develop further expertise.
Intermediate understanding of scripting languages, preferably with exposure to Bash or Python, and programming languages, preferably with exposure to Golang.
Novice understanding of infrastructure as code, preferably with exposure to Terraform.
Novice understanding of Kubernetes and containerization technologies.
Novice understanding of CI/CD principles and willingness to guide and enforce best practices.
Novice understanding of Site Reliability and observability principles, preferably with exposure to New Relic.
A proactive approach to identifying problems, performance bottlenecks, and areas for improvement.

The base salary hiring range for this position is $72,100 to $125,600. Compensation offered will be determined by factors such as location, level, job-related knowledge, skills, and experience. Certain roles may be eligible for incentive compensation, equity, and benefits.
In addition to your compensation, enjoy the rewards of an organization that puts our heart into caring for our colleagues and our communities. Eligible employees may enroll in a full range of medical, dental, and vision benefits, 401(k) retirement savings plan, and an Employee Stock Purchase Plan. We also offer education assistance, free development courses, paid time off programs, paid holidays, a CVS store discount, and discount programs with participating partners.

About Us:

Signify Health is helping build the healthcare system we all want to experience by transforming the home into the healthcare hub. We coordinate care holistically across individuals’ clinical, social, and behavioral needs so they can enjoy more healthy days at home. By building strong connections to primary care providers and community resources, we’re able to close critical care and social gaps, as well as manage risk for individuals who need help the most. This leads to better outcomes and a better experience for everyone involved.

Our high-performance networks are powered by more than 9,000 mobile doctors and nurses covering every county in the U.S., 3,500 healthcare providers and facilities in value-based arrangements, and hundreds of community-based organizations. Signify’s intelligent technology and decision-support services enable these resources to radically simplify care coordination for more than 1.5 million individuals each year while helping payers and providers more effectively implement value-based care programs.

To learn more about how we’re driving outcomes and making healthcare work better, please visit us at www.signifyhealth.com

Diversity and Inclusion are core values at Signify Health, and fostering a workplace culture reflective of that is critical to our continued success as an organization.

We are committed to equal employment opportunities for employees and job applicants in compliance with applicable law and to an environment where employees are valued for their differences.

See more jobs at Signify Health

Senior Site Reliability Engineer

+30d

WebflowU.S. Remote

Sales ● Webflow ● Bachelor's degree ● remote-first ● terraform ● ansible ● mongodb ● c++ ● docker ● typescript ● kubernetes ● python ● AWS ● javascript

Webflow is hiring a Remote Senior Site Reliability Engineer

At Webflow, our mission is to bring development superpowers to everyone. Webflow is a Website Experience Platform (WXP) that empowers modern marketing teams to visually build, manage, and optimize stunning websites. With AI-driven personalization baked in, Webflow enables teams to significantly boost conversion rates, translating directly into measurable business growth. From independent designers and creative agencies to Fortune 500 companies, millions worldwide use Webflow to be more nimble, creative, and collaborative.

We’re looking for a Senior Site Reliability Engineer to improve reliability and stability of Webflow’s customer-facing, production infrastructure, serving millions of page views per hour. Our product is used by over 2 million users worldwide across 190 countries, and you’ll help ensure our platform is secure and scalable for these users as tens of thousands of projects are launched on Webflow each month.

About the role

Location: Remote-first (United States; BC & ON, Canada)
Full-time
Permanent
Exempt
The cash compensation for this role is tailored to align with the cost of labor in different geographic markets. We've structured the base pay ranges for this role into zones for our geographic markets, and the specific base pay within the range will be determined by the candidate’s geographic location, job-related experience, knowledge, qualifications, and skills.

United States (all figures cited below in USD and pertain to workers in the United States)

Zone A: $158,000 - $218,000
Zone B: $149,000 - $205,000
Zone C: $139,000 - $192,000

Canada (All figures cited below in CAD and pertain to workers in ON & BC, Canada)

CAD 180,000 - CAD 248,000

Please visit our Careers page for more information on which locations are included in each of our geographic pay zones. However, please confirm the zone for your specific location with your recruiter.
Reporting to the Engineering Manager

As a Senior Site Reliability Engineer, you’ll …

Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
Participate in and continuously improve on-call and incident response processes.

In addition to the responsibilities outlined above, at Webflow we will support you in identifying where your interests and development opportunities lie and we'll help you incorporate them into your role.

About you

You’ll thrive as a Senior Site Reliability Engineer if you have …

Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
Enthusiasm for mentoring and sponsoring less-experienced engineers.

It would be a bonus if you had even one of the following …

Experience with Kubernetes, Nginx, Terraform, or Pulumi specifically.
Experience improving on-call and incident response processes for Engineering.
Experience working in high-compliance environments or a special interest in security engineering. We are not the security team, but we are always looking to improve our security posture!

Our Core Behaviors:

Obsess over customer experience. We deeply understand what we’re building and who we’re building for and serving. We define the leading edge of what’s possible in our industry and deliver the future for our customers.
Move with heartfelt urgency. We have a healthy relationship with impatience, channeling it thoughtfully to show up better and faster for our customers and for each other. Time is the most limited thing we have, and we make the most of every moment.
Say the hard thing with care. Our best work often comes from intelligent debate, critique, and even difficult conversations. We speak our minds and don’t sugarcoat things, and we do so with respect, maturity, and care.
Make your mark. We seek out new and unique ways to create meaningful impact, and we champion the same from our colleagues. We work as a team to get the job done, and we go out of our way to celebrate and reward those going above and beyond for our customers and our teammates.

Benefits & wellness

Equity ownership (RSUs) in a growing, privately-owned company
100% employer-paid healthcare, vision, and dental insurance coverage for employees and dependents (full-time employees working 30+ hours per week), as well as Health Savings Account/Health Reimbursement Account, dependent care Flexible Spending Account (US only), dependent on insurance plan selection where applicable in the respective country of employment; Employees may also have voluntary insurance options, such as life, disability, hospital protection, accident, and critical illness where applicable in the respective country of employment
12 weeks of paid parental leave for both birthing and non-birthing caregivers, as well as an additional 6-8 weeks of pregnancy disability for birthing parents to be used before child bonding leave (where local requirements are more generous employees receive the greater benefit); Employees also have access to family planning care and reimbursement
Flexible PTO with a mandatory annual minimum of 10 days paid time off for all locations (where local requirements are more generous employees receive the greater benefit), and sabbatical program
Access to mental wellness and professional coaching, therapy, and Employee Assistance Program
Monthly stipends to support health and wellness, smart work, and professional growth
Professional career coaching, internal learning & development programs
401k plan and pension schemes (in countries where statutorily required), plus financial wellness benefits, like CPA or financial advisor coverage
Discounted Pet Insurance offering (US only)
Commuter benefits for in-office employees

Temporary employees are not eligible for paid holiday time off, accrued paid time off, paid leaves of absence, or company-sponsored perks unless otherwise required by law.

Remote, together

At Webflow, equality is a core tenet of our culture. We are an Equal Opportunity (EEO)/Veterans/Disabled Employer and are committed to building an inclusive global team that represents a variety of backgrounds, perspectives, beliefs, and experiences. Employment decisions are made on the basis of job-related criteria without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by applicable law. Pursuant to the San Francisco Fair Chance Ordinance, Webflow will consider for employment qualified applicants with arrest and conviction records.

Stay connected

Not ready to apply, but want to be part of the Webflow community? Consider following our story on our Webflow Blog, LinkedIn, X (Twitter), and/or Glassdoor.

Please note:

We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Upon interview scheduling, instructions for confidential accommodation requests will be administered.

To join Webflow, you'll need a valid right to work authorization depending on the country of employment.

If you are extended an offer, that offer may be contingent upon your successful completion of a background check, which will be conducted in accordance with applicable laws. We may obtain one or more background screening reports about you, solely for employment purposes.

For information about how Webflow processes your personal information, please review Webflow’s Applicant Privacy Notice.

See more jobs at Webflow