Site Reliability Engineer (SRE) / Production Engineer (PE) - Kubernetes & Cloud Infrastructure
Company: Fireworks
Location: Redwood City
Posted on: February 18, 2025
Job Description:
About Us:
Here at Fireworks, we're building the future of
generative AI infrastructure. Fireworks offers the generative AI
platform with the highest-quality models and the fastest, most
scalable inference. Independent benchmarks have rated our LLM
inference the fastest, and our research projects, such as our own
function-calling and multimodal models, have been gaining strong
traction. Fireworks is funded by top investors, including
Benchmark and Sequoia, and we're an ambitious, fun team composed
primarily of veterans from PyTorch and Google Vertex AI.
The Role:
We're seeking a highly skilled SRE/PE with deep expertise in
Kubernetes (k8s), cloud networking, and infrastructure automation.
This role will focus on reducing incident response time,
implementing auto-remediation, optimizing auto-scaling, and
improving cluster efficiency and service health. You'll design
systems that balance performance, cost, and reliability while
working onsite with our Redwood City team.
Key Responsibilities:
- Incident Response & Reliability Engineering:
  - Drive initiatives to reduce incident response time through
    improved monitoring, alerting, and automated remediation.
  - Build self-healing systems and playbooks for common failure
    scenarios.
  - Lead blameless post-mortems and implement preventative
    measures.
- Kubernetes & GPU Cluster Optimization:
  - Manage and optimize GPU-enabled Kubernetes clusters for AI/ML
    workloads, focusing on cost-performance efficiency, auto-scaling,
    and resource utilization.
  - Debug performance bottlenecks in distributed systems (e.g.,
    network, storage, GPU scheduling).
- Cloud Networking & Service Health:
  - Strengthen service health by refining cloud networking stacks
    (VPCs, load balancers, service meshes) and ensuring low-latency
    communication.
  - Design fault-tolerant architectures to minimize downtime.
- Monitoring & Observability:
  - Enhance service monitoring with tools like Prometheus, Grafana,
    and custom metrics pipelines.
  - Implement predictive analytics to proactively address system
    health risks.
- Automation & Infrastructure-as-Code (IaC):
  - Build automation for cluster provisioning, scaling, and
    recovery using Terraform, Argo, and CI/CD pipelines.
  - Develop tools to streamline operational workflows (e.g.,
    automated rollbacks, canary deployments).
Minimum Qualifications:
- 3+ years in SRE/PE/DevOps roles with production-grade
Kubernetes experience.
- Proficiency in cloud networking (AWS/GCP/Azure VPCs, firewalls,
DNS) and service monitoring (Prometheus, Alertmanager,
Grafana).
- Hands-on experience with incident management and improving
system reliability/SLOs.
- Strong scripting/coding skills (Python/Go/Bash) for automation
and tooling.
- Familiarity with object storage (S3, GCS) and data pipeline
integration.
Preferred Qualifications:
- Experience with GPU clusters (NVIDIA GPUs, MIG, CUDA) and AI/ML
workloads.
- Knowledge of auto-scaling technologies (K8s HPA/VPA) and
auto-remediation frameworks.
- Expertise in service meshes (Istio).
Why Fireworks AI?
- Solve Hard Problems: Tackle challenges at the forefront of AI
infrastructure, from low-latency inference to scalable model
serving.
- Build What's Next: Work with bleeding-edge technology that
impacts how businesses and developers harness AI globally.
- Ownership & Impact: Join a fast-growing, passionate team where
your work directly shapes the future of AI: no bureaucracy, just
results.
- Learn from the Best: Collaborate with world-class engineers and
AI researchers who thrive on curiosity and innovation.