Manager AI System Infrastructure and MLOps Engineering
Company: Promote Project
Location: Redwood City
Posted on: November 13, 2024
Job Description:
Manager AI System Infrastructure and MLOps Engineering
Location: Redwood City, California, United States
Salary: 30000 - 80000 a year (US Dollars)
Description
The Team
The AI/ML team is funding and building one of the largest computing
systems dedicated to nonprofit life science research in the world.
This new effort will provide the scientific community with access
to predictive models of healthy and diseased cells, which will lead
to groundbreaking new discoveries that could help researchers cure,
prevent, or manage all diseases by the end of this century.

As a
hands-on Manager of the AI System Infrastructure and MLOps
Engineering team, you will join the AI/ML and Data
Engineering team in CZI Central Tech, with responsibility for
the stable and scalable operation of our leading-edge GPU Cloud
Compute Cluster. This cluster supports our AI Researchers in their
development and training of state-of-the-art models in artificial
intelligence and machine learning to solve important problems in
the biomedical sciences aligned with CZI's mission, contributing to
greater understanding of human cell function.

The Opportunity
As the
Engineering Manager of the AI Infrastructure and MLOps Engineering
team, you will be responsible for a variety of MLOps and AI
development projects that empower our AI Researchers and help
accelerate biomedical research across the entire AI
lifecycle. You will guide our AI Systems Infrastructure and MLOps
efforts focused on our GPU Cloud Cluster operations, ensuring that
our systems are highly utilized, performant, and stable. You will
be working in collaboration with other members of our own AI
Engineering team as well as the Science Initiative's AI Research
team as they iterate on and train their deep learning code, optimizing
systems operations and helping to troubleshoot problems
encountered by jobs running on the cluster.

What You'll Do
- Help build out the MLOps and Systems Infrastructure
Engineering team, growing the team to support the large-scale
capacity systems and AI training efforts we will be
undertaking.
- Drive our MLOps processes and System Infrastructure Engineering
efforts to ensure that our GPU Cloud computing systems are highly
utilized and stable, and proactively guide our team in implementing
the instrumentation and observability tooling integral to our AI
Platform.
- Own the on-call efforts for our GPU Cloud computing systems,
building out the MLOps and Systems Infrastructure Engineering
alerting and monitoring efforts for our leading-edge
Kubernetes-based AI platform, including troubleshooting problems encountered
on the GPU platform infrastructure and with jobs running on the
cluster and computing systems.
- Take responsibility for a variety of AI/ML development
infrastructure, instrumentation, and telemetry projects that
empower our team in supporting our users across the AI/ML
lifecycle, taking a key role in simplifying and optimizing the
systems and processes that are integral to our GPU Cloud Cluster
operations, in a hybrid "MLOps meets SRE" operations
model.
- Mentor and manage your team in fulfilling their roles to
the best of their abilities, provide skill and career coaching to
help team members keep growing along their own career and life
paths, and keep the team engaged in meaningful and interesting
projects in service of our north star philanthropic mission.

What You'll Bring
- Hands-on AI/ML Model Training Platform Operations experience in
an environment with demanding data and systems platform
challenges.
- MLOps experience working with medium to large-scale GPU
clusters in Kubernetes, HPC environments, or large-scale cloud-based
ML deployments (Kubernetes preferred).
- BS, MS, or PhD degree in Computer Science or a related
technical discipline or equivalent experience.
- 2+ years of experience managing MLOps teams.
- 7+ years of relevant coding and systems experience.
- 7+ years of systems Architecture and Design experience, with a
broad range of experience across Data, AI/ML, Core Infrastructure,
and Security Engineering.
- Strong understanding of scaling containerized applications on
Kubernetes or Mesos, including solid understanding of AI/ML
training with containers using secure AMIs and continuous
deployment systems that integrate with Kubernetes or Mesos
(Kubernetes preferred).
- Proficiency with Amazon Web Services (AWS), Google Cloud
Platform (GCP), or Microsoft Azure, and experience with On-Prem and
Colocation Service hosting environments.
- Solid coding ability with a systems language such as Rust,
C/C++, C#, Go, Java, or Scala.
- Extensive experience with a scripting language such as Python,
PHP, or Ruby (Python Preferred).
- Working knowledge of Nvidia CUDA and AI/ML custom
libraries.
- Knowledge of Linux systems optimization and
administration.
- Understanding of Data Engineering, Data Governance, Data
Infrastructure, and AI/ML execution platforms.
- PyTorch, Keras, or TensorFlow experience is a strong
nice-to-have.

Compensation
The Redwood City, CA base pay range for this role
is $214,000 - $321,000. New hires are typically hired into the
lower portion of the range, enabling employee growth in the range
over time. Actual placement in range is based on job-related skills
and experience, as evaluated throughout the interview process. Pay
ranges outside Redwood City are adjusted based on cost of labor in
each respective geographical market. Your recruiter can share more
about the specific pay range for your location during the hiring
process.

Benefits for the Whole You
We're thankful to have an
incredible team behind our work. To honor their commitment, we
offer a wide range of benefits to support the people who make all
we do possible.
- CZI provides a generous 100% match on employee 401(k)
contributions to support planning for the future.
- Annual funding that employees can use in whatever way is most
meaningful for them and their families, such as housing, student loan
repayment, childcare, commuter costs, or other life needs.
- CZI Life of Service Gifts are awarded to employees to "live the
mission" and support the causes closest to them.
- Paid time off to volunteer at an organization of your
choice.
- Relocation support for employees who need assistance moving to
the Bay Area.

We believe that the strongest teams and best thinking
are defined by the diversity of voices at the table. We are
committed to fair treatment and equal access to opportunity for all
CZI team members and to maintaining a workplace where everyone
feels welcomed, respected, supported, and valued. Learn about our
diversity, equity, and inclusion efforts.

If you're interested in a
role but your previous experience doesn't perfectly align with each
qualification in the job description, we still encourage you to
apply as you may be the perfect fit for this or another role.