Harnham

Senior Machine Learning Engineer

Company: Harnham

Location: Remote

Posted on: June 16

Title: Sr. Machine Learning Engineer

Location: Fully remote or optional hybrid in South SF. Strong preference for candidates in the PST or MST time zones.

Overview of Role:

We are seeking an ML Engineer to lead the development and maintenance of all machine learning (ML) cloud infrastructure, supporting large-scale training and inference processes. This role involves building and managing GPU clusters in the cloud, ensuring scalability, and contributing to distributed training and inference systems. The ideal candidate will work closely with the ML and Engineering teams to create an ML Platform that supports rapid experimentation and deployment at scale, enabling transformative research and impactful solutions.

Company Description:

Our client focuses on AI-driven cancer research and produces all of its own data in house. They are a Series A startup and recently closed a round of funding.

Key Responsibilities

  • Build and Maintain Infrastructure: Develop, operate, and maintain GPU clusters in the cloud (AWS), optimizing them for large-scale ML training and inference workloads.
  • Distributed Systems: Operate and enhance systems for distributed training and inference of ML models.
  • Collaborate Across Teams: Work with the ML team to define requirements for analysis of model results, and with the Engineering team to build scalable solutions for rapid access and analysis of large-scale inference results.
  • ML Platform Development: Collaborate on the design and implementation of an ML platform that supports both fast research iterations and seamless scaling for production.

Skills and Experience

  • Years of Experience: Minimum of 3 years of industry experience in ML or related roles.
  • Industry Expertise: No specific industry required, but a strong technical foundation is essential. Candidates interested in working for a cancer therapeutics company will find this role especially rewarding.

Technical Skills

  • AWS Expertise: Experience with AWS, particularly SageMaker HyperPod.
  • GPU Clusters: Hands-on experience managing large GPU clusters in the cloud.
  • Kubernetes: Proven experience with cloud infrastructure managed using Kubernetes.
  • Frameworks: Familiarity with distributed computing frameworks such as Ray is highly desirable (other frameworks are acceptable).
  • Data Optimization: Experience optimizing data loaders is a plus.
  • Scaled Deployment: Proven experience deploying systems for scaled inference.
  • PyTorch: Experience with PyTorch data loaders is a big plus.
  • Infrastructure as Code (IaC): Experience is good to have, especially with tools like Terraform.

Key Attributes

  • Strong Communication: Excellent communication skills to effectively interact with technical and non-technical stakeholders in a fast-paced startup environment.
  • Collaborative and Adaptable: Ability to collaborate with diverse team members across ML, Engineering, and leadership teams.

Benefits:

  • Full healthcare coverage for yourself and all dependents
  • Great parental leave program
  • 401(k) Matching

All applicants must be fully authorized to work in the U.S. to qualify. Our client is unable to accommodate any type of sponsorship at this time.