MACHINE LEARNING ENGINEER (DISTRIBUTED TRAINING)

Job description

Who we are.
CloudWalk is a fintech company reimagining the future of financial services.
We are building intelligent infrastructure powered by AI, blockchain, and thoughtful design.
Our products serve millions of entrepreneurs across Brazil and the US every day, helping them grow with tools that are fast, fair, and built for how business actually works.
Learn more at cloudwalk.io.
Who We’re Looking For.
We’re looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models.
You’ll work inside our GPU cluster to help researchers train and scale foundation models using frameworks like Hugging Face Transformers, Accelerate, DeepSpeed, FSDP, and others.
Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.
This role is not research - it's about building and scaling the systems that let researchers move fast and models grow big.
You’ll work closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.
What You'll Do.
  • Own the architecture and maintenance of our distributed training pipeline
  • Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate
  • Design and debug multi-node/multi-GPU training runs (Kubernetes-based)
  • Optimize training performance: memory usage, speed, throughput, and cost
  • Help manage experiment tracking, artifact storage, and resume logic
  • Build reusable, scalable training templates for internal use
  • Collaborate with researchers to bring their training scripts into production shape
What We’re Looking For.
  • Expertise in distributed training: experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups
  • Strong PyTorch background: comfortable writing custom training loops, schedulers, or callbacks
  • Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it
  • Infra literacy: you understand how GPUs, containers, and job schedulers work together, and you can debug cluster issues, memory bottlenecks, or unexpected slowdowns
  • Resilience mindset: you write code that can checkpoint, resume, log correctly, and keep running when things go wrong
  • Collaborative builder: you don’t mind digging into other people’s scripts, making them robust, and helping everyone train faster
Bonus Points.
  • Experience with Kubernetes-based GPU clusters and Ray
  • Experience with experiment tracking (MLflow, W&B)
  • Familiarity with mixed precision, ZeRO stages, model parallelism
  • Comfort with CLI tooling, profiling, logging, and telemetry
  • Experience with dataloading bottlenecks and dataset streaming
How We Hire.
  • Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)
  • Technical interview: deep dive into distributed training theory and reasoning (no code)
  • Cultural interview
If you are not willing to take an online quiz, do not apply.
If you’ve trained LLMs before - or helped others do it better - this role is for you.
Even if you don’t check every box, if you’re confident working with distributed compute and real-world LLM workloads, we want to hear from you.

Job details

Company
  • CloudWalk
Location
  • Anywhere in Brazil
Address
  • Not specified
Contract type
  • Not specified
Publication date
  • 16/08/2025
Expiration date
  • 04/01/2026