Hi there 👋

Welcome to my blog!

Distributed communication for GPUs (part 2)

Introduction to collective communication operations used for distributed training.

September 13, 2025 · 13 min · 2567 words

Distributed communication for GPUs (part 1)

Introduction to distributed communication for GPUs.

September 9, 2025 · 11 min · 2147 words

Authenticating AWS with EKS

How to authenticate EKS workloads with AWS services.

August 16, 2025 · 3 min · 584 words

Choosing a batch size and provider for LLM training

Notes on choosing an appropriate batch size and compute for training LLMs.

June 27, 2025 · 4 min · 756 words

Ultra-scale Playbook - DeepSpeed ZeRO

Notes on training LLMs using sharding strategies.

June 21, 2025 · 8 min · 1521 words