Machine-Learning Engineer · Sarvam AI / frontier coding models

Ritvik Aryan Kalra

I build the full stack that trains coding agents — sandboxes, environments, data, rewards, training.

ritvikkalra2000@gmail.com · github.com/rvk7895 · linkedin.com/in/rvk7895

00 / Orientation

The whole training stack.

Sandboxes, environments, data, rewards, training: all for Sarvam's coding models. Read top to bottom; take your time.

01

Current work

Sarvam AI · 2025–

The end-to-end loop behind Sarvam's coding models — execution fleet underneath, the task pipeline on top, training and measurement closing the loop.

01

Sandbox infra

Self-built execution fleet for agent rollouts — thousands of concurrent sandboxes.

1000s of concurrent sandboxes · self-built fleet

Every agent rollout in the building lands on this floor: a self-hosted execution fleet that scales to thousands of concurrent sandboxes, kept warm and fast to place. Capacity, images, observability, and on-call are owned end to end.

sandboxes at scale · self-hosted · fleet observability

02

The task factory

Raw software work to training-grade tasks, automatically.

build_tasks(raw_swe) → envs · data · rewards — one automated pipeline

a · Environments

Real software work becomes runnable, graded RL environments — automatically, at scale.

b · Curation

Synthetic generation plus ruthless filtering — most candidate data is cut, on purpose.

c · Rewards

Every reward signal is stress-tested against gaming before it shapes a gradient.

One automated line from raw software work to training-grade tasks: environment construction, synthetic data and curation, and reward signals stress-tested against gaming before they ever shape a gradient. This is the factory the run trains on.

environments at scale · synthetic + curation · anti-gaming rewards

03

Training

The full post-training ladder, run end to end — SFT through on-policy RL.

post-training, end to end · pip-sql-1.3b (’23) → Sarvam coding models

The full post-training ladder — SFT through on-policy RL — run end to end, landing meaningful gains on internal benchmarks. The arc runs from pip-sql-1.3b in ’23, a 1.3B model matching models 7× larger, to the coding models behind Sarvam’s agents today. Owning data → rewards → weights means a regression gets chased to its source.

SFT → on-policy RL · data → weights · pip-sql → Sarvam

04

Evaluation

The measurement layer that keeps every reported gain honest.

benchmarking infra · per-trace forensics · Samvaad V2V evals

Eval infrastructure for benchmarking the coding models — rollouts at scale, per-trace diagnosis of where an agent got stuck, integrity checks against contamination — plus Samvaad, evals for voice-to-voice agents. A number that can’t be defended doesn’t ship.

benchmarking infra · per-trace forensics · Samvaad V2V

The road here

2022 → now. Everything above stands on this — voice agents on live traffic, a founding-MLE model, and the systems work underneath it all.

02

Before this

2022–25

The experience the training stack is built on, most recent first.

01 · Sarvam · voice agents · ’24–25
telephony-scale traffic · live V2V platform

Tuned the agentic LLM behind a voice-to-voice platform serving telephony-scale production traffic, applying GEPA to that traffic and building the Samvaad evals and monitoring that made doing so safe.

telephony-scale V2V · GEPA on prod · Samvaad evals
02 · Pipable · founding MLE · ’23–24
pip-sql-1.3b ≈ models 7× larger

Founding ML engineer. Led pip-sql-1.3b — an NL→SQL model built with RL and deep learning, matching models 7× larger — and shipped pip-library-etl-1.3b, turning codebases into retrievable, model-ready context.

pip-sql-1.3b · NL → SQL · RL · 7× smaller
03 · IIIT Hyderabad · systems · ’22–23
Slurm GPU cluster · Sprinklr internship

Student sysadmin running IIIT-H’s Slurm GPU cluster for ML workloads, plus a reverse proxy still used by alumni worldwide and a course-management migration. At Sprinklr: Kafka health monitoring and a test-analytics pipeline used across a large engineering org.

Slurm / GPU · reverse proxy · Kafka / Elasticsearch
03

Research

Algorithmic fairness · 2023–

Peer-reviewed work on algorithmic bias and gender disparities in recommendation systems.

P1 · Exploring Gender Disparities in Bumble’s Match Recommendations
SIG GlobDev Pre-ICIS 2023 · arXiv:2312.09626

A mixed-methods study of gendered disparities in match recommendations on a large dating platform, combining quantitative analysis of recommendation outcomes with qualitative reading of how the system treats users differently by gender. Presented at the SIG GlobDev Pre-ICIS Workshop 2023.

P2 · Unveiling Algorithmic Bias and Bridging Gender Disparities: Case Studies from a Gaming and a Dating Platform in India
IJGS · in press

A pair of algorithmic-bias case studies from a gaming platform and a dating platform in India, arguing that the same structural bias patterns recur across very different products and pointing toward mitigation. International Journal of Gender Studies (IJGS) — in press.

04

Off the clock

the human behind the stack
🎹 keytar · ⌨️ mechanical keyboards · 🎲 board games · coffee

I play the keytar, build mechanical keyboards, lose at board games, and take coffee too seriously. I also keep a blog — coding practices, Effective Java notes, and a running keyboard build log.

The run · reward × step

Illustrative chart, steps 0 → 50k: the work read as one training run. Quality rose over the years, so the reward climbs, with each career milestone marked as a checkpoint along the curve.