I've been playing Dota 2 on and off for years. It's a game where ten players pick heroes in an alternating draft, and the composition of those ten heroes can determine the outcome before the game even starts. Ancient Apparition counters Alchemist. Silencer ruins Enigma's day. A team of five melee heroes against an Earthshaker is asking for trouble.

The problem is, I'm not a professional analyst. When it's my turn to pick and I'm staring at 128 heroes (we just got the new frog hero this year, Largo), my brain does a poor job of computing the optimal choice given the current draft state, the enemy's likely counters, and my team's needs. I usually just slam Invoker down mid and hope for the best.

So, I decided to build a system that could do this for me. A machine learning model that, given a partial draft, could recommend the statistically best hero to maximize my team's winning odds. How hard could it be?

Part 1: The Data Collection Treadmill

Every ML project starts the same way: you need data. Lots of it. Fortunately, OpenDota provides a public API with match data. The plan was simple: fetch matches, store them in a database, and use them for training.

The first version of my data collection script was... functional. It worked, but it was the kind of code you write when you just want to see if something is possible. Single-row database inserts. No duplicate handling. No way to resume if interrupted. It was a prototype that had overstayed its welcome.

I rewrote the entire pipeline from scratch with a three-layer architecture: a generic database wrapper, a schema definition layer, and the collection script itself. The new version could backfill gaps, batch insert 100 rows at a time (100x faster), and handle rate limiting with exponential backoff. I let it run for a few days and collected 226,520 matches. That's 2.2 million training examples after augmentation. More than enough.
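To make the "batch insert plus backoff" idea concrete, here is a minimal sketch of the two helpers. The table name, column names, and the shape of the API call are illustrative assumptions, not the real pipeline:

```python
import sqlite3
import time

BATCH_SIZE = 100  # rows per transaction, as in the rewritten pipeline

def insert_batch(conn, rows):
    """Insert a batch of match rows in a single transaction.

    INSERT OR IGNORE handles duplicates, so the collector can safely
    re-process overlapping match ID ranges when resuming after an
    interruption. Schema here is a stand-in for the real one.
    """
    conn.executemany(
        "INSERT OR IGNORE INTO matches (match_id, radiant_win, picks) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

def fetch_with_backoff(fetch, max_retries=5):
    """Retry a rate-limited API call with exponential backoff.

    `fetch` is any zero-argument callable; IOError stands in for an
    HTTP 429 from the API client.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except IOError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("API still rate-limiting after retries")
```

Batching matters because each `commit()` is a disk sync; amortizing it over 100 rows is where the roughly 100x speedup comes from.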

Part 2: The Architecture Rabbit Hole

With data in hand, I needed to decide on a model architecture. This is where things got interesting.

The naive approach would be to treat this as a simple classification problem: given a draft, predict which hero wins. But that's not what I wanted. I needed a ranked list of recommendations, not just a single answer.

After some brainstorming (and a healthy dose of overthinking), I settled on a Win Probability Estimator. The model would predict P(Radiant wins | draft state). At inference time, I'd simulate adding each unpicked hero to the draft, predict the win probability, and rank them. It's the same principle behind AlphaGo's value network for board evaluation, and it felt right.
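The inference-time ranking loop looks roughly like this. It assumes a trained `model` that maps a 10-slot hero ID tensor to a Radiant win logit; the slot layout (0-4 Radiant, 5-9 Dire, 0 meaning "empty") and the function names are illustrative:

```python
import torch

def rank_candidates(model, draft, candidate_ids, radiant_turn=True):
    """Try each unpicked hero in the next empty slot and rank by win probability.

    Sketch only: assumes slots 0-4 belong to Radiant, 5-9 to Dire, and
    ID 0 marks an empty slot.
    """
    team_slots = range(0, 5) if radiant_turn else range(5, 10)
    empty = next(i for i in team_slots if draft[i] == 0)
    scores = []
    for hero_id in candidate_ids:
        trial = draft.clone()
        trial[empty] = hero_id
        with torch.no_grad():
            p_radiant = torch.sigmoid(model(trial.unsqueeze(0))).item()
        # A Dire pick wants to *minimize* Radiant's win probability.
        score = p_radiant if radiant_turn else 1.0 - p_radiant
        scores.append((hero_id, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

One forward pass per candidate is cheap enough here: even scoring every unpicked hero is on the order of a hundred forward passes, which a GPU handles in milliseconds.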

The architecture I ended up choosing was a Transformer. Hero IDs are embedded (with position embeddings to encode team membership) and passed through a multi-head attention mechanism. The core challenge of evaluating a Dota draft is that hero value is entirely contextual: you need to capture both intra-team synergies (how well your five heroes combo together) and inter-team counters (how well they shut down the enemy). The key insight is that a single Transformer over all 10 slots handles this naturally. The attention mechanism allows heroes on Radiant to attend to their own teammates to model synergy, while simultaneously attending across the aisle to heroes on Dire to evaluate counterpicking potential, all before decoding into a single win probability.
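A minimal sketch of this idea in PyTorch (not the exact model; the sizes, pooling, and head are illustrative):

```python
import torch
import torch.nn as nn

N_HEROES, D = 200, 64  # illustrative sizes

class DraftNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Index 0 is reserved as padding for empty draft slots.
        self.hero_emb = nn.Embedding(N_HEROES, D, padding_idx=0)
        # Slot embeddings encode team membership (slots 0-4 Radiant, 5-9 Dire).
        self.slot_emb = nn.Embedding(10, D)
        layer = nn.TransformerEncoderLayer(
            d_model=D, nhead=4, dim_feedforward=128,
            batch_first=True, norm_first=True,  # Pre-LN (see Part 4)
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, 1)

    def forward(self, hero_ids):
        b = hero_ids.size(0)
        positions = torch.arange(10, device=hero_ids.device).unsqueeze(0).expand(b, -1)
        x = self.hero_emb(hero_ids) + self.slot_emb(positions)
        # Mask empty slots so attention only sees picked heroes.
        pad_mask = hero_ids == 0
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over picked heroes, then decode a single win logit.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1)
        return self.head(pooled).squeeze(-1)
```

Because every slot attends to every other slot, the same attention weights can express "my carry synergizes with my support" and "their Silencer ruins my Enigma" without any hand-built synergy or counter features.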

One design decision I'm particularly proud of: partial draft handling. The model needs to work with anywhere from 1 to 9 heroes picked. The solution was elegant: use a fixed 10-slot representation with zeros for empty slots, and mask them out during attention. The model learns to ignore the padding and focus only on picked heroes. It's the same trick BERT uses for variable-length text, one I picked up while re-implementing the original BERT model at a past internship.
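Concretely, packing a partial draft into the fixed representation looks something like this (the slot layout is the same illustrative one as above, with ID 0 meaning "empty"):

```python
import torch

def encode_draft(radiant_picks, dire_picks):
    """Pack partial picks into a fixed 10-slot tensor plus a padding mask.

    Sketch: slots 0-4 are Radiant, slots 5-9 are Dire, and zeros pad the
    empty slots. The boolean mask follows PyTorch's src_key_padding_mask
    convention: True means "ignore this slot in attention".
    """
    slots = [0] * 10
    slots[:len(radiant_picks)] = radiant_picks       # slots 0-4: Radiant
    slots[5:5 + len(dire_picks)] = dire_picks        # slots 5-9: Dire
    hero_ids = torch.tensor(slots)
    pad_mask = hero_ids == 0
    return hero_ids, pad_mask
```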

Part 3: The Hero ID Incident

With the architecture designed and the training pipeline ready, I hit "run" and waited for the magic to happen.

    CUDA error: device-side assert triggered
    Assertion `srcIndex < srcSelectDimSize` failed

Ah. The classic "your data doesn't fit your model" error.

I had set N_HEROES = 150, assuming Dota 2 had around 128 heroes with some buffer for future releases. But when I actually checked the data, the maximum hero ID was 155. Dota 2 hero IDs aren't consecutive. There are gaps. Hero ID 24 doesn't exist. IDs 115-118 are missing. The game has 127 unique heroes, but they're scattered across a range of 155 IDs.

This led to an interesting design conversation: how do you future-proof a model for a game that releases new heroes every year?

The solution I went with was to set N_HEROES = 200, giving a buffer of 45 IDs. It's an engineer's solution. When a new hero is released, the model will work immediately (though the new hero's embedding will be randomly initialized). For better performance, I built a transfer learning utility that can expand the embedding layer while preserving all learned weights, allowing fine-tuning instead of retraining from scratch.
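The core of that transfer learning utility is small. This is a sketch under the assumption that hero embeddings live in a standard `nn.Embedding`; the function name is mine:

```python
import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, new_num: int) -> nn.Embedding:
    """Grow an embedding table, preserving all learned rows.

    The first old_emb.num_embeddings rows are copied verbatim; only the
    new rows keep their random initialization, so the model can be
    fine-tuned instead of retrained from scratch.
    """
    assert new_num >= old_emb.num_embeddings
    new_emb = nn.Embedding(new_num, old_emb.embedding_dim,
                           padding_idx=old_emb.padding_idx)
    with torch.no_grad():
        new_emb.weight[:old_emb.num_embeddings] = old_emb.weight
    return new_emb
```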

It's a small detail, but it's the kind of thing that separates a weekend prototype from a system you can actually use for years.

Part 4: The NaN Incident

With the hero ID issue fixed, training started successfully. For exactly 3,114 batches.

    Warning: NaN loss detected at batch 3114
    Logits range: [nan, nan]
      positions = torch.arange(10, device=self.device).unsqueeze(0).expand(batch_size, -1)
    RuntimeError: CUDA error: device-side assert triggered

NaN (Not a Number) in a loss function is the machine learning equivalent of a kernel panic. It means something, somewhere, has gone catastrophically wrong with the math.

I added debugging to trace the issue. The embeddings were fine. The combined embeddings were fine. But after passing through the Transformer, the output was NaN. This is a known issue with Transformers: if gradients explode during backpropagation, the weights can become infinite, and everything downstream becomes NaN.

The fix was twofold: gradient clipping (cap gradients at a maximum norm of 1.0) and moving layer normalization to before the self-attention and feed-forward blocks -- a classic architectural pivot known as Pre-LN. The gradient clipping prevents the initial explosion, and Pre-LN keeps activations in a stable range, preventing the exact exploding gradient issue I hit. With these changes, the model successfully processed the exact same batch without destabilizing, and training continued smoothly.
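In PyTorch terms, the Pre-LN half is just `norm_first=True` on the encoder layer, and the clipping half is one call in the training step. A sketch of the stabilized step (the model and batch shapes are illustrative):

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, inputs, labels):
    """One training step with gradient clipping against NaN blowups."""
    optimizer.zero_grad()
    logits = model(inputs)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    # Cap the global gradient norm at 1.0 before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

The clipping rescales the whole gradient vector when its norm exceeds 1.0, so the update direction is preserved; only its magnitude is capped.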

The final epoch 1 results were... humbling:

  • train_loss=0.6926 (close to the mathematical random baseline of 0.693)
  • train_acc=0.5300
  • val_loss=0.6912

At first glance, 53% accuracy looks better than random guessing. Then I ran a quick SQL query on my dataset and realized the baseline Radiant win rate is exactly 53%. After one full epoch, my state-of-the-art neural network had essentially just learned to always guess "Radiant wins." Machine learning is truly majestic.

Part 5: The Philosophy of Partial Drafts

One of the more subtle challenges, which I forgot to mention earlier, was data augmentation. I only have complete drafts (10 heroes), but I need the model to handle partial drafts (1-9 heroes). How do you train for that?

The solution is conceptually simple but philosophically interesting: generate partial draft examples from complete matches. For each match, create 10 training examples representing different draft states (1 pick, 2 picks, ..., 10 picks), all with the same label (the final outcome).

This is sound because the match outcome is determined by the complete draft. Partial drafts are just incomplete observations of that complete draft. The label should be valid for all intermediate states.
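The generation step is a few lines. The alternating pick order below is an illustrative assumption (the real sequence depends on the game mode); the point is that each prefix of the draft becomes its own example with the final label:

```python
# Illustrative alternating slot order: Radiant slots are 0-4, Dire 5-9.
PICK_ORDER = [0, 5, 6, 1, 2, 7, 8, 3, 4, 9]

def augment_match(draft, radiant_win):
    """Yield (partial_draft, label) for every prefix of the pick order.

    `draft` is a complete 10-slot hero ID list; all 10 generated
    examples share the match's final outcome as their label.
    """
    examples = []
    partial = [0] * 10  # 0 marks an empty slot
    for slot in PICK_ORDER:
        partial[slot] = draft[slot]
        examples.append((list(partial), radiant_win))
    return examples
```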

The result: 10x data augmentation, and the model learns to handle partial drafts during training. Crucially, to prevent data leakage where 10 highly correlated partial drafts from the same match end up in the same training batch, I shuffle the augmented dataset globally across all epochs. This ensures the model actually learns generalizable draft states rather than memorizing specific game outcomes. No distribution shift between training and inference.


We're not done yet. The model is still training, and a part 2 should be out soon.


Takeaways

At the end of the day, building this system wasn't really about climbing the Dota 2 ranked ladder. If I'm being honest, I only play occasionally these days, usually just queuing up for social matches with friends. Building a Pre-LN Transformer with cross-team attention to optimize our casual pub games is, by any reasonable metric, completely over-engineered.

But that was exactly the point. The joy of this project wasn't just in the game; it was in the architecture. It was about designing a system that respects the interconnected mechanics of the domain, building a robust pipeline for the edge cases, and wrangling the math into submission. The next step is to wrap the model and serve it locally as an inference API endpoint on my desktop GPU, just to complete the overkill.

Having this neural network won't guarantee us a win the next time my friends and I queue up. But at the very least, when we inevitably draft a terrible lineup, I'll have the statistical proof to mathematically confirm it before the game even starts.