nº 04 · accra · ghana
derrick o'tekton
gr · τέκτων · builder
← all entries
22.03.26 · note · #retrieval #papers

colbert, third reading

I read the ColBERT paper for the third time this week and I think I finally understand the trick.

The basic idea is simple once you see it: instead of compressing a whole document into one vector and a whole query into one vector and comparing them, you keep every token as its own vector and compare them piecewise. For each query token, you find the document token that matches it best, and you sum those best-match scores over the query tokens. They call it late interaction.
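The scoring rule above fits in a few lines. A minimal sketch with numpy, assuming unit-length cosine similarity as the token-level match (the function name and shapes are mine, not from the paper):

```python
import numpy as np

def late_interaction_score(Q, D):
    """Late-interaction (MaxSim-style) scoring sketch.

    Q: (num_query_tokens, dim) query token embeddings
    D: (num_doc_tokens, dim) document token embeddings

    For each query token, take the cosine similarity to its
    best-matching document token, then sum over query tokens.
    """
    # normalize rows so plain dot products are cosine similarities
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Qn @ Dn.T              # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum() # best doc token per query token, summed
```

The max is what makes the comparison "piecewise": each query token is free to align with a different part of the document, which a single pooled vector cannot express.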

The first time I read this I thought "that's a lot of vectors." It is. A million-document corpus where each document is 200 tokens means you're storing 200 million vectors instead of a million. The whole paper is, in a sense, an argument that this is fine if you do it right.
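The "a lot of vectors" worry is easy to put a number on. A back-of-envelope calculation, assuming (these parameters are my assumptions, not stated in the post) 128-dimensional token embeddings stored at 2 bytes per dimension:

```python
# rough index size for per-token storage
docs = 1_000_000
tokens_per_doc = 200
dim = 128            # a common ColBERT embedding width (assumption here)
bytes_per_dim = 2    # e.g. float16

total_vectors = docs * tokens_per_doc                  # 200 million
total_bytes = total_vectors * dim * bytes_per_dim
print(f"{total_vectors:,} vectors, ~{total_bytes / 1e9:.0f} GB")
```

Tens of gigabytes, not terabytes, which is roughly the paper's point: big, but manageable if you engineer the index for it.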

Why I keep coming back

The reason I keep returning to this paper isn't the architecture. It's the philosophy. Most retrieval research in 2020–2024 was a race to make a single vector hold more meaning. ColBERT goes the other way: don't compress, just store more. Make the index bigger. Trust that storage is cheap and that the right comparison at query time matters more than the cleverness of the embedding.

I think about this whenever I'm about to do something clever. What if I just stored more?

The thing nobody talks about

The thing that nobody talks about with ColBERT is how much it cares about the tokenizer. The whole approach falls apart if your tokenizer chops words in places where the model has no semantic anchor. This is fine for English. It is, in my experience, much less fine for low-resource languages where the tokenizer was trained on three CommonCrawl dumps and a hope.

Something to come back to in a fourth reading.

otekton.dev · est. 2026 · derrick@dataversegh.com