The Margin Notes Were the Real Dataset All Along

I was reading a piece from someone who managed to test what looks like an early release version of Gemini 3, and it was super interesting. The model handled two problems that have held back digital work in law for years. First, it could read messy handwriting with surprising accuracy. Second, it seemed able to understand what was written even when the source was old, inconsistent or only half legible.

Put those two together and the implications for the legal sector become quite amazing.

Firms have always focused on their future. New tools, new workflows, new billable models. Yet the real treasure is the paper archive that has been building for decades. Marginalia (I discovered this wonderful word while reading the article) from partners who are long retired. Post-it notes that shaped negotiations. Scanned copies of signed contracts covered in handwritten carve-outs. PDF bundles full of scribbles that nobody has ever manually transcribed. All of that has been locked away because no one had the time or the budget to turn it into something searchable.

That problem may not be a problem for much longer.


Every firm holds a shadow dataset it can barely use: a pile of scanned documents with handwritten commentary. Some of it is trivial, but some of it reflects judgement calls, and a fair amount of it is the only surviving record of why a deal took a particular shape. You cannot train modern legal AI on this material because you cannot extract it cleanly. You cannot mine it for patterns because you cannot search inside it. And you definitely cannot trust it, because the handwriting is often illegible even to the human eye.

If these new models do what early examples suggest, that entire corpus becomes available. Not as fuzzy OCR output, but as clean, structured and reasoned data. The machine can read a lawyer’s scribble, link it to a clause, interpret a correction, understand a reference, and express it in text that fits into contract analysis tooling.

For the first time you get access to the thinking that happened between versions, not just the version that ended up filed.


Risk teams finally get the visibility they have always wanted

A surprising amount of contract risk hides in handwritten annotations. A partner might agree something with a client in the margin. A number may have been changed during negotiations but never passed through to a final typed version.

Most of that has been invisible. It usually only comes to light during a dispute, when someone reopens the original files and finds a marginal note that changes the interpretation of an obligation.

Automated handwriting recognition with reasoning changes the risk equation. It means firms can scan entire paper archives, surface inconsistencies, attach margin notes to the correct provisions and flag places where handwriting contradicts typed text. You move from reactive discovery to proactive risk management.
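
To make the "handwriting contradicts typed text" check concrete, here is a minimal sketch in Python. It assumes an upstream extraction step has already transcribed each note and attached it to a clause reference. The Annotation record, its field names and the flag_contradictions helper are all hypothetical, and comparing figures is only one narrow kind of contradiction.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    """One handwritten note, as a hypothetical extraction step might return it."""
    document_id: str
    clause_ref: str  # clause the note was attached to, e.g. "7.2"
    text: str        # transcribed handwriting


# Matches figures such as 250,000 or 1,500,000.50
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Return the figures found in text, with thousands separators removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def flag_contradictions(clauses: dict[str, str], notes: list[Annotation]) -> list[Annotation]:
    """Return notes that mention a figure absent from the typed clause they refer to."""
    flagged = []
    for note in notes:
        typed = clauses.get(note.clause_ref, "")
        handwritten = numbers_in(note.text)
        # A note with no figures is ignored; a note whose figures all
        # appear in the typed clause is treated as consistent.
        if handwritten and not handwritten <= numbers_in(typed):
            flagged.append(note)
    return flagged


clauses = {"7.2": "Liability is capped at 500,000 pounds."}
notes = [
    Annotation("doc1", "7.2", "cap agreed at 250,000"),
    Annotation("doc1", "7.2", "check 500,000 with client"),
    Annotation("doc1", "7.2", "ok"),
]
flagged = flag_contradictions(clauses, notes)
```

Only the first note is flagged here, because 250,000 appears in the handwriting but not in the typed clause. A real pipeline would need far more than number matching, but the shape of the output, a short list of notes for a human to review, is the point.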

For large matters or repeat client relationships, this is a biggie. You do not just know what the contract said; you can see how it evolved.


Law firms talk a lot about training their own models, but most models are trained on final documents and drafted language. Very little training data captures how lawyers think: how they correct clauses, or how they evaluate a contract in motion rather than at the end of the process.

Handwritten markups are one of the best signals of real judgement. They show what lawyers notice first, or what they deem immaterial. They reveal patterns far more useful for legal AI than finalised boilerplate.

With reliable extraction you suddenly get tens of thousands of examples of real decision making. Not simulated tasks or post hoc explanations, but raw reasoning from human lawyers as they work. That sort of dataset has never been accessible at scale.

It will not replace human judgement, but it helps firms build more aligned and grounded AI systems. Models that understand how a lawyer actually behaves when they fix a clause, not how a textbook describes the problem.


New opportunities for knowledge and precedent teams

Knowledge teams spend an extraordinary amount of time trying to reconstruct why a clause changed over time. They usually rely on version comparisons, conversations with partners and educated guesses. Margin notes would solve most of those puzzles if they were searchable.

Now we are in a position to run a query across the entire archive asking for:

  • every handwritten carve-out that limited liability in tech services agreements.
  • every margin note where a partner flagged a confidentiality provision as non-standard.
  • every manual correction that moved a date or threshold.
  • maybe the occasional bit of profanity...
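
Once the notes exist as structured records, archive queries like these become simple filters. The sketch below is only illustrative: the MarginNote fields (agreement_type, clause_topic, kind) and the query helper are assumptions about how an extraction pipeline might label its output, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarginNote:
    """A hypothetical structured record for one extracted handwritten note."""
    document_id: str
    agreement_type: str  # e.g. "tech services"
    clause_topic: str    # e.g. "liability", "confidentiality"
    kind: str            # e.g. "carve-out", "flag", "correction"
    text: str


def query(notes, **criteria):
    """Return the notes whose fields equal every keyword criterion given."""
    return [
        note for note in notes
        if all(getattr(note, field) == value for field, value in criteria.items())
    ]


archive = [
    MarginNote("A-101", "tech services", "liability", "carve-out", "excl. data loss"),
    MarginNote("A-102", "tech services", "confidentiality", "flag", "non-standard, check"),
    MarginNote("B-201", "lease", "liability", "carve-out", "excl. fixtures"),
]

# Every handwritten carve-out that limited liability in tech services agreements
hits = query(
    archive,
    agreement_type="tech services",
    clause_topic="liability",
    kind="carve-out",
)
```

In practice the records would sit in a search index or database rather than a Python list, but the hard part is the extraction and labelling, not the query.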

You do not just enrich precedents. You map the lived history of practice and turn qualitative experience into structured knowledge.

This is the kind of insight firms have always claimed they can extract but have never truly been able to deliver.


Due diligence becomes much more defensible

A lot of the time spent in due diligence is basic manual work: reading scan after scan of old agreements, checking whether handwritten amendments were ever incorporated, and confirming that small edits were not missed across versions.

If these models work as shown, the first pass of due diligence shifts from human eyes to machine extraction. You still need human judgement, but the grind shrinks. You get clean lists of margin notes, corrections, clause references and potential mismatches. Reviewers focus on evaluating risk rather than finding the risk in the first place.

This also matters for defensibility. You can show how you reached your conclusions. You can point to every extracted annotation. You create a consistent record of what was processed and why.
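
One way to picture that record: for each scan, store a fingerprint of the exact file that was processed next to what was extracted from it. This is a rough sketch under my own assumptions. The audit_record function, its field names and the model_version label are hypothetical; only hashlib and json come from the Python standard library.

```python
import hashlib
import json


def audit_record(document_id: str, scan_bytes: bytes,
                 annotations: list[str], model_version: str) -> str:
    """Return one JSON line tying extracted annotations to the exact scan processed."""
    record = {
        "document_id": document_id,
        # The SHA-256 digest changes if the scan is altered or replaced later.
        "scan_sha256": hashlib.sha256(scan_bytes).hexdigest(),
        "model_version": model_version,
        "annotations": annotations,
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    "A-101",
    b"placeholder scan contents",
    ["cl. 7.2: cap agreed at 250,000"],
    "extractor-v0",
)
```

Appending one such line per document to a log gives reviewers something to point to when asked which file was read and what came out of it.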


Legal AI has been constrained by the quality of the data it can reach. Firms have always had more knowledge in their archives than in their systems, but they have never been able to unlock it.

Most law firm innovation has focused on what lawyers will do tomorrow. The bigger opportunity lies in what lawyers already did, documented, corrected and scribbled over the last forty years.

If this technology matures, the firms that act early will gain an advantage that is very difficult to replicate. Everyone can buy the same models. No one can buy your historical margin notes.

Your archive becomes your differentiator.