Harnessing Vision Models to Unlock the Power of Visual Imagery in Legal Tech (and Wine Reviews)

At the weekend I was visiting my parents, to watch the FA Cup Final with my dad, before the match he asking about what I've been doing at work, I said it's been "AI, AI and AI recently" and decided to show him the new GPT-4 Omni model, I grabbed a near by newspaper, took a picture of some wine featured in a review section

Now as a human we can quickly decipher that the name of the Wine along with origin is in bolded red, followed by bolded black for the alcohol percentage, italicised is the retailer, followed by the price and then a description. However pass this to a classic OCR tool and you'd get "Asconi Moscato Rosé, 2023 Specially Selected, 2023 Agapi by Kintonis, 2023 Abacus Cotes de" which is useless.

So I sent the image to GPT-4 Omni and asked "Turn this into JSON" and I got back:

A perfectly OCR'd response, appropriately structured - so a vision model can describe the contents of an image in detail, converting visual data into text that can then be read aloud by screen readers in a more human way - this opens up new possibilities for inclusivity and accessibility, ensuring that everyone can interact with and benefit from visual information.

That's great "but how does this impact legal tech" I hear you say, well let's go into another example, better understanding of documents.

Let's take this lease agreement,

Asking Omni to convert this into JSON, noting any handwritten text gives me:

We now have a structured version of this front page, along with insights into the content and potential flags that a human can check such as the handwritten date - we could go deeper into who wrote what by maintained a history of the handwriting types, colour of pen - across the document to get even more detail out of a document.

Another example? Okay. Annual reports - often full of people high fiving each other, looking over fields and visualisations that don't translate well through classic document review tools. Let's take this example from a Volvo annual report.

Using a standard OCR tool I get:

Clear opportunity for growth
• Capturing industry growth • Untapped service potential • Electric vehicles and machines — higher sales value — increased market shares — increased service contract penetration and duration • Autonomous solutions • Energy services • Digital services
Sales CAGR %
pre-covid

Yesterday Today
Ambition for tomorrow

10% Operating margin.

Erm, sort of useful - but where's the weird graph content - where's the context?

If I ask Omni "What is this trying to convey?" and I get:

The image is a business presentation slide titled "Clear opportunity for growth." It aims to convey potential areas for future growth and the ambition for increasing sales and operating margins. Here are the key points and elements it highlights:

Growth Opportunities:

Capturing industry growth: Emphasising the potential to leverage overall industry growth.

Untapped service potential: Highlighting unexplored areas in service offerings.

Electric vehicles and machines: Stressing the importance of:

... cut for brevity

The graph compares sales (CAGR %) and operating margins from "Yesterday" (historical data) to "Today" (current status) and shows an upward trend aiming for "Ambition for tomorrow."

Yesterday: Pre-COVID, sales CAGR was approximately 3%.

Today: Indicates current growth status.

Ambition for tomorrow: The goal is to achieve sales growth of over 10% in the future.

The overall message of the slide is that there are significant growth opportunities in various areas such as industry growth, service potential, electric vehicles, autonomous solutions, energy, and digital services. By tapping into these opportunities, the company aims to achieve higher sales growth and improve operating margins in the future.

We get a much richer set of information from the same data, it's interpreting the intent of the graph so if we wanted to find references to the sales growth intention we could, compared to the classic OCR where we'd never find that intent.

The integration of vision models in to our legal tech workflows would be a significant advancement in how visual data is processed and utilised. These models not only make visual information accessible to a broader audience but also enhance the searchability and analysis of complex visual content - like annual reports, leases or reviews of some rosé wine.