Point of View

GenAI could destroy the value of proprietary data


The emerging capabilities of generative AI (GenAI) threaten to destroy the value of proprietary data. Studies using the open-data-sourced GPT-4 have shown it can outperform Bloomberg’s specialized large language model (LLM) on various financial text analysis tasks. Every enterprise leader must consider what this means for their data strategy.

For decades, firms have worked hard on data capture, ownership, management, storage, security, and privacy. Enterprises organize around data, are paranoid about missing opportunities with it, and live in fear of breaches of it. Owning data continues to be so front-and-center in our thinking that we can forgive enterprise leaders for forgetting what all this data is for. It is, of course, to provide insight to enhance business capabilities.

Without a bank of proprietary data about our customers, it’s hard to think about managing a customer relationship. Similarly, your live and historical data guides your pricing strategies, logistics, and financial planning.

Why should you own data if you can get better performance from open models?

GenAI may force us to rethink our relationship with data. Open-data models ingest their training data from across the publicly available internet. Suppose a more capable LLM trained on open data can deliver better outcomes than a private LLM using proprietary data. In that case, we must ask why we would persist with private models and all the hassle and cost of data ownership.

For example, Bloomberg built a 50-billion-parameter LLM from scratch to enhance its natural language processing (NLP) applications for sentiment analysis, news classification, and answering client questions. It draws on a 363-billion-token dataset of English-language financial documents collated over forty years, which Bloomberg combined with a 345-billion-token public dataset. This is no small language model.

Queen’s University researchers found GPT-4 outperformed BloombergGPT on a range of financial tasks

Despite the scale of Bloomberg’s endeavor, one of the first studies to explore how far generically trained LLMs have advanced on financial text analytics, conducted by Queen’s University, Kingston, Canada, found GPT-4 performed better on a range of financial NLP tasks. The paper, available for download, details how GPT-4 can outperform a domain-specific pre-trained model (i.e., BloombergGPT) on many of them.

As Ethan Mollick, Associate Professor at the Wharton School, points out, this comparison uses the old, pre-turbo version of GPT-4. It is fair to note that GPT-4 did not outperform BloombergGPT on every comparison, but its superior performance in many tests indicates the rising challenge posed by smart generalist models. The likes of GPT-4, Google’s PaLM 2, and Meta’s Llama 2 are only growing more capable.
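
To make the comparison tangible, the sketch below shows how a generalist model can be probed zero-shot on a financial sentiment task of the kind covered in such benchmarks. This is an illustrative sketch only, not the researchers’ protocol: it assumes the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and an invented headline.

```python
# Illustrative sketch, not the Queen's University evaluation protocol:
# probe a generalist model zero-shot on a financial sentiment task.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

headline = "Retailer X cuts full-year guidance despite record holiday sales"

response = client.chat.completions.create(
    model="gpt-4",  # the study compared a pre-turbo GPT-4; any generalist model could be swapped in
    messages=[
        {
            "role": "system",
            "content": (
                "Classify the sentiment of the financial headline as "
                "positive, negative, or neutral. Reply with one word."
            ),
        },
        {"role": "user", "content": headline},
    ],
    temperature=0,  # keep the classification output stable
)

print(response.choices[0].message.content)  # e.g., "negative"
```

The point is not this snippet’s accuracy but that a few lines of prompting against a generalist model can now compete with a purpose-built, domain-trained system.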

As generalist LLM capability improves, the value of proprietary data diminishes

As generalist LLM capability improves, the value of proprietary data diminishes. Suppose you want to predict what my next purchase is likely to be. The traditional approach would be to capture as much data as possible, derive insight from it, make a decision based on those insights, and trigger whatever action your particular CRM has identified as the best route to reel me in. Facebook, for instance, uses the wealth of data it has captured about me over the years to pitch me contextual advertisements based on the profile its data models build of my habits.

What if you could predict customer purchases with a more capable public-data-fed generalist LLM that needed, for example, just three publicly available data points? If, by virtue of its greater capability, the generalist LLM delivers a more accurate outcome, why do we need our proprietary data? Maybe using the public-data-fed generalist LLM costs a little more—but by the time we strip out our costs for proprietary data capture, storage, management, and other necessities, maybe it doesn’t.
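
As a thought experiment, here is a hedged sketch of that scenario: a generalist model asked to predict a likely next purchase from just three publicly available signals. The signals, prompt, and model name are all hypothetical, and the same openai SDK assumptions apply.

```python
# Hedged sketch of the thought experiment: predict a likely next purchase
# from three publicly available signals, with no proprietary CRM data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Three illustrative public signals -- entirely hypothetical.
public_signals = {
    "recent_public_review": "Loved the trail-running shoes, but they wore out fast.",
    "public_social_bio": "Weekend ultrarunner, coffee nerd, Pacific Northwest.",
    "local_context": "Seattle, early spring (wet trail season).",
}

prompt = (
    "Based only on these public signals, suggest the single product this "
    "person is most likely to buy next, with a one-sentence rationale:\n"
    + "\n".join(f"- {key}: {value}" for key, value in public_signals.items())
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any capable generalist model
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

Whether this beats your CRM-driven pipeline is an empirical question; the comparison to run is accuracy and total cost against your existing proprietary-data stack.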

Mollick says the Bloomberg case is not isolated. “It is part of a pattern. The smartest generalist frontier models beat specialized models in specialized topics. Your special proprietary data may be less useful than you think in the world of LLMs.”

GenAI should prompt you to ask difficult questions about your data strategy

If open data LLMs can outperform specialized models on specialized topics, leaders must challenge themselves with some hard questions. As the HFS OneOffice™ data cycle we set out three years ago makes clear (see Exhibit 1), everything in the enterprise still starts and finishes with data. The question is, does it matter who owns that data if the insights delivered provide the outcomes you desired? And if you no longer need to hold data, what does that mean for your commitments to data acquisition, storage, security, management, and associated resources?

Exhibit 1: Leaders must embed GenAI in their approach to the HFS OneOffice™ data cycle

Source: HFS Research, 2024

Of course, it would be unwise to suddenly start sharing customer data, internal communications, financial information, and other sensitive and confidential information with any third party without the proper controls. Providers such as Microsoft, through Azure OpenAI, offer controls that can address these issues. As the leading LLMs get ever more capable, we must ask whether there could be a point at which adding your proprietary data no longer positively impacts performance. At what point does the purpose of our data estate change from managing proprietary data to managing openly available data, or a useful combination of both?
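
To illustrate what such controls can look like in practice, the sketch below routes the same kind of call through a private Azure OpenAI deployment rather than a public endpoint, keeping prompts and data within enterprise-governed infrastructure. The endpoint, deployment name, and API version are placeholders, and the openai Python SDK (v1+) is assumed.

```python
# Minimal sketch: the same kind of call routed through a private
# Azure OpenAI deployment so prompts stay behind enterprise controls.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource-name.openai.azure.com",  # placeholder endpoint
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # check the versions your tenant supports
)

response = client.chat.completions.create(
    model="your-gpt4-deployment",  # your deployment name, not a public model ID
    messages=[{"role": "user", "content": "Summarize this quarter's key risk factors."}],
)

print(response.choices[0].message.content)
```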

Business leaders must be concerned about the risks of handing more of their enterprise’s critical drivers to the likes of Google and Microsoft. But if the business outcomes improve, the benefits may be hard to resist.

The HFS OneOffice™ data cycle is as valid as ever—provided you update your approach by embedding GenAI

Exhibit 1 illustrates how the first step in your data cycle must always be obtaining the data required to win in your market. The only thing GenAI adds is the offer of a new and potentially better-performing source for that data. In step two, you must include the possibility of LLMs when you rethink your processes. In step three, include GenAI in how you design your new operational flows in the cloud before step four, where you automate as many rethought processes as possible. GenAI drives and derives insight in step five, where AI is applied to data flows. The outcomes—within human-set strategic guidelines—feed back into step one to identify how and where to source the data to win in your market, and the cycle repeats.

The Bottom Line: Build GenAI into your data strategy to protect against a future in which the value of your proprietary data may collapse.

The GenAI journey is rapidly accelerating, and new capabilities are always emerging. As the power and capabilities of the leading-edge LLMs increase, there are two essential items on your to-do list:

  1. Revisit your approach to the HFS OneOffice™ Data Cycle in light of the emergence of GenAI. Your strategy is still to refine your use of data and its insights. How does GenAI change your tactics?
  2. Understand the risk to your assumed value of proprietary data. Are you placing a bottom-line value on what you have? Does that remain valid? What will your business look like if that value collapses?
