Protecting Proprietary Data Rights in the AI Era

Kathryn Shih

April 2, 2024

New risks to proprietary company data are emerging as generative AI usage becomes a common practice for startups, enterprises, and consumers alike. Today, third-party partners and customers may be able to train AI models using proprietary but user-accessible data. 

As Forgepoint Capital’s Entrepreneur in Residence and having worked extensively at the intersection of cybersecurity, AI and generative AI, I know firsthand how much is at stake for innovators whose unique datasets power their products and services. As legal frameworks around acceptable data use in AI models continue to evolve, companies must stay ahead of the curve to protect their data.  

To meet the challenge head-on, I worked with Mercy Caprara, Head of Portfolio Operations at Forgepoint, to develop a set of recommendations to help our portfolio companies protect their data from unauthorized use in AI systems. Today, I’m excited to share our insights with the broader Forgepoint community.  

What makes data valuable for AI model training?   

High-quality data is one of the keys to AI model quality and performance. Companies own valuable data that external users can leverage to train independent AI models and replicate proprietary capabilities.

To understand the risks at play, it’s important to consider what makes a dataset valuable for AI training: 

  • The dataset contains proprietary examples of successfully completed, hard-to-perform tasks: For example, security datasets that contain examples of detecting malicious behavior in networks, code, or other contexts. 
  • The dataset contains valuable, non-public trends: This is typical of data generated in a domain with a complex, high-value prediction task. For example, datasets generated while managing fraud or risk indicators that can be used to better predict outcomes.  
  • The dataset represents an operational process with superior capabilities or efficiencies that is not public knowledge: Users can infer a process when datasets either directly document the process or describe a substantial number of individual process executions. This is primarily a risk for service businesses that have achieved best-in-class scaling, as their datasets convey a competitive benefit. 

Data harvesting 101 

While most datasets are not made fully public, users can often access small amounts of valuable data during an interaction. For example, a user may have access to a tool that allows them to see a single risk indicator. If they are able to use the tool repeatedly, they can access many different indicators and slowly reveal a meaningful portion of the dataset, extracting and storing data to use for AI model training. This is called “harvesting” and is typically performed by repeatedly querying APIs or product features that are backed by proprietary data.
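To make the mechanics concrete, here is a minimal sketch of how harvesting works. The per-entity lookup endpoint, dataset, and entity IDs are all hypothetical stand-ins for a product feature backed by proprietary data; the point is that many individually legitimate-looking queries accumulate into a training corpus.

```python
# Hypothetical proprietary dataset backing a product feature: each entity
# maps to a single risk indicator a user can view one at a time.
_PROPRIETARY_DATA = {f"entity-{i}": {"risk_score": i % 100} for i in range(500)}

def lookup_indicator(entity_id: str) -> dict:
    """Simulated per-entity API endpoint: returns one record per call."""
    return _PROPRIETARY_DATA[entity_id]

def harvest(entity_ids):
    """Repeatedly query the endpoint and accumulate the responses --
    the core of a harvesting workflow."""
    corpus = {}
    for entity_id in entity_ids:
        corpus[entity_id] = lookup_indicator(entity_id)
    return corpus

# A patient caller can reconstruct a large share of the dataset,
# one legitimate-looking request at a time.
harvested = harvest([f"entity-{i}" for i in range(500)])
print(len(harvested))  # → 500
```

Each call here is indistinguishable from normal product usage; only the aggregate pattern reveals the harvesting, which is why the logging recommendation below matters.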

The legal status of data harvesting for model improvement remains unresolved, despite the high value of data in AI service development. By default, customers and partners may be legally able to harvest company data to develop or train their own AI models.

Taking action to protect proprietary data 

Fortunately, companies have a few options to promote permissible behaviors and protect their data.

1. Update the language in ToS (Terms of Service) agreements and MSA (Master Service Agreement) contracts 

Companies can limit proprietary data usage through policies in ToS agreements on their websites and in the MSA contracts they use to sell services. By updating these documents to explicitly reserve all AI data training rights, companies can maintain a competitive advantage in their own AI models while also monetizing data for use in non-competitive partner offerings.  

Though existing MSA and ToS language may already protect against the use of services to develop competitive or substitute products or services, this is often not enough protection in cases of AI model training. Company data may have value in AI use cases which do not directly compete with or substitute for existing product lines. As such, it’s critical to update MSA and ToS language with new policies focused on AI use cases.  

2. Log repeated access to proprietary data 

Since harvesting involves repeated querying to gather data, companies should log repeated access to high-value data and retain those logs in long-term storage. These logs can provide evidence of suspicious user activity, a ToS violation, or a contract breach.
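One way to implement this is an append-only access log paired with a sliding-window counter that flags unusually high query volume. This is a minimal sketch, not a production design: the class name, thresholds, and in-memory storage are illustrative assumptions, and a real deployment would persist events to durable storage and tune thresholds to its own traffic.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds -- tune these to your own traffic patterns.
WINDOW_SECONDS = 3600
MAX_QUERIES_PER_WINDOW = 100

class HighValueAccessLog:
    """Append-only log of access to high-value records, with a simple
    sliding-window counter to flag potential harvesting."""

    def __init__(self):
        self.events = []                   # long-term evidence trail
        self._recent = defaultdict(deque)  # user_id -> recent access timestamps

    def record(self, user_id, record_id, now=None):
        """Log one access; return True if the user's recent volume
        exceeds the threshold and warrants review."""
        now = time.time() if now is None else now
        self.events.append({"user": user_id, "record": record_id, "ts": now})
        window = self._recent[user_id]
        window.append(now)
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_QUERIES_PER_WINDOW

log = HighValueAccessLog()
flags = [log.record("partner-42", f"indicator-{i}", now=float(i)) for i in range(150)]
print(flags.count(True))  # → 50
```

The full event list in `self.events` is what supports a later ToS or contract claim; the window counter only decides when a human should look.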

Sample ToS and MSA language  

To guide your proprietary data protection efforts, we have developed sample language for a policy change that could be added to an existing ToS agreement or adapted for MSA documents.

Note: when implementing policy changes, use custom language that reflects the specific risks and needs in your business.

By accessing and using this website and/or our services, you agree not to use, copy, or extract any part of the content (“Content”), including but not limited to text, images, data, and code, for the purpose of training artificial intelligence systems, machine learning models, or any other form of data analysis software, without our express written permission, including, but not limited to, scraping, data mining, and the use of any automated or manual process to capture or compile content for the purposes mentioned above. We reserve the right to, and you have no right to, reproduce and/or otherwise use Content in any manner for purposes of training artificial intelligence technologies to generate text or other output, including without limitation, technologies that are capable of generating works in the same style or genre as the Content, unless you obtain our specific and express permission to do so. Nor do you have the right to sublicense others to reproduce and/or otherwise use Content in any manner for purposes of training artificial intelligence technologies to generate text or other output without our specific and express permission. Unauthorized use of Content for artificial intelligence training or related purposes is strictly prohibited and will be considered a breach of these Terms of Service, which may result in immediate termination of your access to the website and may lead to legal action for copyright infringement and other remedies as permitted by law.

*Disclaimer: The recommendations and sample language in this blog do not constitute legal advice and are general information meant for educational purposes. Consult a legal professional before updating or modifying your company’s MSA or ToS.*