How AI Companies Scrape & Sell Your Data (2026)

AI companies are not just scraping copyrighted content; they are also scraping, aggregating, and selling personally identifiable information.

The public debate about AI data practices has focused almost entirely on copyrighted content: whether AI companies scraped books, news articles, and code to train their models. This debate, while important, has largely missed a more immediate and personal threat: AI companies are also scraping, aggregating, and in some cases selling personally identifiable information about individuals. This guide covers how this happens and what you can do about it.

Key Takeaways

Data broker profiles are licensed to AI companies as training datasets: your WhitePages or Spokeo profile may appear not just on those consumer sites but in AI training data used by thousands of downstream applications
AI-powered identity resolution is the most significant new threat: systems from LexisNexis, TransUnion TLO, and others now probabilistically link previously unconnected records (maiden names, abbreviated addresses, business vs. home addresses) into unified profiles
Training opt-outs are forward-looking only: once a model has been trained on your data, its influence on the model weights cannot be surgically removed; opt-outs prevent future training use, not past use
Major platforms provide opt-out mechanisms: OpenAI's "Improve the model for everyone" toggle, Google's Gemini Apps Activity control, and Meta's Right to Object form (EU/UK only) all disable future conversation data use for training
Website owners can block AI crawlers via robots.txt directives for GPTBot, Google-Extended, ClaudeBot, and CCBot without affecting standard Google Search indexing
Removing yourself from consumer data broker sites reduces AI exposure at the data-source level, since those profiles feed both consumer people-search features and AI training dataset purchases

The Three Ways AI Companies Handle Personal Data

AI companies interact with personal data in three distinct ways, each with different privacy implications:

1. Training Data Scraping

AI foundation models are trained on large datasets scraped from the public internet. These datasets include public social media posts, forum discussions, news comment sections, blog posts, and public profiles, all of which may contain personal information including names, locations, opinions, and behavioral data.

The privacy implication: Content you posted publicly years ago (a Reddit comment, a public Facebook post, a forum thread) may be incorporated into an AI model's training data and subsequently influence the model's outputs. While your individual post does not give the model a direct "memory" of you, the patterns established by your public data contribute to the model's behavior.

What you can do: For most individuals, training data scraping is not a direct personal privacy threat; it is a diffuse aggregation issue. For people with a significant public online presence (bloggers, forum moderators, public commenters), the training data issue is more substantive.

2. AI-Powered Data Broker Services

This is the most directly relevant category for individual privacy:

Several companies have built AI-enhanced personal data aggregation services that go beyond traditional data broker capabilities. These services use AI for:

Entity resolution: Using machine learning to link multiple records that refer to the same person, even when names are spelled differently, addresses use abbreviations, or other identifiers vary slightly
Data inference: Inferring attributes not directly available in public records, such as income from neighborhood property values, health conditions from behavioral patterns, or political affiliation from media consumption
Pattern recognition: Identifying behavioral patterns from aggregated data that would not be visible in any individual record
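One small piece of entity resolution, handling addresses that "use abbreviations", can be sketched in a few lines. This is a minimal illustration, not any vendor's actual method: the `ABBREVIATIONS` table, the `normalize_address` function, and the sample addresses are all hypothetical, and production systems typically rely on far larger reference data and learned models.

```python
import re

# Hypothetical abbreviation table for illustration only; real systems
# use much larger postal reference data.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "apt": "apartment",
    "n": "north",
    "s": "south",
    "e": "east",
    "w": "west",
}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


# Two made-up spellings of the same address.
a = normalize_address("123 N. Main St., Apt 4")
b = normalize_address("123 North Main Street Apartment 4")

# After normalization both spellings compare equal, so the two records
# can be treated as candidates for the same person.
print(a == b)
```

Once differently formatted records reduce to the same normalized key, linking them becomes a simple lookup, which is why formatting differences alone offer little protection against aggregation.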

Examples include:

LexisNexis uses AI to resolve identities across billions of records
Acxiom uses machine learning to build predictive consumer profiles
LiveRamp and similar data connection platforms use AI to link anonymized data back to identifiable individuals

These are not consumer-facing people-search sites; they are B2B data infrastructure companies. But their AI-enhanced profiles ultimately flow into the consumer-facing sites that show your home address in Google search results.

3. AI Features on People-Search Sites

Some consumer-facing data broker sites have added AI features:

AI-generated profile summaries: Synthesizing a person's public records into a narrative description
AI chat interfaces: Allowing users to "ask questions" about a person's background
Risk scoring: Using AI to generate "reputation scores" or risk assessments from aggregated data (MyLife's reputation score is an early version of this)

These AI features amplify the privacy risk of existing data broker profiles by making the data more accessible and more damaging to reputation.

The AI Training Data Opt-Out Problem

Since late 2023, major AI companies have provided mechanisms to opt out of having your content used for AI training:

OpenAI: Provides a form for individuals to request that their personal information not be used in ChatGPT or to request deletion of personal information from OpenAI's systems.

Google (Gemini/Vertex AI): Google's privacy controls at myaccount.google.com allow users to manage what data is used for AI training. Google has also added web crawl opt-out mechanisms.

Meta AI: Meta allows users to opt out of their public Facebook and Instagram content being used for AI training in some jurisdictions.

Common Crawl: The Common Crawl dataset (a massive public web crawl widely used for AI training) has an opt-out mechanism at commoncrawl.org for website owners.

These opt-outs are imperfect and often do not apply retroactively to already-processed training data. They primarily affect future training runs.
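For website owners, the crawl-level opt-out usually comes down to robots.txt rules for the crawler tokens named in the key takeaways (GPTBot, Google-Extended, ClaudeBot, CCBot). The sketch below builds such a file and checks it with Python's standard `urllib.robotparser`. The helper `build_robots_txt` and the `example.com` URL are assumptions made for illustration, and the site-wide `Disallow: /` is just one possible policy.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens as named in this guide.
AI_CRAWLER_TOKENS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]


def build_robots_txt(tokens):
    """Return robots.txt text that disallows the whole site for each token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(AI_CRAWLER_TOKENS)

# Parse the generated rules locally instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/about"  # placeholder URL
blocked = {token: not parser.can_fetch(token, page) for token in AI_CRAWLER_TOKENS}

# Googlebot has no matching group, so standard search crawling stays allowed.
search_allowed = parser.can_fetch("Googlebot", page)

print(robots_txt)
print(blocked, search_allowed)
```

Keep in mind that robots.txt is advisory: it only affects crawlers that choose to honor it, and, like the other opt-outs in this section, it does not remove pages that were collected before the rules were added.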

Data Brokers as AI Training Data Sources

A critical and underreported connection: data broker databases are sold to AI companies as training datasets. The profiles that WhitePages, Spokeo, and BeenVerified compile on you do not just stay on those consumer websites; they are licensed to:

AI companies building foundation models that need "grounding" data about real people
AI companies building search and verification services
B2B data infrastructure companies that sell enriched datasets to other AI applications

This means your data broker profile has a second exposure vector: it may appear in AI training data that propagates your personal information into AI systems used by thousands of companies.

What Removing Yourself from Data Brokers Does for AI Privacy

What it addresses:

Consumer people-search sites that AI systems might query for real-time information about you
Data broker profiles that might be included in future AI training dataset purchases
Profile visibility on AI-powered search features

What it does not address:

AI models already trained on data that included your information
Data broker records that have already been sold to AI companies as training data before your opt-out
Social media content you have already posted publicly

The Emerging Threat: AI-Powered Identity Resolution

The most significant AI-specific privacy development in 2025–2026 is not LLM training data; it is AI-enhanced identity resolution at scale.

Traditional data brokers match records based on exact or near-exact field matches: same name, same zip code, same date of birth. This approach produces false positives (two people named John Smith) and false negatives (records where a name is abbreviated, a maiden name is used, or an address is formatted differently).

AI-powered identity resolution systems solve this problem using probabilistic matching models trained on billions of records. These systems can link:

A voter registration entry using "Jen" with a mortgage record using "Jennifer"
A LinkedIn profile listing "New York, NY" with a county court record from Nassau County
A phone number registered to a business address with the owner's home address two miles away

The result is identity graphs with dramatically fewer gaps than traditional relational database approaches. Companies including LexisNexis Risk Solutions, TransUnion TLO, and Verecor operate AI-enhanced identity resolution at this scale commercially.
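The difference between exact-field matching and probabilistic matching can be illustrated with a toy scorer. This is a sketch under stated assumptions, not how any named company's system works: the `Record` fields, the `NICKNAMES` table, the hand-picked weights, the 0.85 threshold, and the sample records are all hypothetical, and real systems learn their parameters from large labeled datasets.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical nickname table; real systems learn such equivalences from data.
NICKNAMES = {"jen": "jennifer", "jenny": "jennifer", "bill": "william"}

MATCH_THRESHOLD = 0.85  # illustrative cut-off, chosen by hand


@dataclass(frozen=True)
class Record:
    first_name: str
    last_name: str
    zip_code: str
    birth_year: int


def canonical(name: str) -> str:
    """Map a first name to its canonical form if it is a known nickname."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] using the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a: Record, b: Record) -> bool:
    """Traditional approach: every field must be identical."""
    return a == b


def match_score(a: Record, b: Record) -> float:
    """Weighted evidence that two records describe the same person."""
    first = similarity(canonical(a.first_name), canonical(b.first_name))
    last = similarity(a.last_name, b.last_name)
    same_zip = 1.0 if a.zip_code == b.zip_code else 0.0
    same_year = 1.0 if a.birth_year == b.birth_year else 0.0
    return 0.3 * first + 0.3 * last + 0.2 * same_zip + 0.2 * same_year


# Made-up records for illustration.
voter = Record("Jen", "Alvarez", "11530", 1985)
mortgage = Record("Jennifer", "Alvarez", "11530", 1985)
stranger = Record("John", "Smith", "90210", 1985)

print(exact_match(voter, mortgage))                      # exact matching misses the link
print(match_score(voter, mortgage) >= MATCH_THRESHOLD)   # scored matching finds it
print(match_score(voter, stranger) >= MATCH_THRESHOLD)   # unrelated record stays unlinked
```

The scored approach links the "Jen" and "Jennifer" records that exact matching treats as different people, while a shared birth year alone is not enough to link an unrelated record.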

Why this matters for individuals: A profile that was previously broken into unlinked fragments (your maiden name record separate from your married name record, your business address separate from your home address) may now be correctly linked and unified by AI resolution systems. The aggregated profile that emerges is more complete and more accurate than what existed before.

What opt-outs accomplish in this environment: Removing your profiles from consumer-facing people-search sites removes the most accessible tier of the identity graph. The underlying B2B identity resolution systems are harder to reach directly. But people-search sites are often the final output layer, so removing your profiles from them reduces downstream accessibility even when the underlying data infrastructure persists.

The "Right to Be Forgotten" in the AI Context

Several European legal challenges have raised the question of whether individuals have a right to have AI systems "forget" their personal information. Under GDPR Article 17, individuals have a right to erasure, but how this applies to AI systems that have already incorporated data into model weights is legally unresolved.

For US residents, there is no equivalent federal right to erasure from AI systems. California's CPPA is actively working on AI-specific regulations, but as of 2026, these are not yet final.

Practical Steps to Reduce Your AI Data Exposure

1. Remove yourself from consumer data broker sites

This is the most impactful action because it addresses the primary pipeline feeding both consumer profiles and AI training datasets. OfflistMe covers 500+ data brokers for $7.00 one-time. Start here.

2. Submit opt-out requests to major AI companies

OpenAI privacy request form at privacy.openai.com
Google AI training opt-out at myaccount.google.com
Meta AI opt-out through your Facebook/Instagram privacy settings

3. Opt out of marketing data sharing

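Marketing data, the behavioral profiles built from your online activity, is a major input for AI personalization systems. Use your device's privacy settings (on iOS, the tracking controls under Settings > Privacy & Security > Tracking; on Android, the Privacy Dashboard) to limit data sharing. Apple's Privacy Nutrition Labels in the App Store are not a setting, but they show what data an app collects before you install it.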

4. Reduce your public records footprint

Minimize how much you add to public records through USPS change-of-address forms, public social media, and other sources that feed data broker and AI training pipelines.

Frequently Asked Questions

Can I demand that an AI company delete my personal information from its model?

For companies subject to GDPR (operating in Europe), yes: you can submit an erasure request, and the company must comply to the extent technically feasible. For US-based companies with US-only operations, there is no equivalent legal obligation as of 2026. However, you can submit opt-out requests for future data use, and some companies will voluntarily accommodate individual erasure requests.

Does ChatGPT "know" personal information about me?

GPT-4 and similar large language models are trained on public internet data and may have been trained on public content you created. They do not have a searchable record of you specifically; rather, the training data may have influenced their general knowledge. ChatGPT and similar models cannot look up private data about you in real time unless specifically connected to a data source.

Are AI-powered people-search features legal?

AI-enhanced people-search services face the same legal framework as traditional data brokers: FCRA restrictions on regulated use cases, CCPA and state privacy law deletion rights, and general tort law. AI enhancement does not change the legal framework, though it may make the harm more severe.

What is the difference between a data broker and an AI company?

Traditional data brokers aggregate and sell personal records databases. AI companies build models and services. The categories overlap significantly: many data broker companies have added AI features, and AI companies purchase data broker datasets for training. The distinction is increasingly blurry.

The FTC's 2025–2026 Actions Against AI Data Practices

The FTC has significantly expanded its scrutiny of AI companies' data collection and use practices in 2025–2026, directly affecting the intersection of AI and personal data:

FTC Report on AI and Data Brokers (2024): The FTC's report on AI found that AI companies frequently source training data from data brokers, creating a pipeline from public records to commercial AI products that lacks consumer disclosure. The report recommended enhanced transparency requirements for AI training data sources.

FTC vs. Avast (2024): The FTC fined antivirus maker Avast $16.5 million for selling consumer browsing data to advertising companies despite claiming to protect users' privacy. The FTC characterized Avast's data-as-a-business-model as an unfair practice.

FTC's AI Guidelines (2025): The FTC issued guidance on AI systems and consumer protection, emphasizing that AI systems trained on personal data remain subject to existing FTC Act protections. Companies cannot use "it was in the training data" as a justification for ignoring consumer deletion requests.

CPPA AI Regulations (2026, in development): The California Privacy Protection Agency has published draft regulations for AI-driven automated decision-making systems, including requirements for consumer disclosure when AI decisions are made using personal data. These regulations, when final, will require companies to disclose when AI models use your data in decisions affecting you.

What this means for consumers:

Opt-out rights against AI training data use are expanding, especially under GDPR and California law
The FTC will increasingly pursue AI companies that acquire personal data through data broker purchases without adequate disclosure
California residents may gain specific AI disclosure and opt-out rights in 2026–2027 once CPPA AI regulations are finalized

Related Guides

Medical & Health Data Brokers Guide: How commercial health data feeds AI medical models.
Opt Out of AI Training Data 2026
What Are Data Brokers?
FTC 2024 Data Broker Report: What It Means for Consumers
Zero-Data Architecture for Privacy
How to Opt Out of Clearview AI
How to Opt Out of PimEyes
Complete Data Broker Opt-Out Guide

