Using LLMs to Unify and Enrich Data Across Multiple Sources
LLMs can unify and standardize threat data across disparate sources without complex integration projects. Learn practical approaches for combining threat feeds, logs, and incident data into enriched intelligence.
Organizations today face significant challenges with data fragmentation. Critical information is distributed across multiple databases, APIs, and third-party sources, making comprehensive analysis difficult. Large Language Models (LLMs) offer practical solutions for unifying disparate data sources and enriching them with valuable metadata.
The Data Integration Challenge
Most enterprises maintain data across numerous systems:
- Internal databases with proprietary information
- Third-party threat intelligence feeds
- SIEM and log management platforms
- Vulnerability scanning results
- Incident management systems
- External APIs providing contextual data
Each system uses different schemas, update frequencies, and data formats. Traditional integration approaches require complex ETL processes, careful schema design, and continuous maintenance as sources change.
How LLMs Transform Multi-Source Data
LLMs bring several technical advantages to data integration problems:
Schema Harmonization
LLMs can understand semantic relationships between differently structured data. They recognize that "threat_actor" in one system might correspond to "attacker_id" in another without requiring explicit mapping rules.
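As a rough sketch of what this looks like in code, the Python below asks a model to map two differently named feed records onto a canonical schema. The field names, the canonical schema, and the call_llm helper are illustrative stand-ins rather than any particular product's API.

import json

# Two feeds describing the same actor with different field names (hypothetical examples).
feed_a = {"threat_actor": "FIN7", "first_seen": "2023-10-02", "ioc": "198.51.100.12"}
feed_b = {"attacker_id": "FIN7", "observed_at": "2023-10-04", "indicator_value": "198.51.100.12"}

CANONICAL_FIELDS = ["actor_name", "first_observed", "indicator"]

def call_llm(prompt: str) -> str:
    # Stand-in for whichever LLM client you use (hosted API or local model).
    raise NotImplementedError

prompt = f"""Map each record's fields onto this canonical schema: {CANONICAL_FIELDS}.
Return only JSON of the form {{"feed_a": {{"source_field": "canonical_field"}}, "feed_b": {{...}}}}.

feed_a: {json.dumps(feed_a)}
feed_b: {json.dumps(feed_b)}"""

mapping = json.loads(call_llm(prompt))
# Expected shape: {"feed_a": {"threat_actor": "actor_name", ...}, "feed_b": {"attacker_id": "actor_name", ...}}

The mapping can then drive a conventional transform step, so the model is only consulted when a new or changed schema appears.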
Contextual Enrichment
Beyond basic field matching, LLMs can generate metadata that spans sources (a short sketch follows this list):
- Extracting entities from unstructured text and linking them to structured records
- Identifying relationships between events across different systems
- Generating risk assessments based on patterns across multiple data points
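A minimal enrichment sketch along these lines, again in Python with the hypothetical call_llm stand-in and made-up note and log values, might look like this:

import json

def call_llm(prompt: str) -> str:
    # Stand-in for whichever LLM client you use.
    raise NotImplementedError

# Unstructured analyst note and a structured log row describing (possibly) the same activity.
incident_note = (
    "Analyst observed repeated logins to db-prod-11 from 198.51.100.12 shortly after "
    "a phishing email carrying invoice_2291.xlsx was reported by the finance team."
)
log_record = {"src_ip": "198.51.100.12", "dst_host": "db-prod-11",
              "action": "login_failure", "count": 47}

prompt = f"""Extract the entities (IPs, hosts, files) mentioned in the note, link each one to the
structured log record where possible, and describe any relationship between the two events.
Return JSON with keys "entities", "links", and "relationships".

Note: {incident_note}
Log record: {json.dumps(log_record)}"""

enrichment = json.loads(call_llm(prompt))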
Practical Example: Unified Threat Intelligence
Security teams typically consume threat intelligence from multiple sources with minimal integration between them. Here's how an LLM-based approach works in practice:
Input Sources:
- Internal security logs (IP addresses, timestamps, action types)
- Commercial threat feed (known malicious indicators, actor attribution)
- OSINT feeds (CVE information, exploit availability)
- Historical incident records (past attack patterns, response documentation)
Example Prompt:
Process the following security data sources and create unified threat intelligence records:
1. Internal logs showing connection attempts from IP 198.51.100.12 to internal servers
2. AlienVault OTX data on the same IP address
3. VirusTotal reports for related file hashes
4. Recent CVE records for the affected systems
Identify entities across all sources, establish relationships between them, and provide a consolidated threat assessment with confidence levels.
LLM Output:
{
"unified_threat_record": {
"primary_indicators": {
"ip_address": "198.51.100.12",
"associated_domains": ["malicious-payload.example.com", "c2server.example.net"],
"file_hashes": ["8a91681d38a069e28619034aa1625af19884d5c5", "d41d8cd98f00b204e9800998ecf8427e"]
},
"activity_summary": {
"first_observed": "2023-11-15T08:42:13Z",
"last_observed": "2023-11-17T14:22:01Z",
"internal_targets": ["web-srv-04", "db-prod-11"],
"techniques": ["Initial Access:Phishing", "Credential Access:Brute Force"],
"connection_patterns": "Irregular access attempts following predictable timing intervals of 37-42 minutes"
},
"cross_source_context": {
"actor_attribution": {
"name": "APT-41",
"confidence": "medium",
"evidence": "Tool patterns and infrastructure overlap with previously observed campaigns",
"contradicting_evidence": "Timing patterns differ from typical APT-41 operations"
},
"associated_campaign": {
"name": "COPPER TYPHOON",
"target_industries": ["Financial Services", "Healthcare"],
"estimated_timeframe": "October 2023 - Present"
},
"vulnerability_exploitation": {
"cves": ["CVE-2023-32342", "CVE-2023-29991"],
"patch_status": "Patched on 7/19 internal systems, 4 systems pending updates"
}
},
"derived_insights": {
"risk_assessment": "High - Active exploitation of systems with partial patch deployment",
"similar_historical_incidents": ["INC-2022-0113", "INC-2023-0027"],
"estimated_attack_stage": "Lateral Movement - Post initial compromise",
"recommended_mitigations": [
"Isolate unpatched systems until updates can be applied",
"Implement additional authentication monitoring for affected systems",
"Search for indicators of persistence mechanisms matching COPPER TYPHOON tactics"
]
}
}
}
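To run an example like this programmatically, the prompt can be assembled from per-source exports and the JSON response checked before it enters the intelligence store. The sketch below assumes hypothetical export file names and the same call_llm stand-in; the required-keys check simply mirrors the structure shown above.

import json
from pathlib import Path

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

# Hypothetical per-source exports gathered by lightweight connectors.
sources = {
    "internal_logs": Path("logs_198.51.100.12.json").read_text(),
    "otx": Path("otx_198.51.100.12.json").read_text(),
    "virustotal": Path("vt_related_hashes.json").read_text(),
    "cves": Path("cve_affected_systems.json").read_text(),
}

sections = "\n\n".join(f"### {name}\n{payload}" for name, payload in sources.items())
prompt = (
    "Process the following security data sources and create a unified threat intelligence "
    "record. Identify entities across all sources, establish relationships between them, and "
    "provide a consolidated threat assessment with confidence levels. Respond with JSON only.\n\n"
    + sections
)

record = json.loads(call_llm(prompt))

# Basic sanity check before the record enters the intelligence store.
required = {"primary_indicators", "activity_summary", "derived_insights"}
missing = required - record.get("unified_threat_record", {}).keys()
if missing:
    raise ValueError(f"LLM output is missing expected sections: {missing}")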
Example: Merging Structured and Unstructured Intelligence
Security analysts often need to connect formal threat reports with unstructured information from news articles, blog posts, and security research.
Prompt:
Given the following data sources, identify key information about the recently discovered "BlackMamba" ransomware operation:
1. Structured CTI feed data on IOCs and TTPs (STIX format)
2. Security blog posts discussing technical details of the malware
3. Internal incident reports mentioning potential related activity
4. News articles about recent victims in the manufacturing sector
Extract and unify information across these sources to create a comprehensive profile of this threat.
Output:
BLACKMAMBA RANSOMWARE ANALYSIS
OVERVIEW:
BlackMamba is a ransomware operation first observed in October 2023. Unlike
typical ransomware, it uses a two-stage encryption process and targets backup
systems specifically. Evidence from multiple sources indicates the operation
is linked to former members of the disbanded REvil group.
TECHNICAL INDICATORS:
- C2 Infrastructure:
* Primary domains: mamba-control[.]net, backup-verify[.]com, secure-checker[.]org
* IP ranges: 194.26.x.x, 87.231.x.x
* Uses Cobalt Strike BEACON (Version 4.7) for initial access
- Malware Characteristics:
* SHA256: e9d53d9f81e364a76183b764662ca2aee1f38929834d9fc48e3aeeb63cbe7b96
* Initial infection via malicious Excel documents with hidden macro functions
* Uses modified EternalBlue exploit for lateral movement
* Encryption targets database files first, then system files
* Leaves ransom note as "recover_instructions_[random].html"
CAMPAIGN TIMELINE:
- 2023-10-12: First samples identified in VirusTotal
- 2023-10-15: First public victim (German manufacturing firm)
- 2023-11-03: Updated variant detected with enhanced anti-VM capabilities
- 2023-11-14: First observed in healthcare sector
- 2023-11-18: Internal log evidence of scanning activity from associated IPs
VICTIM TARGETING:
- Primary sectors: Manufacturing (68%), Healthcare (17%), Financial (8%)
- Geographic focus: Western Europe (52%), North America (31%)
- Target selection appears to prioritize organizations with:
* Annual revenue >$100M
* Windows Server 2016 infrastructure
* Networked industrial control systems
RELATION TO INTERNAL EVENTS:
- Match with scanning activity detected on Nov 18-21 (incident #IR-2023-42)
- Partial overlap with TTPs observed in January breach attempt
- No confirmed infections in our environment
ASSESSMENT:
- Attribution: High confidence of connection to former REvil operators
- Sophistication: Medium-high (innovative techniques, professional operation)
- Threat level: High for manufacturing sector entities with matching profile
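One way to feed an analysis like this is to pull indicator patterns straight out of the STIX bundle (which is plain JSON) and pass them to the model alongside the unstructured excerpts. The file names below are hypothetical and call_llm again stands in for a real client.

import json
from pathlib import Path

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

# A STIX 2.x bundle is plain JSON, so indicator patterns can be extracted without a dedicated library.
bundle = json.loads(Path("blackmamba_bundle.json").read_text())
indicators = [obj["pattern"] for obj in bundle.get("objects", []) if obj.get("type") == "indicator"]

blog_excerpts = Path("blackmamba_blog_notes.txt").read_text()
incident_summaries = Path("related_internal_incidents.txt").read_text()

prompt = f"""Create a comprehensive profile of the "BlackMamba" ransomware operation.
Unify the structured indicators with the unstructured reporting, note where sources agree
or conflict, and flag any claim that is supported by only a single source.

STIX indicator patterns:
{json.dumps(indicators, indent=2)}

Security blog excerpts:
{blog_excerpts}

Internal incident summaries:
{incident_summaries}"""

profile = call_llm(prompt)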
Implementation Approach
A practical implementation strategy for LLM-based data integration includes the following elements (a pipeline sketch follows the list):
- Data access layer - Create lightweight connectors to each source system
- Contextual prompting - Develop specialized prompts for different integration scenarios
- Enrichment pipeline - Process source data through LLMs to generate unified records
- Verification mechanisms - Implement confidence scoring and human review workflows
- Storage solution - Create appropriate data structures for the enriched, unified data
- Query interfaces - Build specialized prompts for different analytical questions
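Tying several of these steps together, a minimal end-to-end sketch might look like the following. fetch_sources, store, the JSON shape, and the confidence threshold are placeholders to adapt to your own connectors and review workflow.

import json

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

def fetch_sources(entity: str) -> dict:
    # Data access layer: lightweight connectors returning raw payloads keyed by source name.
    return {"logs": "...", "threat_feed": "...", "incidents": "..."}  # placeholders

def enrich(entity: str) -> dict:
    # Contextual prompting + enrichment: one unified record per entity of interest.
    payloads = fetch_sources(entity)
    sections = "\n\n".join(f"### {name}\n{payload}" for name, payload in payloads.items())
    prompt = ("Unify the sources below into one JSON threat record for "
              f"{entity}. Include a top-level 'confidence' field between 0 and 1.\n\n" + sections)
    return json.loads(call_llm(prompt))

def store(record: dict) -> None:
    # Storage stand-in: write to whatever backs your intelligence store.
    print(json.dumps(record, indent=2))

record = enrich("198.51.100.12")
# Verification: low-confidence output goes to an analyst review queue, not straight to storage.
if record.get("confidence", 0) < 0.6:
    print("Queued for analyst review")
else:
    store(record)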
Technical Considerations
Several technical factors should be addressed when implementing this approach (see the sketch after this list):
- Processing efficiency - Batching related records to minimize API calls
- Identity resolution - Establishing confidence thresholds for entity matching
- Temporal alignment - Handling different update frequencies across sources
- Schema evolution - Adapting to changes in source data structures
- Context augmentation - Providing domain-specific context in prompts for optimal LLM performance
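The batching and identity-resolution concerns in particular lend themselves to small helpers. The sketch below is illustrative only; the batch size, match threshold, and JSON shape of the model's judgements are assumptions.

import json

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

def batch(records: list, size: int = 25) -> list:
    # Processing efficiency: group related records so one call covers many rows.
    return [records[i:i + size] for i in range(0, len(records), size)]

MATCH_THRESHOLD = 0.8  # identity resolution: only merge entities at or above this confidence

def resolve_entities(candidate_pairs: list) -> list:
    # Ask the model to judge each candidate pair; keep only confident matches.
    prompt = (
        "For each pair below, return a JSON list of objects with keys "
        "'pair_index', 'same_entity' (bool) and 'confidence' (0-1).\n"
        + json.dumps(candidate_pairs)
    )
    judgements = json.loads(call_llm(prompt))
    return [j for j in judgements if j["same_entity"] and j["confidence"] >= MATCH_THRESHOLD]

# Example: candidate pairs produced by a cheap blocking step (e.g. shared IP or similar name).
pairs = [{"a": {"attacker_id": "FIN7"}, "b": {"threat_actor": "Carbanak Group"}}]
matches = resolve_entities(pairs)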
Practical Applications
This approach has demonstrated value in several areas:
Threat Intelligence Fusion
Security teams use LLMs to create unified threat intelligence by combining data from commercial feeds, open-source intelligence, internal security logs, and historical incident records. The enriched output provides context that no single source contains.
Supply Chain Risk Management
Organizations monitor suppliers by unifying data from financial systems, news sources, compliance databases, and operational metrics. LLMs identify risk patterns that span these sources, such as connecting production delays with regulatory issues and financial indicators.
Competitive Intelligence
Market analysis teams use LLMs to blend structured data on competitor products with unstructured information from earnings calls, social media, patent filings, and customer feedback. The unified view reveals strategy shifts and market positioning that would be missed when analyzing each source independently.
Conclusion
LLMs represent a practical approach to data integration challenges that have traditionally required complex engineering solutions. By leveraging their semantic understanding capabilities, organizations can quickly unify disparate data sources and generate valuable cross-source insights. While not replacing enterprise data integration for core systems, this approach offers a flexible complement that can address targeted use cases with significantly less implementation overhead.