Using LLMs to Unify and Enrich Data Across Multiple Sources
LLMs can unify and standardize threat data across disparate sources without complex integration projects. Learn practical approaches for combining threat feeds, logs, and incident data into enriched intelligence.
Organizations today face significant challenges with data fragmentation. Critical information is distributed across multiple databases, APIs, and third-party sources, making comprehensive analysis difficult. Large Language Models (LLMs) offer practical solutions for unifying disparate data sources and enriching them with valuable metadata.
The Data Integration Challenge
Most enterprises maintain data across numerous systems:
- Internal databases with proprietary information
- Third-party threat intelligence feeds
- SIEM and log management platforms
- Vulnerability scanning results
- Incident management systems
- External APIs providing contextual data
Each system uses different schemas, update frequencies, and data formats. Traditional integration approaches require complex ETL processes, careful schema design, and continuous maintenance as sources change.
How LLMs Transform Multi-Source Data
LLMs bring several technical advantages to data integration problems:
Schema Harmonization
LLMs can understand semantic relationships between differently structured data. They recognize that "threat_actor" in one system might correspond to "attacker_id" in another without requiring explicit mapping rules.
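As a rough sketch of what this looks like in code, the Python below asks a model to map two differently named feed records onto a canonical schema. The field names, the canonical schema, and the call_llm helper are illustrative stand-ins rather than any particular product's API.

import json

# Two feeds describing the same actor with different field names (hypothetical examples).
feed_a = {"threat_actor": "FIN7", "first_seen": "2023-10-02", "ioc": "198.51.100.12"}
feed_b = {"attacker_id": "FIN7", "observed_at": "2023-10-04", "indicator_value": "198.51.100.12"}

CANONICAL_FIELDS = ["actor_name", "first_observed", "indicator"]

def call_llm(prompt: str) -> str:
    # Stand-in for whichever LLM client you use (hosted API or local model).
    raise NotImplementedError

prompt = f"""Map each record's fields onto this canonical schema: {CANONICAL_FIELDS}.
Return only JSON of the form {{"feed_a": {{"source_field": "canonical_field"}}, "feed_b": {{...}}}}.

feed_a: {json.dumps(feed_a)}
feed_b: {json.dumps(feed_b)}"""

mapping = json.loads(call_llm(prompt))
# Expected shape: {"feed_a": {"threat_actor": "actor_name", ...}, "feed_b": {"attacker_id": "actor_name", ...}}

The mapping can then drive a conventional transform step, so the model is only consulted when a new or changed schema appears.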
Contextual Enrichment
Beyond basic field matching, LLMs can generate metadata that spans sources (a short sketch follows this list):
- Extracting entities from unstructured text and linking them to structured records
- Identifying relationships between events across different systems
- Generating risk assessments based on patterns across multiple data points
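A minimal enrichment sketch along these lines, again in Python with the hypothetical call_llm stand-in and made-up note and log values, might look like this:

import json

def call_llm(prompt: str) -> str:
    # Stand-in for whichever LLM client you use.
    raise NotImplementedError

# Unstructured analyst note and a structured log row describing (possibly) the same activity.
incident_note = (
    "Analyst observed repeated logins to db-prod-11 from 198.51.100.12 shortly after "
    "a phishing email carrying invoice_2291.xlsx was reported by the finance team."
)
log_record = {"src_ip": "198.51.100.12", "dst_host": "db-prod-11",
              "action": "login_failure", "count": 47}

prompt = f"""Extract the entities (IPs, hosts, files) mentioned in the note, link each one to the
structured log record where possible, and describe any relationship between the two events.
Return JSON with keys "entities", "links", and "relationships".

Note: {incident_note}
Log record: {json.dumps(log_record)}"""

enrichment = json.loads(call_llm(prompt))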
Practical Example: Unified Threat Intelligence
Security teams typically consume threat intelligence from multiple sources with minimal integration between them. Here's how an LLM-based approach works in practice:
Input Sources:
- Internal security logs (IP addresses, timestamps, action types)
- Commercial threat feed (known malicious indicators, actor attribution)
- OSINT feeds (CVE information, exploit availability)
- Historical incident records (past attack patterns, response documentation)
Example Prompt:
Process the following security data sources and create unified threat intelligence records:
1. Internal logs showing connection attempts from IP 198.51.100.12 to internal servers
2. AlienVault OTX data on the same IP address
3. VirusTotal reports for related file hashes
4. Recent CVE records for the affected systems
Identify entities across all sources, establish relationships between them, and provide a consolidated threat assessment with confidence levels.
LLM Output:
{
"unified_threat_record": {
"primary_indicators": {
"ip_address": "198.51.100.12",
"associated_domains": ["malicious-payload.example.com", "c2server.example.net"],
"file_hashes": ["8a91681d38a069e28619034aa1625af19884d5c5", "d41d8cd98f00b204e9800998ecf8427e"]
},
"activity_summary": {
"first_observed": "2023-11-15T08:42:13Z",
"last_observed": "2023-11-17T14:22:01Z",
"internal_targets": ["web-srv-04", "db-prod-11"],
"techniques": ["Initial Access:Phishing", "Credential Access:Brute Force"],
"connection_patterns": "Irregular access attempts following predictable timing intervals of 37-42 minutes"
},
"cross_source_context": {
"actor_attribution": {
"name": "APT-41",
"confidence": "medium",
"evidence": "Tool patterns and infrastructure overlap with previously observed campaigns",
"contradicting_evidence": "Timing patterns differ from typical APT-41 operations"
},
"associated_campaign": {
"name": "COPPER TYPHOON",
"target_industries": ["Financial Services", "Healthcare"],
"estimated_timeframe": "October 2023 - Present"
},
"vulnerability_exploitation": {
"cves": ["CVE-2023-32342", "CVE-2023-29991"],
"patch_status": "Patched on 7/19 internal systems, 4 systems pending updates"
}
},
"derived_insights": {
"risk_assessment": "High - Active exploitation of systems with partial patch deployment",
"similar_historical_incidents": ["INC-2022-0113", "INC-2023-0027"],
"estimated_attack_stage": "Lateral Movement - Post initial compromise",
"recommended_mitigations": [
"Isolate unpatched systems until updates can be applied",
"Implement additional authentication monitoring for affected systems",
"Search for indicators of persistence mechanisms matching COPPER TYPHOON tactics"
]
}
}
}
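To run an example like this programmatically, the prompt can be assembled from per-source exports and the JSON response checked before it enters the intelligence store. The sketch below assumes hypothetical export file names and the same call_llm stand-in; the required-keys check simply mirrors the structure shown above.

import json
from pathlib import Path

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

# Hypothetical per-source exports gathered by lightweight connectors.
sources = {
    "internal_logs": Path("logs_198.51.100.12.json").read_text(),
    "otx": Path("otx_198.51.100.12.json").read_text(),
    "virustotal": Path("vt_related_hashes.json").read_text(),
    "cves": Path("cve_affected_systems.json").read_text(),
}

sections = "\n\n".join(f"### {name}\n{payload}" for name, payload in sources.items())
prompt = (
    "Process the following security data sources and create a unified threat intelligence "
    "record. Identify entities across all sources, establish relationships between them, and "
    "provide a consolidated threat assessment with confidence levels. Respond with JSON only.\n\n"
    + sections
)

record = json.loads(call_llm(prompt))

# Basic sanity check before the record enters the intelligence store.
required = {"primary_indicators", "activity_summary", "derived_insights"}
missing = required - record.get("unified_threat_record", {}).keys()
if missing:
    raise ValueError(f"LLM output is missing expected sections: {missing}")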
Example: Merging Structured and Unstructured Intelligence
Security analysts often need to connect formal threat reports with unstructured information from news articles, blog posts, and security research.
Prompt:
Given the following data sources, identify key information about the recently discovered "BlackMamba" ransomware operation:
1. Structured CTI feed data on IOCs and TTPs (STIX format)
2. Security blog posts discussing technical details of the malware
3. Internal incident reports mentioning potential related activity
4. News articles about recent victims in the manufacturing sector
Extract and unify information across these sources to create a comprehensive profile of this threat.
Output:
BLACKMAMBA RANSOMWARE ANALYSIS
OVERVIEW:
BlackMamba is a ransomware operation first observed in October 2023. Unlike
typical ransomware, it uses a two-stage encryption process and targets backup
systems specifically. Evidence from multiple sources indicates the operation
is linked to former members of the disbanded REvil group.
TECHNICAL INDICATORS:
- C2 Infrastructure:
* Primary domains: mamba-control[.]net, backup-verify[.]com, secure-checker[.]org
* IP ranges: 194.26.x.x, 87.231.x.x
* Uses Cobalt Strike BEACON (Version 4.7) for initial access
- Malware Characteristics:
* SHA256: e9d53d9f81e364a76183b764662ca2aee1f38929834d9fc48e3aeeb63cbe7b96
* Initial infection via malicious Excel documents with hidden macro functions
* Uses modified EternalBlue exploit for lateral movement
* Encryption targets database files first, then system files
* Leaves ransom note as "recover_instructions_[random].html"
CAMPAIGN TIMELINE:
- 2023-10-12: First samples identified in VirusTotal
- 2023-10-15: First public victim (German manufacturing firm)
- 2023-11-03: Updated variant detected with enhanced anti-VM capabilities
- 2023-11-14: First observed in healthcare sector
- 2023-11-18: Internal log evidence of scanning activity from associated IPs
VICTIM TARGETING:
- Primary sectors: Manufacturing (68%), Healthcare (17%), Financial (8%)
- Geographic focus: Western Europe (52%), North America (31%)
- Target selection appears to prioritize organizations with:
* Annual revenue >$100M
* Windows Server 2016 infrastructure
* Networked industrial control systems
RELATION TO INTERNAL EVENTS:
- Match with scanning activity detected on Nov 18-21 (incident #IR-2023-42)
- Partial overlap with TTPs observed in January breach attempt
- No confirmed infections in our environment
ASSESSMENT:
- Attribution: High confidence of connection to former REvil operators
- Sophistication: Medium-high (innovative techniques, professional operation)
- Threat level: High for manufacturing sector entities with matching profile
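One way to feed an analysis like this is to pull indicator patterns straight out of the STIX bundle (which is plain JSON) and pass them to the model alongside the unstructured excerpts. The file names below are hypothetical and call_llm again stands in for a real client.

import json
from pathlib import Path

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

# A STIX 2.x bundle is plain JSON, so indicator patterns can be extracted without a dedicated library.
bundle = json.loads(Path("blackmamba_bundle.json").read_text())
indicators = [obj["pattern"] for obj in bundle.get("objects", []) if obj.get("type") == "indicator"]

blog_excerpts = Path("blackmamba_blog_notes.txt").read_text()
incident_summaries = Path("related_internal_incidents.txt").read_text()

prompt = f"""Create a comprehensive profile of the "BlackMamba" ransomware operation.
Unify the structured indicators with the unstructured reporting, note where sources agree
or conflict, and flag any claim that is supported by only a single source.

STIX indicator patterns:
{json.dumps(indicators, indent=2)}

Security blog excerpts:
{blog_excerpts}

Internal incident summaries:
{incident_summaries}"""

profile = call_llm(prompt)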
Implementation Approach
A practical implementation strategy for LLM-based data integration includes the following elements (a pipeline sketch follows the list):
- Data access layer - Create lightweight connectors to each source system
- Contextual prompting - Develop specialized prompts for different integration scenarios
- Enrichment pipeline - Process source data through LLMs to generate unified records
- Verification mechanisms - Implement confidence scoring and human review workflows
- Storage solution - Create appropriate data structures for the enriched, unified data
- Query interfaces - Build specialized prompts for different analytical questions
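Tying several of these steps together, a minimal end-to-end sketch might look like the following. fetch_sources, store, the JSON shape, and the confidence threshold are placeholders to adapt to your own connectors and review workflow.

import json

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

def fetch_sources(entity: str) -> dict:
    # Data access layer: lightweight connectors returning raw payloads keyed by source name.
    return {"logs": "...", "threat_feed": "...", "incidents": "..."}  # placeholders

def enrich(entity: str) -> dict:
    # Contextual prompting + enrichment: one unified record per entity of interest.
    payloads = fetch_sources(entity)
    sections = "\n\n".join(f"### {name}\n{payload}" for name, payload in payloads.items())
    prompt = ("Unify the sources below into one JSON threat record for "
              f"{entity}. Include a top-level 'confidence' field between 0 and 1.\n\n" + sections)
    return json.loads(call_llm(prompt))

def store(record: dict) -> None:
    # Storage stand-in: write to whatever backs your intelligence store.
    print(json.dumps(record, indent=2))

record = enrich("198.51.100.12")
# Verification: low-confidence output goes to an analyst review queue, not straight to storage.
if record.get("confidence", 0) < 0.6:
    print("Queued for analyst review")
else:
    store(record)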
Technical Considerations
Several technical factors should be addressed when implementing this approach (see the sketch after this list):
- Processing efficiency - Batching related records to minimize API calls
- Identity resolution - Establishing confidence thresholds for entity matching
- Temporal alignment - Handling different update frequencies across sources
- Schema evolution - Adapting to changes in source data structures
- Context augmentation - Providing domain-specific context in prompts for optimal LLM performance
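The batching and identity-resolution concerns in particular lend themselves to small helpers. The sketch below is illustrative only; the batch size, match threshold, and JSON shape of the model's judgements are assumptions.

import json

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

def batch(records: list, size: int = 25) -> list:
    # Processing efficiency: group related records so one call covers many rows.
    return [records[i:i + size] for i in range(0, len(records), size)]

MATCH_THRESHOLD = 0.8  # identity resolution: only merge entities at or above this confidence

def resolve_entities(candidate_pairs: list) -> list:
    # Ask the model to judge each candidate pair; keep only confident matches.
    prompt = (
        "For each pair below, return a JSON list of objects with keys "
        "'pair_index', 'same_entity' (bool) and 'confidence' (0-1).\n"
        + json.dumps(candidate_pairs)
    )
    judgements = json.loads(call_llm(prompt))
    return [j for j in judgements if j["same_entity"] and j["confidence"] >= MATCH_THRESHOLD]

# Example: candidate pairs produced by a cheap blocking step (e.g. shared IP or similar name).
pairs = [{"a": {"attacker_id": "FIN7"}, "b": {"threat_actor": "Carbanak Group"}}]
matches = resolve_entities(pairs)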
Practical Applications
This approach has demonstrated value in several areas:
Threat Intelligence Fusion
Security teams use LLMs to create unified threat intelligence by combining data from commercial feeds, open-source intelligence, internal security logs, and historical incident records. The enriched output provides context that no single source contains.
Supply Chain Risk Management
Organizations monitor suppliers by unifying data from financial systems, news sources, compliance databases, and operational metrics. LLMs identify risk patterns that span these sources, such as connecting production delays with regulatory issues and financial indicators.
Competitive Intelligence
Market analysis teams use LLMs to blend structured data on competitor products with unstructured information from earnings calls, social media, patent filings, and customer feedback. The unified view reveals strategy shifts and market positioning that would be missed when analyzing each source independently.
Conclusion
LLMs represent a practical approach to data integration challenges that have traditionally required complex engineering solutions. By leveraging their semantic understanding capabilities, organizations can quickly unify disparate data sources and generate valuable cross-source insights. While not replacing enterprise data integration for core systems, this approach offers a flexible complement that can address targeted use cases with significantly less implementation overhead.