This document provides a comprehensive overview of the BlackWeb project, including its purpose, architecture, components, and deployment model. BlackWeb is a domain blocklist aggregation and optimization system designed specifically for Squid-Cache proxy servers.
For detailed information about specific subsystems:
BlackWeb is a domain blocklist aggregation system that collects, processes, and unifies over 100 public blocklists containing domains associated with:
The system processes raw blocklists through an 8-stage validation pipeline to produce a single, optimized file (blackweb.txt) formatted for Squid-Cache ACL rules. Unlike other blocklist aggregators, BlackWeb specifically targets proxy server deployments and emphasizes data quality through DNS validation, TLD verification, and Squid compatibility testing.
Important: BlackWeb is not a blacklist verification service. It consolidates existing public blocklists without independently verifying each domain. Users must understand that domains are blocked based on their inclusion in upstream sources (README.md733-735).
Sources: README.md1-27 README-es.md1-27
The following table presents current statistics for the BlackWeb blocklist:
| Metric | Value | Description |
|---|---|---|
| Blocked Domains | 4,772,375 | Total unique domains in final blocklist |
| File Size | 118.8 MB | Uncompressed size of blackweb.txt |
| Format | Squid ACL | Domain entries prefixed with . for subdomain matching |
| Update Frequency | Manual/On-demand | Users run bwupdate.sh to regenerate |
| Source Lists | 100+ | Public blocklists aggregated (see Public Blocklist Sources) |
| Processing Stages | 8 | Sequential validation and optimization pipeline |
| Validation Method | DNS Lookup | Dual-pass DNS verification to exclude invalid domains |
The final domain count reflects a reduction of approximately 35% from the raw input (~7.2M domains) after deduplication, DNS validation, and filtering (README.md24-26).
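As a quick check of the reduction figure, the arithmetic can be reproduced from the two numbers quoted above. Note that the raw input is only given as a rounded value (~7.2M), so the result is approximate; with these inputs it comes out at about 34%, close to the quoted figure.

```python
# Both values are taken from the text above; the raw count is a rounded estimate.
raw_domains = 7_200_000      # approximate raw input (~7.2M domains)
final_domains = 4_772_375    # "Blocked Domains" row of the statistics table

# Fraction of raw entries removed by deduplication, DNS validation and filtering.
reduction = 1 - final_domains / raw_domains
print(f"reduction: {reduction:.1%}")
```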
Sources: README.md20-26
BlackWeb follows a three-phase architecture: Generation, Output, and Deployment.
Architecture: BlackWeb System Context
The generation phase aggregates and processes external blocklists using bwupdate.sh as the main orchestrator. The output phase produces the final blackweb.txt file with an accompanying SHA256 checksum. The deployment phase integrates this blocklist into Squid-Cache proxy servers to enforce access control policies.
Sources: README.md18 README.md288-344
| Component | Type | Purpose | Location |
|---|---|---|---|
| bwupdate.sh | Bash Script | Main update orchestrator; executes 8-stage processing pipeline | bwupdate/bwupdate.sh |
| parse_domain.py | Python Script | Domain parsing and normalization | bwupdate/tools/parse_domain.py |
| debugerror.py | Python Script | Squid error log analysis and correction | bwupdate/tools/debugerror.py |
| checksources.sh | Bash Script | Source verification utility for individual domains | bwupdate/tools/checksources.sh |
| squid_install.sh | Bash Script | Squid-Cache installation and configuration helper | Embedded in README.md312-344 |
| List File | Entries | Purpose | Used In Stage |
|---|---|---|---|
| debugbl.txt | ~7,000 | Debug blacklist for testing | Domain Debugging |
| debugwl.txt | ~292,000 | Whitelist (Google, Microsoft domains) | Domain Debugging |
| blocktlds.txt | 136 TLDs | Malicious TLD blocking | TLD Validation |
| streaming.txt | 899 | Optional streaming service domains | Post-deployment |
| allowdomains.txt | User-defined | Essential services whitelist | Post-deployment |
| blockdomains.txt | User-defined | Custom domain blocks | Post-deployment |
See Configuration Lists Reference for detailed documentation of each list file.
| Artifact | Format | Purpose | Generation |
|---|---|---|---|
| blackweb.txt | Squid ACL format | Final blocklist (.domain.com format) | Generated by bwupdate.sh |
| blackweb.txt.sha256 | SHA256 hash | Integrity verification | Generated alongside blackweb.txt |
| blackweb.tar.gz | Compressed archive | Distribution format (may be multipart) | Manual compression |
Sources: README.md296-344 README.md346-348
The bwupdate.sh script orchestrates an 8-stage sequential pipeline. Each stage performs specific transformations and validations:
Diagram: 8-Stage Processing Pipeline in bwupdate.sh
- … .sub.example.com → .example.com), applies debugwl.txt whitelist (Domain Debugging and Normalization)
- … bücher.com → xn--bcher-kva.com) (README.md400-428)
- … PROCS variable) to exclude non-existent domains (DNS Validation and Parallel Processing)
- … .gov, .mil, etc.) (README.md529-547)
- … blackweb.txt against Squid-Cache, uses debugerror.py to correct errors (Squid Testing and Error Correction)

The pipeline is resumable from the DNS Lookup stage if interrupted (README.md559-563).
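The normalization, whitelist, and IDN conversion steps mentioned above can be illustrated with a minimal Python sketch. This is not the code of bwupdate.sh or parse_domain.py; the function names are hypothetical, and two simplifications are assumed: the parent domain is taken to be the last two labels (which is wrong for suffixes such as .co.uk, where a public suffix list would be needed), and Python's built-in `idna` codec (IDNA 2003) stands in for `idn2`.

```python
def to_ascii(domain: str) -> str:
    """Lowercase a domain, drop surrounding dots, and convert it to punycode."""
    # The built-in "idna" codec implements IDNA 2003; idn2 may differ in edge cases.
    return domain.strip().strip(".").lower().encode("idna").decode("ascii")


def collapse_subdomain(domain: str) -> str:
    """Reduce a domain to a leading-dot parent entry (assumed: last two labels)."""
    labels = to_ascii(domain).split(".")
    return "." + ".".join(labels[-2:])


def normalize(raw_domains, whitelist):
    """Collapse, deduplicate, and remove whitelisted entries (hypothetical helper)."""
    allowed = {collapse_subdomain(d) for d in whitelist if d.strip()}
    entries = {collapse_subdomain(d) for d in raw_domains if d.strip()}
    return sorted(entries - allowed)


if __name__ == "__main__":
    raw = [".sub.example.com", "example.com", "bücher.com", "www.google.com"]
    print(normalize(raw, whitelist=["google.com"]))
```

Using a set for both the entries and the whitelist makes deduplication and whitelist removal a single set difference, which matters at the scale of millions of lines.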
Sources: README.md288-558
This diagram maps high-level system concepts to specific code files and constructs:
Diagram: Code Entity Relationships in BlackWeb
This diagram shows how major code files and configuration constructs interact during the generation and deployment process. The PROCS variable mentioned is defined in README.md498-502 and controls parallel DNS query concurrency.
Sources: README.md104-284 README.md296-558
BlackWeb outputs domains in Squid-Cache dstdomain ACL format. Each line represents a domain with leading dot notation for subdomain matching:
Squid ACL Format Example:
.example.com # Matches example.com and all subdomains
.malware.net # Matches malware.net and *.malware.net
.phishing-site.org # Matches phishing-site.org and all subdomains
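The leading-dot semantics in the example above can be expressed as a small Python sketch. It only models the matching rule described here (a leading dot matches the domain itself and every subdomain, while an entry without the dot matches the exact host only); it is not Squid's implementation, and `dstdomain_matches` is a hypothetical name.

```python
def dstdomain_matches(entry: str, host: str) -> bool:
    """Return True if host is covered by a dstdomain-style entry."""
    entry = entry.lower()
    host = host.lower().rstrip(".")
    if entry.startswith("."):
        # ".example.com" covers "example.com" and anything ending in ".example.com".
        return host == entry[1:] or host.endswith(entry)
    # Without the leading dot only the exact host name matches.
    return host == entry


if __name__ == "__main__":
    for host in ("example.com", "www.example.com", "badexample.com"):
        print(host, dstdomain_matches(".example.com", host))
```

Comparing against the entry including its dot is what prevents `badexample.com` from being treated as a subdomain of `example.com`.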
Basic Squid Configuration (README.md104-122):
Advanced Configuration with Multiple ACLs (README.md252-284):
Critical: ACL rule order matters. Squid evaluates rules sequentially, so allowdomains must appear before blackweb to prevent false positives. See ACL Rule Evaluation Order for detailed explanation.
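To see why the order matters, the sequential evaluation described above can be modelled as a first-match loop. This is a simplified Python model, not Squid code: the ACL names `allowdomains` and `blackweb` come from the text, while the example entries, the `first_match` helper, and the `default` fallback are assumptions made for illustration.

```python
def in_acl(entries, host: str) -> bool:
    """Leading-dot suffix match; every entry is assumed to start with a dot."""
    dotted = "." + host.lower().strip(".")
    return any(dotted.endswith(entry) for entry in entries)


def first_match(rules, acls, host, default="deny"):
    """Return the action of the first rule whose ACL matches the host."""
    for action, acl_name in rules:
        if in_acl(acls[acl_name], host):
            return action
    return default  # hypothetical fallback when no rule matches


# Illustrative data: google.com appears in both lists (a false positive upstream).
ACLS = {
    "allowdomains": [".google.com"],
    "blackweb": [".google.com", ".malware.net"],
}
allow_first = [("allow", "allowdomains"), ("deny", "blackweb")]
deny_first = [("deny", "blackweb"), ("allow", "allowdomains")]

if __name__ == "__main__":
    print(first_match(allow_first, ACLS, "mail.google.com"))
    print(first_match(deny_first, ACLS, "mail.google.com"))
```

With the allow rule first, the whitelisted host is accepted before the blocklist is consulted; with the order reversed, the blocklist wins and the whitelist entry is never reached.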
Sources: README.md104-284
BlackWeb supports three deployment tiers based on user requirements:
- … blackweb.txt from GitHub releases

Target Users: System administrators needing standard malware/ad blocking
File: README.md40-102
blackweb.txt plus optional lists:
- allowdomains.txt - Whitelist essential services
- blockdomains.txt - Custom blocks
- blocktlds.txt - TLD-level blocking
- streaming.txt - Bandwidth control

Target Users: Network administrators with specific filtering requirements
File: README.md124-284
- … wget, git, curl, python3, etc.)
- … bwupdate.sh to generate custom blackweb.txt

Target Users: Security researchers, organizations with custom blocklist sources
Files: README.md288-344
Sources: README.md36-344
BlackWeb is distributed through GitHub in compressed archive format. Large files may be split into multipart archives:
Standard Download (README.md42-46):
Multipart Archive Handling (README.md48-94):
When GitHub's file size limit is exceeded, archives are split (.aa, .ab, .ac, etc.). A bash script concatenates parts before extraction.
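The README handles this with a bash script; purely as an illustration of the concatenation step, here is an equivalent sketch in Python. The function name and arguments are hypothetical, and it assumes the parts sit in one directory and use two-letter suffixes (.aa, .ab, ...) as described above.

```python
import re
import shutil
from pathlib import Path


def join_parts(directory, archive_name, output_path):
    """Concatenate archive_name.aa, archive_name.ab, ... into output_path."""
    # Keep only two-letter split suffixes; lexical order equals split order.
    parts = sorted(
        p for p in Path(directory).glob(archive_name + ".*")
        if re.fullmatch(r"\.[a-z]{2}", p.suffix)
    )
    if not parts:
        raise FileNotFoundError(f"no parts found for {archive_name}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, do not load whole part
    return [p.name for p in parts]
```

A call such as `join_parts(".", "blackweb.tar.gz", "blackweb.tar.gz")` would rebuild the archive in the current directory, after which it can be extracted as usual.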
Integrity Verification (README.md96-102):
The checksum verification prevents deployment of corrupted or tampered blocklists.
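As a sketch of what such a check involves, the following Python snippet recomputes the SHA256 digest and compares it with the published one. It assumes the checksum file uses the common `sha256sum` layout (hex digest first, then the file name); the function names are hypothetical and this is not the command sequence from the README.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large blocklist need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, checksum_path):
    """Compare the file digest with the first token of the checksum file."""
    # Assumed layout: "<hex digest>  <file name>", as written by sha256sum.
    with open(checksum_path, "r", encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return sha256_of(path) == expected
```

For example, `verify_checksum("blackweb.txt", "blackweb.txt.sha256")` returns `False` if the file was truncated or modified after the checksum was produced.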
Sources: README.md42-102
Required packages (README.md302-308):
- wget, git, curl - Download utilities
- libnotify-bin - Desktop notifications
- perl - Text processing
- tar, rar, unrar, unzip, zip, gzip - Archive handling
- python-is-python3 - Python symlink
- idn2 - Internationalized domain name conversion
- iconv - Character encoding conversion

Hardware Considerations:
The DNS validation stage is resource-intensive. The PROCS variable (README.md498-502) controls parallelism:
- PROCS=$(nproc) - One process per CPU core
- PROCS=$(($(nproc) * 2)) - Two processes per core
- PROCS=$(($(nproc) * 4)) - Four processes per core
- PROCS=$(($(nproc) * 8)) - Use with caution on high-bandwidth networks

Sources: README.md302-327 README.md494-515
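The effect of a concurrency setting like PROCS can be illustrated with a small Python sketch of parallel DNS filtering. It does not reproduce bwupdate.sh: the `workers` parameter merely plays the role of PROCS, the function names are hypothetical, and retrying failed lookups in a second pass is an assumption about what "dual-pass" verification means, not something the source spells out.

```python
import os
import socket
from concurrent.futures import ThreadPoolExecutor


def resolves(domain: str) -> bool:
    """Return True if the system resolver finds any address for the domain."""
    try:
        socket.getaddrinfo(domain.lstrip("."), None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def filter_resolvable(domains, workers=None, resolver=resolves, passes=2):
    """Keep domains that resolve; failures are retried in later passes."""
    ordered = list(domains)
    pending = list(ordered)
    alive = set()
    workers = workers or os.cpu_count() or 1  # analogous to PROCS=$(nproc)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(passes):
            results = list(pool.map(resolver, pending))
            alive.update(d for d, ok in zip(pending, results) if ok)
            pending = [d for d, ok in zip(pending, results) if not ok]
            if not pending:
                break
    return [d for d in ordered if d in alive]
```

Because DNS lookups spend most of their time waiting on the network, the worker count can reasonably exceed the number of cores, which is consistent with the multiples of `nproc` listed above. The `resolver` argument exists so the filter can be exercised without network access.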
BlackWeb provides a utility script to identify which upstream sources contain a specific domain. This is essential for false positive investigation:
checksources.sh Usage (README.md738-757):
This transparency model requires users to contact upstream list maintainers for domain removal. Once removed from the source, the domain automatically disappears from BlackWeb in the next generation cycle (README.md733-736).
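The idea behind such a lookup can be sketched in a few lines of Python. This is not how checksources.sh is implemented; it assumes the upstream lists are already available as text lines, the source names are made up, and `find_sources` is a hypothetical helper that accepts plain, leading-dot, and hosts-file style entries.

```python
def find_sources(domain, sources):
    """Return the names of the sources that list the given domain.

    sources maps a source name to an iterable of text lines.
    """
    target = domain.strip().lower().lstrip(".")
    hits = []
    for name, lines in sources.items():
        for line in lines:
            # Drop trailing comments; hosts-style lines look like "0.0.0.0 host".
            tokens = line.split("#", 1)[0].split()
            if any(tok.lower().lstrip(".") == target for tok in tokens):
                hits.append(name)
                break
    return hits


if __name__ == "__main__":
    demo = {
        "list-a": ["# ads", "0.0.0.0 ads.example.com"],
        "list-b": ["tracker.test", ".ads.example.com"],
        "list-c": ["example.com"],
    }
    print(find_sources("ads.example.com", demo))
```

The returned names tell the user which maintainers to contact about a suspected false positive.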
See Source Verification (checksources.sh) for detailed utility documentation.
Sources: README.md733-757
BlackWeb is designed exclusively for Squid-Cache (README.md733):
BlackWeb aggregates public blocklists without independent domain verification (README.md734):
Updates are manual/on-demand (README.md292):
- … blackweb.txt updated by maintainers
- … bwupdate.sh

The system excludes government-related TLDs (.gov, .mil, etc.) by design (README.md529-547). This is a policy decision, not a technical limitation.
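A TLD-based exclusion of this kind amounts to a simple filter, sketched below in Python. The set of excluded TLDs here contains only the two examples named above; the full list used by BlackWeb is not reproduced, the function name is hypothetical, and only the final label of each domain is inspected.

```python
# Illustrative subset only: the text names ".gov, .mil, etc.".
EXCLUDED_TLDS = frozenset({"gov", "mil"})


def drop_excluded_tlds(domains, excluded=EXCLUDED_TLDS):
    """Remove entries whose final label is an excluded TLD."""
    kept = []
    for domain in domains:
        tld = domain.strip().rstrip(".").rsplit(".", 1)[-1].lower()
        if tld not in excluded:
            kept.append(domain)
    return kept


if __name__ == "__main__":
    print(drop_excluded_tlds([".example.com", ".agency.gov", ".base.MIL"]))
```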
Sources: README.md290-293 README.md727-757
Sources: README.md1-801 README-es.md1-801