DocuFetch is an intelligent document harvesting tool designed to automatically discover, track, and download academic papers and news articles based on user-specified keywords.
- Keyword-based document discovery
- Multiple source support:
- Academic Sources:
- arXiv
- Google Scholar
- Semantic Scholar
- CORE
- Crossref
- Unpaywall
- PubMed
- DOAJ (Directory of Open Access Journals)
- OpenAIRE
- News Sources:
- Various news websites via newspaper3k
- Academic Sources:
- Selective API Search: Choose to search only academic sources, only news sources, or both
- Advanced deduplication mechanism
- Configurable update intervals
- Document statistics tracking
- Terminal-based interaction
- Automatic PDF downloads for academic papers
- Preview mode to check document counts before downloading
pip install -r requirements.txtDocuFetch is a command-line tool with several commands:
# Add keywords to monitor
python src/main.py add "machine learning" "artificial intelligence"
# Remove keywords
python src/main.py remove "machine learning"
# Clear all keywords
python src/main.py clear
# List current keywords
python src/main.py list
# Configure document sources
python src/main.py sources --arxiv --no-scholar --news
python src/main.py sources --semantic-scholar --core --crossref
python src/main.py sources --unpaywall --pubmed --doaj --openaire
python src/main.py sources --list
# Configure API keys for sources that require them
python src/main.py api core YOUR_CORE_API_KEY
python src/main.py api crossref your.email@example.com
python src/main.py api unpaywall your.email@example.com
python src/main.py api pubmed YOUR_PUBMED_API_KEY
python src/main.py api doaj YOUR_DOAJ_API_KEY
# Set update interval (in hours)
python src/main.py interval 24
# Manually trigger document discovery (with preview)
python src/main.py fetch
python src/main.py fetch --academic-only # Search only academic sources
python src/main.py fetch --news-only # Search only news sources
# Preview documents without downloading
python src/main.py preview
python src/main.py preview --academic-only # Preview only academic sources
python src/main.py preview --news-only # Preview only news sources
# Open downloads directory
python src/main.py open
# Start continuous monitoring
python src/main.py monitor
python src/main.py monitor --academic-only # Monitor only academic sources
python src/main.py monitor --news-only # Monitor only news sources
# Show download statistics
python src/main.py statsdocuFetch/
├── src/ # Source code
│ ├── main.py # Entry point and CLI interface
│ ├── sources/ # Document source implementations
│ │ ├── __init__.py
│ │ ├── academic.py # Consolidated academic paper sources
│ │ ├── news.py # News article sources
│ │ └── manager.py # Source management
│ └── utils.py # Utility functions
├── logs/ # Application logs
├── requirements.txt # Project dependencies
└── README.md # Project documentation
DocuFetch stores its configuration in ~/.docufetch/config.json. The default configuration is:
{
"keywords": [],
"sources": {
"arxiv": true,
"scholar": true,
"news": true,
"semantic_scholar": false,
"core": false,
"crossref": false,
"unpaywall": false,
"pubmed": false,
"doaj": false,
"openaire": false
},
"api_keys": {
"core": "",
"crossref_email": "",
"unpaywall_email": "",
"pubmed_api_key": "",
"doaj_api_key": "",
"openaire_api_key": ""
},
"update_interval": 12,
"max_results_per_source": 50,
"download_pdfs": true
}Downloaded documents are stored in ~/DocuFetch_Downloads/:
- Academic papers:
~/DocuFetch_Downloads/academic/ - News articles:
~/DocuFetch_Downloads/news/
DocuFetch implements an advanced deduplication mechanism to prevent downloading the same document from different sources:
- Each document is assigned a unique ID based on its title and authors
- When a document is discovered, its ID is checked against previously downloaded documents
- If a match is found, the document is skipped to avoid duplication
- Deduplication metadata is stored in
~/DocuFetch_Downloads/academic/dedup/
Some sources require API keys or email addresses for better rate limits:
- CORE API: Requires an API key. Register at CORE API
- Crossref API: Benefits from an email address for better rate limits
- Unpaywall API: Requires an email address for better rate limits
- PubMed API: Requires an API key. Register at PubMed API
- DOAJ API: Requires an API key. Register at DOAJ API
- OpenAIRE API: Requires an API key. Register at OpenAIRE API
- Consolidated Academic Sources: All academic source implementations have been merged into a single file (
academic.py) for better maintainability - Selective API Search: Added the ability to search only academic sources, only news sources, or both using the
--academic-onlyand--news-onlyflags - Improved Code Organization: Streamlined the codebase by removing redundant files and improving the overall structure
requests: HTTP requestsbeautifulsoup4: HTML parsingscholarly: Google Scholar APInewspaper3k: News article scrapingarxiv: arXiv APItqdm: Progress barscolorama: Terminal colorspyfiglet: ASCII art bannersschedule: Task schedulingpython-dotenv: Environment variable management
MIT