A tool for extracting, cleaning, and exporting URLs from Markdown files.
- Extract URLs: Recursively scan directories for Markdown files and extract all URLs
- URL Cleaning:
  - Remove tracking parameters (utm_source, fbclid, gclid, etc.)
  - Normalize YouTube URLs to canonical format (https://www.youtube.com/watch?v=VIDEO_ID)
- Filtering Options:
  - Filter by domain (e.g., only github.com URLs)
  - Filter by protocol (http, https, ftp, etc.)
- Multiple Output Formats:
  - Plain text output to terminal (stdout)
  - Text file with one URL per line
  - CSV file with URLs, source files, and link text
  - HTML bookmarks file importable into browsers
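
The extraction step can be pictured with a small sketch. This is not the tool's actual parser; it is a minimal illustration, assuming only inline links of the form `[text](url)`. The `Link` struct and `extract_inline_links` function are hypothetical names introduced here.

```rust
/// One extracted link: the visible text and the target URL.
/// (Hypothetical type, not taken from the tool's source.)
#[derive(Debug, PartialEq)]
struct Link {
    text: String,
    url: String,
}

/// Collects inline links of the form [text](url) from a Markdown string.
/// Simplification: image links and link titles are not treated specially,
/// and reference-style links or bare URLs are not detected.
fn extract_inline_links(markdown: &str) -> Vec<Link> {
    let mut links = Vec::new();
    let mut rest = markdown;
    // Look for the "](" that separates link text from link target.
    while let Some(mid) = rest.find("](") {
        let before = &rest[..mid];
        let after = &rest[mid + 2..];
        // The text starts at the nearest '[' before the separator;
        // the URL ends at the next ')' after it.
        if let (Some(open), Some(close)) = (before.rfind('['), after.find(')')) {
            links.push(Link {
                text: before[open + 1..].to_string(),
                url: after[..close].trim().to_string(),
            });
            rest = &after[close + 1..];
        } else {
            rest = after;
        }
    }
    links
}
```

Keeping the link text next to the URL is what would make a CSV export with a link-text column possible; a real implementation would more likely rely on a Markdown parser than on string scanning.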
git clone https://github.com/tonydub/md-url-extractor.git
cd md-url-extractor
cargo build --release
The compiled binary will be available at target/release/md-url-extractor.
cargo install md-url-extractor
md-url-extractor [OPTIONS] <input_dir>
- <input_dir>: Directory containing Markdown files to scan
- -f, --format <FORMAT>: Output format [default: stdout] [possible values: stdout, text, csv, html]
- -o, --output <FILE>: Output file path (required for text, csv, html formats)
- --domain <DOMAIN>: Filter URLs by domain
- --protocol <PROTOCOL>: Filter URLs by protocol [default: http https] [possible values: http, https, ftp, file, mailto]
- -h, --help: Show help information
- -V, --version: Show version information
Extract URLs from Markdown files and print to console:
md-url-extractor ~/Documents/notes/

Extract only GitHub links:
md-url-extractor ~/Documents/notes/ --domain github.com

Extract URLs and save to CSV with source information:
md-url-extractor ~/Documents/notes/ --format csv --output links.csv

Create a bookmarks file that can be imported into browsers:
md-url-extractor ~/Documents/notes/ --format html --output bookmarks.html

Extract only HTTPS links:
md-url-extractor ~/Documents/notes/ --protocol https

The following parameters are automatically removed from URLs:
- utm_* parameters (utm_source, utm_medium, utm_campaign, etc.)
- Facebook click identifier (fbclid)
- Google click identifier (gclid)
- Microsoft click identifier (msclkid)
- Generic tracking parameters (ref, source)
Example:
- From: https://example.com/article?id=123&utm_source=twitter&utm_medium=social
- To: https://example.com/article?id=123
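
The cleaning rule above can be sketched in a few lines of Rust. This is an illustration only, assuming plain string handling with no percent-decoding and case-sensitive parameter names; `is_tracking_param` and `strip_tracking_params` are hypothetical names, not the tool's API.

```rust
/// Returns true for the parameter names listed above as tracking parameters.
fn is_tracking_param(name: &str) -> bool {
    name.starts_with("utm_")
        || matches!(name, "fbclid" | "gclid" | "msclkid" | "ref" | "source")
}

/// Removes tracking parameters from the query string of `url`.
/// Hypothetical helper: the fragment (if any) is preserved unchanged.
fn strip_tracking_params(url: &str) -> String {
    // Split off the fragment first so it can be re-attached later.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };
    // No query string means there is nothing to clean.
    let Some((base, query)) = without_fragment.split_once('?') else {
        return url.to_string();
    };
    // Keep every key=value pair whose key is not a tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let name = pair.split('=').next().unwrap_or("");
            !is_tracking_param(name)
        })
        .collect();

    let mut cleaned = base.to_string();
    if !kept.is_empty() {
        cleaned.push('?');
        cleaned.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        cleaned.push('#');
        cleaned.push_str(frag);
    }
    cleaned
}
```

Applied to the example URL above, this sketch keeps `id=123` and drops both `utm_` parameters. When every parameter is removed, the trailing `?` is dropped as well.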
All YouTube URL formats are converted to the standard format https://www.youtube.com/watch?v=VIDEO_ID.
Examples:
- https://youtu.be/dQw4w9WgXcQ → https://www.youtube.com/watch?v=dQw4w9WgXcQ
- https://youtube.com/embed/dQw4w9WgXcQ → https://www.youtube.com/watch?v=dQw4w9WgXcQ
- https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=120&feature=recommended → https://www.youtube.com/watch?v=dQw4w9WgXcQ
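
A rough sketch of how such a normalization might work, covering only the three URL shapes shown in the examples (short links, embeds, and watch URLs). The function names `youtube_video_id` and `normalize_youtube_url` are hypothetical, and other hosts such as mobile subdomains are deliberately left untouched in this simplification.

```rust
/// Extracts the video ID from a few common YouTube URL shapes.
/// Returns None for anything this sketch does not recognise.
fn youtube_video_id(url: &str) -> Option<String> {
    // Drop the scheme and an optional "www." prefix.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);

    // Separate the host from the path and query, ignoring any fragment.
    let (host, tail) = rest.split_once('/')?;
    let tail = tail.split('#').next().unwrap_or("");
    let (path, query) = tail.split_once('?').unwrap_or((tail, ""));

    let id = match host {
        // Short link: the ID is the first path segment.
        "youtu.be" => path.split('/').next()?,
        "youtube.com" => {
            if let Some(id) = path.strip_prefix("embed/") {
                // Embed link: the ID follows "embed/".
                id.split('/').next()?
            } else if path == "watch" {
                // Watch link: the ID is the value of the "v" parameter.
                query.split('&').find_map(|pair| pair.strip_prefix("v="))?
            } else {
                return None;
            }
        }
        _ => return None,
    };
    if id.is_empty() {
        None
    } else {
        Some(id.to_string())
    }
}

/// Rewrites recognised YouTube URLs to the canonical watch form;
/// any other URL is returned unchanged.
fn normalize_youtube_url(url: &str) -> String {
    match youtube_video_id(url) {
        Some(id) => format!("https://www.youtube.com/watch?v={id}"),
        None => url.to_string(),
    }
}
```

Because the canonical URL is rebuilt from the video ID alone, extra parameters such as `t` and `feature` disappear as a side effect.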
This tool is built with a clean architecture approach:
- Domain Layer: Core business logic for URL processing and cleaning
- Application Layer: Orchestrates the extraction and processing workflow
- Infrastructure Layer: Handles CLI arguments, file I/O, and output formatting
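
To make the layering concrete, here is a minimal sketch of how the three layers could be separated in Rust modules. All module, function, and type names are hypothetical and are not taken from the project's source; treating subdomains as matches for the domain filter is also an assumption.

```rust
// Domain layer: pure URL rules with no I/O.
mod domain {
    /// True if the URL's host is `domain` or one of its subdomains.
    /// Simplification: ports and user info in the authority are not handled.
    pub fn matches_domain(url: &str, domain: &str) -> bool {
        let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
        let host = without_scheme
            .split(|c: char| matches!(c, '/' | '?' | '#'))
            .next()
            .unwrap_or("");
        host == domain
            || host
                .strip_suffix(domain)
                .map_or(false, |prefix| prefix.ends_with('.'))
    }
}

// Application layer: orchestrates domain rules over a batch of URLs.
mod application {
    use super::domain;

    /// Applies the optional domain filter and returns the surviving URLs.
    pub fn select_urls<'a>(urls: &[&'a str], domain_filter: Option<&str>) -> Vec<&'a str> {
        urls.iter()
            .copied()
            .filter(|url| domain_filter.map_or(true, |d| domain::matches_domain(url, d)))
            .collect()
    }
}

// Infrastructure layer: output formatting behind a trait, so that new
// formats can be added without touching the layers above.
mod infrastructure {
    pub trait Formatter {
        fn format(&self, urls: &[&str]) -> String;
    }

    /// One URL per line, as in the plain text output.
    pub struct PlainText;

    impl Formatter for PlainText {
        fn format(&self, urls: &[&str]) -> String {
            urls.join("\n")
        }
    }
}
```

In this arrangement the domain module knows nothing about files or the command line, which keeps the URL rules easy to test in isolation.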
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/your-feature-name)
- Commit your changes (git commit -m 'Add some feature')
- Push to the branch (git push origin feature/your-feature-name)
- Open a Pull Request
- Support for additional URL cleaning rules
- Option to check if links are valid
- Metadata extraction from link destinations
- Support for additional output formats
This project is licensed under the MIT License - see the LICENSE file for details.