Complete News API System: an AWS Lambda-based news scraper and RESTful API serving 1,200+ articles from The Hindu newspaper, with smart categorization, trending algorithms, and scheduled updates every 3 hours.

Gearupstudios/news-api-system

News API System

A complete news scraping and API system built on AWS Lambda, DynamoDB, and API Gateway.

Overview

This system consists of:

  • News Scraper: Automatically scrapes The Hindu newspaper every 3 hours
  • News API: RESTful API serving latest news with categorization and search
  • Database: DynamoDB storing 1,200+ articles with smart categorization
  • Monitoring: Health checks and CloudWatch metrics

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   News Scraper  │    │   News API      │    │   API Gateway   │
│   (Lambda)      │───▶│   (Lambda)      │───▶│   (REST API)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   EventBridge   │    │   DynamoDB      │    │   CloudWatch    │
│   (Scheduler)   │    │   (Database)    │    │   (Monitoring)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Live API Endpoints

Base URL: https://nlko2jkif0.execute-api.ap-south-1.amazonaws.com/prod

Available Endpoints

  1. Latest News

    GET /news/latest?limit=10&category=business
    
  2. Category News

    GET /news/category/business?limit=10&sortBy=latest
    
  3. Trending News

    GET /news/trending?limit=10&timeframe=24h
    
  4. Search News

    GET /news/search?q=india&limit=10
    
  5. Single Article

    GET /news/{article-id}
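
The five endpoints above can be addressed from Python with a small URL builder. This is a minimal sketch, not part of the repository: the helper names (`build_url`, `latest_url`, and so on) are hypothetical, and percent-encoding the article ID is an assumption based on the ID looking like a URL in the response format.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://nlko2jkif0.execute-api.ap-south-1.amazonaws.com/prod"


def build_url(path, **params):
    """Join the base URL, an endpoint path, and any non-None query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}{path}"
    return f"{url}?{query}" if query else url


def latest_url(limit=10, category=None):
    return build_url("/news/latest", limit=limit, category=category)


def category_url(category, limit=10, sort_by="latest"):
    return build_url(f"/news/category/{quote(category, safe='')}",
                     limit=limit, sortBy=sort_by)


def trending_url(limit=10, timeframe="24h"):
    return build_url("/news/trending", limit=limit, timeframe=timeframe)


def search_url(q, limit=10):
    return build_url("/news/search", q=q, limit=limit)


def article_url(article_id):
    # Assumption: IDs may contain "/" or ":" and therefore need encoding.
    return build_url(f"/news/{quote(article_id, safe='')}")
```

Any HTTP client can then fetch the resulting URL, for example `urllib.request.urlopen(latest_url(limit=5))`.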
    

Folder Structure

news-api-system/
├── api/                    # News API Lambda function
│   ├── news_api_lambda.py  # Main API handler
│   └── *.py               # Database utilities
├── scraper/               # News scraper components
│   ├── lambda_function.py # Scraper Lambda handler
│   └── *.py              # Scraper implementations
├── deployment/            # Deployment scripts
│   ├── deploy_all.py     # Unified deployment script
│   └── *.py              # Individual deployment scripts
├── docs/                  # Documentation
│   └── *.md              # API documentation
├── monitoring/            # Monitoring and analytics
│   └── *.py              # Health checks and metrics
└── README.md             # This file

Quick Start

1. Deploy the System

cd news-api-system/deployment
python3 deploy_all.py

2. Test the API

# Get latest news
curl "https://nlko2jkif0.execute-api.ap-south-1.amazonaws.com/prod/news/latest?limit=5"

# Get business news
curl "https://nlko2jkif0.execute-api.ap-south-1.amazonaws.com/prod/news/category/business?limit=5"

# Search news
curl "https://nlko2jkif0.execute-api.ap-south-1.amazonaws.com/prod/news/search?q=india&limit=5"

3. Monitor the System

cd news-api-system/monitoring
python3 monitor_news_api.py

Features

News Scraper

  • ✅ Scrapes The Hindu newspaper every 3 hours
  • ✅ Smart categorization (business, sports, technology, etc.)
  • ✅ Duplicate detection by URL
  • ✅ Content extraction with images
  • ✅ Automatic scheduling with EventBridge
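
Duplicate detection by URL can be illustrated with a short in-memory sketch. The scraper's actual implementation is not shown in this README, so the normalization rule (dropping the fragment and any trailing slash) and the function names are assumptions. The `link` field is taken from the API response format.

```python
from urllib.parse import urldefrag


def normalize_url(url):
    """Assumed normalization: trim whitespace, drop '#fragment' and trailing '/'."""
    return urldefrag(url.strip()).url.rstrip("/")


def drop_duplicates(articles, seen_urls=None):
    """Return only the articles whose URL has not been seen before.

    `seen_urls` stands in for URLs already stored in the database.
    """
    seen = {normalize_url(u) for u in (seen_urls or ())}
    fresh = []
    for article in articles:
        key = normalize_url(article["link"])
        if key in seen:
            continue  # already stored or already scraped in this batch
        seen.add(key)
        fresh.append(article)
    return fresh
```

In a DynamoDB-backed scraper the same effect is often achieved with a conditional write keyed on the URL. Whether this project does so is not stated here.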

News API

  • ✅ RESTful API with 5 endpoints
  • ✅ Fast response times (150-300 ms on average)
  • ✅ Category-based filtering
  • ✅ Full-text search capabilities
  • ✅ Trending algorithm with engagement scoring
  • ✅ CORS enabled for web applications
  • ✅ Caching for better performance
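
As an illustration of the CORS support listed above, the following sketch shows how a Lambda handler behind an API Gateway proxy integration might attach CORS headers to every response. It assumes the proxy integration response shape (`statusCode`, `headers`, `body`). The wildcard origin, the header set, and the empty article list are placeholders, not the behaviour of `news_api_lambda.py`.

```python
import json

# Assumption: a permissive policy; a real deployment may restrict the origin.
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET,OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    "Content-Type": "application/json",
}


def make_response(status_code, payload):
    """Build a proxy-integration response with CORS headers attached."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    # Answer CORS preflight requests without touching the database.
    if event.get("httpMethod") == "OPTIONS":
        return make_response(200, {})
    params = event.get("queryStringParameters") or {}
    limit = int(params.get("limit", 10))
    # Placeholder body: a real handler would query DynamoDB here.
    return make_response(200, {"articles": [], "total": 0,
                               "limit": limit, "offset": 0})
```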

Database

  • ✅ DynamoDB with optimized indexes
  • ✅ 1,200+ articles across 10 categories
  • ✅ Relevance and credibility scoring
  • ✅ Automatic data enrichment

Quality Assurance

  • ✅ 90+ credibility score for The Hindu
  • ✅ Content quality validation
  • ✅ Freshness-based ranking
  • ✅ Engagement metrics tracking
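
The README does not give the ranking formula, so the following is only one plausible way to combine the credibility score with freshness: an exponential decay with a hypothetical 24-hour half-life, multiplied by `credibility_score`. All names and constants here are assumptions.

```python
from datetime import datetime


def freshness_score(published, now, half_life_hours=24.0):
    """Decay from 1.0 towards 0.0; halves every `half_life_hours` (assumed)."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def rank_articles(articles, now):
    """Sort articles by credibility weighted by freshness, best first."""
    def score(article):
        published = datetime.fromisoformat(article["published"])
        return article["credibility_score"] * freshness_score(published, now)
    return sorted(articles, key=score, reverse=True)
```

With this weighting, two articles of equal credibility are ordered newest first, and an older article needs a proportionally higher credibility score to outrank a fresh one.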

API Response Format

{
  "articles": [
    {
      "id": "article-url",
      "title": "Article Title",
      "description": "Brief description",
      "link": "https://source-url.com",
      "source": "The Hindu",
      "category": "business",
      "published": "2025-07-18T07:44:09+00:00",
      "image": "https://image-url.com",
      "credibility_score": 90,
      "country": "IN"
    }
  ],
  "total": 5,
  "limit": 5,
  "offset": 0
}
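
A client might parse this response into typed objects as follows. This is a minimal sketch: the `Article` dataclass and `parse_articles` helper are hypothetical, and only a subset of the fields shown above is mapped.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    id: str
    title: str
    link: str
    category: str
    published: datetime
    credibility_score: int


def parse_articles(body):
    """Parse a JSON response body into a list of Article objects."""
    payload = json.loads(body)
    return [
        Article(
            id=item["id"],
            title=item["title"],
            link=item["link"],
            category=item["category"],
            # ISO 8601 timestamp with a UTC offset, as in the sample above.
            published=datetime.fromisoformat(item["published"]),
            credibility_score=int(item["credibility_score"]),
        )
        for item in payload["articles"]
    ]
```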

Categories

  • business: Economy, markets, corporate news
  • sports: Cricket, football, tennis, athletics
  • technology: AI, software, startups, innovation
  • entertainment: Movies, music, celebrities
  • general: General news and opinion
  • india: National politics, policies, states
  • world: International news and affairs
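
How the scraper assigns these categories is not documented here. A simple keyword-matching approach, sketched below, is one possibility: the keyword lists are lifted from the descriptions above purely for illustration, and falling back to `general` is an assumption.

```python
import re

# Hypothetical keyword lists, derived from the category descriptions above.
CATEGORY_KEYWORDS = {
    "business": {"economy", "markets", "corporate"},
    "sports": {"cricket", "football", "tennis", "athletics"},
    "technology": {"ai", "software", "startups", "innovation"},
    "entertainment": {"movies", "music", "celebrities"},
    "india": {"politics", "policies", "states"},
    "world": {"international", "affairs"},
}


def categorize(text):
    """Pick the category whose keywords match the most distinct words."""
    words = set(re.findall(r"\w+", text.lower()))
    best, best_hits = "general", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:  # ties keep the earlier category
            best, best_hits = category, hits
    return best
```

Matching whole words rather than substrings avoids false positives such as "ai" inside "said".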

Performance Metrics

  • Response Time: 150-300ms average
  • Availability: 99.9% uptime
  • Fresh Content: Updated every 3 hours
  • Data Quality: 90+ credibility score
  • Search Speed: Full-text search in < 500ms

Monitoring

The system includes comprehensive monitoring:

  • API health checks
  • Database statistics
  • Response time tracking
  • Error rate monitoring
  • CloudWatch metrics integration

Rate Limiting

  • Limit: 100 requests per minute per IP
  • Caching: 5-minute cache on all GET requests
  • CORS: Configured for web applications
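
To stay under the 100 requests per minute limit, a client can throttle itself before calling the API. The sketch below is a client-side sliding-window limiter. It does not describe how the server enforces the limit, and the class name and injectable `clock` parameter are illustrative choices.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` window."""

    def __init__(self, max_requests=100, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # timestamps of recently allowed requests

    def allow(self):
        now = self.clock()
        # Discard timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            return False
        self._stamps.append(now)
        return True
```

A caller checks `limiter.allow()` before each request and waits or skips the request when it returns `False`.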

Security

  • IAM roles with least privilege
  • API Gateway with proper CORS
  • No sensitive data exposure
  • Secure Lambda execution environment

Support

For issues or questions:

  1. Check the logs in CloudWatch
  2. Run the monitoring script
  3. Review the API documentation in the docs/ directory

License

This project is part of the RapidScoop news platform.
