
Data Modeling with PostgreSQL

Project 1: Udacity Data Engineering Nanodegree

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

They'd like a Postgres database with tables designed to optimize queries for song play analysis (an OLAP workload). The project involves:

  • Data modeling with Postgres
  • Designing a database schema with fact and dimension tables (star schema)
  • Building an ETL pipeline in Python

Project Datasets

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

{"artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_location": "", "artist_longitude": null, "artist_name": "Marc Shaiman", "duration": 149.86404, "num_songs": 1, "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "year": 2008}
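
In etl.ipynb / etl.py, a record like this might be read with pandas and sliced into the columns needed for the songs and artists tables. A minimal sketch; the file path is illustrative:

    import pandas as pd

    # Each song file holds a single JSON record; the path is illustrative.
    df = pd.read_json("data/song_data/A/A/A/TRAAAAW128F429D538.json", lines=True)

    # Columns destined for the songs and artists dimension tables (see schema below).
    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    artist_data = df[["artist_id", "artist_name", "artist_location",
                      "artist_latitude", "artist_longitude"]].values[0].tolist()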

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations, partitioned by year and month...

{"artist": "Sydney Youngblood", "auth": "Logged In", "firstName": "Jacob", "gender": "M", "itemInSession": 53, "lastName": "Klein", "length": 238.07955, "level": "paid", "location": "Tampa-St. Petersburg-Clearwater, FL", "method": "PUT", "page": "NextSong", "registration": 1540558108796.0, "sessionId": 954, "song": "Ain't No Sunshine", "status": 200, "ts": 1543449657796, "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"", "userId": "73"}
{"artist": "Gang Starr", "auth": "Logged In", "firstName": "Layla", "gender": "F", "itemInSession": 88, "lastName": "Griffin", "length": 151.92771, "level": "paid", "location": "Lake Havasu City-Kingman, AZ", "method": "PUT", "page": "NextSong", "registration": 1541057188796.0, "sessionId": 984, "song": "My Advice 2 You (Explicit)", "status": 200, "ts": 1543449690796, "userAgent": "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"", "userId": "24"}
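
The logs might be loaded and filtered along these lines. A minimal sketch; the path is illustrative:

    import pandas as pd

    # Log files are newline-delimited JSON; the path is illustrative.
    df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

    # Only 'NextSong' page events represent actual song plays.
    df = df[df["page"] == "NextSong"]

    # 'ts' is a Unix epoch timestamp in milliseconds.
    df["start_time"] = pd.to_datetime(df["ts"], unit="ms")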

Schema for Song Play Analysis

Fact Table

songplays - records in the log data associated with song plays, i.e. records with page NextSong

  • songplay_id
  • start_time
  • user_id
  • level
  • song_id
  • artist_id
  • session_id
  • location
  • user_agent
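
A minimal DDL sketch for this table in the style of sql_queries.py; the column types are assumptions, not the graded solution:

    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id SERIAL PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL,
            user_id     INT NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR,
            artist_id   VARCHAR,
            session_id  INT,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """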

Dimension Tables

users - users in the app

  • user_id
  • first_name
  • last_name
  • gender
  • level
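
A matching sketch for users; the ON CONFLICT upsert that refreshes level is an assumption about how to handle users switching between the free and paid tiers:

    user_table_create = """
        CREATE TABLE IF NOT EXISTS users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     VARCHAR,
            level      VARCHAR
        );
    """

    user_table_insert = """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
    """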

songs - songs in the music database

  • song_id
  • title
  • artist_id
  • year
  • duration

artists - artists in the music database

  • artist_id
  • name
  • location
  • latitude
  • longitude

time - timestamps of records in songplays broken down into specific units

  • start_time
  • hour
  • day
  • week
  • month
  • year
  • weekday
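
These units can be derived from the millisecond ts values with pandas datetime accessors. A minimal sketch using the first ts from the log excerpt above:

    import pandas as pd

    # Break a millisecond epoch timestamp into the units stored in the time table.
    t = pd.to_datetime(pd.Series([1543449657796]), unit="ms")
    time_df = pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,
        "week": t.dt.isocalendar().week,
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,
    })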

Project Template

The project workspace includes six files:

test.ipynb - displays the first few rows of each table.

create_tables.py - drops and creates tables.

etl.ipynb - reads and processes a single file from song_data and log_data and loads the data into your tables.

etl.py - reads and processes files from song_data and log_data and loads them into tables.

sql_queries.py - contains the SQL queries, and is imported by the last three files above (see the sketch after this list).

README.md - provides project description.
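
For orientation, create_tables.py and etl.py talk to Postgres through psycopg2 along these lines. A minimal sketch: the connection string uses the Udacity workspace defaults, the query text is illustrative, and the insert values come from the log excerpt above:

    import psycopg2

    # Udacity workspace defaults; adjust host/user/password for your environment.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Parameterized insert in the style of sql_queries.py. song_id/artist_id are
    # None when the log row matches nothing in the songs/artists tables.
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
    """
    cur.execute(songplay_table_insert,
                ("2018-11-29 00:00:57", 73, "paid", None, None,
                 954, "Tampa-St. Petersburg-Clearwater, FL", "Mozilla/5.0"))
    conn.commit()
    conn.close()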

End-to-End Execution Process

Follow the steps below:

  1. Execute: sql_queries.py | Verify no errors
  2. Execute: create_tables.py | Verify no errors
  3. Execute: etl.py | Verify no errors
  4. Execute: test.ipynb | Verify the data has been loaded successfully (see the check below)
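
Step 4 can also be sanity-checked from a Python shell by counting rows in each table. A quick sketch, with the same connection assumptions as above:

    import psycopg2

    # Row-count check after etl.py has run (connection defaults as above).
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    for table in ("songplays", "users", "songs", "artists", "time"):
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(table, cur.fetchone()[0])
    conn.close()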
