
Data Modeling with PostgreSQL

Project 1: Udacity Data Engineering Nanodegree

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

They'd like a Postgres database with tables designed to optimize queries for song play analysis (an OLAP workload). The project involves:

  • Data modeling with Postgres
  • Designing a database schema with fact and dimension tables (star schema)
  • Building an ETL pipeline in Python

Project Datasets

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

{"artist_id": "AR8IEZO1187B99055E", "artist_latitude": null, "artist_location": "", "artist_longitude": null, "artist_name": "Marc Shaiman", "duration": 149.86404, "num_songs": 1, "song_id": "SOINLJW12A8C13314C", "title": "City Slickers", "year": 2008}
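
In etl.ipynb / etl.py, a record like this might be read with pandas and sliced into the columns needed for the songs and artists tables. A minimal sketch; the file path is illustrative:

    import pandas as pd

    # Each song file holds a single JSON record; the path is illustrative.
    df = pd.read_json("data/song_data/A/A/A/TRAAAAW128F429D538.json", lines=True)

    # Columns destined for the songs and artists dimension tables (see schema below).
    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    artist_data = df[["artist_id", "artist_name", "artist_location",
                      "artist_latitude", "artist_longitude"]].values[0].tolist()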

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations, partitioned by year and month...

{"artist": "Sydney Youngblood", "auth": "Logged In", "firstName": "Jacob", "gender": "M", "itemInSession": 53, "lastName": "Klein", "length": 238.07955, "level": "paid", "location": "Tampa-St. Petersburg-Clearwater, FL", "method": "PUT", "page": "NextSong", "registration": 1540558108796.0, "sessionId": 954, "song": "Ain't No Sunshine", "status": 200, "ts": 1543449657796, "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"", "userId": "73"}
{"artist": "Gang Starr", "auth": "Logged In", "firstName": "Layla", "gender": "F", "itemInSession": 88, "lastName": "Griffin", "length": 151.92771, "level": "paid", "location": "Lake Havasu City-Kingman, AZ", "method": "PUT", "page": "NextSong", "registration": 1541057188796.0, "sessionId": 984, "song": "My Advice 2 You (Explicit)", "status": 200, "ts": 1543449690796, "userAgent": "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"", "userId": "24"}
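
The logs might be loaded and filtered along these lines. A minimal sketch; the path is illustrative:

    import pandas as pd

    # Log files are newline-delimited JSON; the path is illustrative.
    df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

    # Only 'NextSong' page events represent actual song plays.
    df = df[df["page"] == "NextSong"]

    # 'ts' is a Unix epoch timestamp in milliseconds.
    df["start_time"] = pd.to_datetime(df["ts"], unit="ms")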

Schema for Song Play Analysis

Fact Table

songplays - records in the log data associated with song plays, i.e. records with page NextSong

  • songplay_id
  • start_time
  • user_id
  • level
  • song_id
  • artist_id
  • session_id
  • location
  • user_agent
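
A minimal DDL sketch for this table in the style of sql_queries.py; the column types are assumptions, not the graded solution:

    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id SERIAL PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL,
            user_id     INT NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR,
            artist_id   VARCHAR,
            session_id  INT,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """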

Dimension Tables

users - users in the app

  • user_id
  • first_name
  • last_name
  • gender
  • level
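
A matching sketch for users; the ON CONFLICT upsert that refreshes level is an assumption about how to handle users switching between the free and paid tiers:

    user_table_create = """
        CREATE TABLE IF NOT EXISTS users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     VARCHAR,
            level      VARCHAR
        );
    """

    user_table_insert = """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
    """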

songs - songs in the music database

  • song_id
  • title
  • artist_id
  • year
  • duration

artists - artists in the music database

  • artist_id
  • name
  • location
  • latitude
  • longitude

time - timestamps of records in songplays broken down into specific units

  • start_time
  • hour
  • day
  • week
  • month
  • year
  • weekday
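
These units can be derived from the millisecond ts values with pandas datetime accessors. A minimal sketch using the first ts from the log excerpt above:

    import pandas as pd

    # Break a millisecond epoch timestamp into the units stored in the time table.
    t = pd.to_datetime(pd.Series([1543449657796]), unit="ms")
    time_df = pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,
        "week": t.dt.isocalendar().week,
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,
    })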

Project Template

The project workspace includes six files:

test.ipynb - displays the first few rows of each table.

create_tables.py - drops and creates tables.

etl.ipynb - reads and processes a single file from song_data and log_data and loads the data into your tables.

etl.py - reads and processes files from song_data and log_data and loads them into tables.

sql_queries.py - contains the SQL queries, and is imported by the last three files above (see the sketch after this list).

README.md - provides project description.
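
For orientation, create_tables.py and etl.py talk to Postgres through psycopg2 along these lines. A minimal sketch: the connection string uses the Udacity workspace defaults, the query text is illustrative, and the insert values come from the log excerpt above:

    import psycopg2

    # Udacity workspace defaults; adjust host/user/password for your environment.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Parameterized insert in the style of sql_queries.py. song_id/artist_id are
    # None when the log row matches nothing in the songs/artists tables.
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
    """
    cur.execute(songplay_table_insert,
                ("2018-11-29 00:00:57", 73, "paid", None, None,
                 954, "Tampa-St. Petersburg-Clearwater, FL", "Mozilla/5.0"))
    conn.commit()
    conn.close()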

End-to-End Execution Process

Follow the steps below:

  1. Execute: sql_queries.py | Verify no errors
  2. Execute: create_tables.py | Verify no errors
  3. Execute: etl.py | Verify no errors
  4. Execute: test.ipynb | Verify the data has been loaded successfully (see the check below)
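
Step 4 can also be sanity-checked from a Python shell by counting rows in each table. A quick sketch, with the same connection assumptions as above:

    import psycopg2

    # Row-count check after etl.py has run (connection defaults as above).
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    for table in ("songplays", "users", "songs", "artists", "time"):
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(table, cur.fetchone()[0])
    conn.close()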
