
Sparkify AWS Redshift ETL

The following project is part of the Udacity Data Engineer Nanodegree. It consists of building an ETL pipeline from data sources stored in JSON to an AWS Redshift database.

Project Description

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory with JSON metadata on the songs in the app.

The project consists of building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.
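At a high level the pipeline follows the usual two-phase Redshift pattern: COPY the raw JSON from S3 into staging tables, then INSERT ... SELECT from the staging tables into the dimensional model. The sketch below illustrates that flow with psycopg2; the staging table names, config file name, and config keys are illustrative assumptions rather than the exact ones used by the scripts in this repository.

import configparser
import psycopg2

# Illustrative statement lists -- the real SQL lives in the project itself.
copy_table_queries = [
    "COPY staging_events FROM 's3://<log-data-path>' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto';",
    "COPY staging_songs  FROM 's3://<song-data-path>' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto';",
]
insert_table_queries = [
    "INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) SELECT ...;",
]

def main():
    # Assumed: the cluster endpoint and database credentials are read from a
    # config file (file name and section/key names here are placeholders).
    config = configparser.ConfigParser()
    config.read("dwh_original.cfg")

    conn = psycopg2.connect(
        host=config["CLUSTER"]["HOST"],
        dbname=config["CLUSTER"]["DB_NAME"],
        user=config["CLUSTER"]["DB_USER"],
        password=config["CLUSTER"]["DB_PASSWORD"],
        port=config["CLUSTER"]["DB_PORT"],
    )
    cur = conn.cursor()

    for query in copy_table_queries:     # stage the raw JSON from S3
        cur.execute(query)
        conn.commit()
    for query in insert_table_queries:   # transform staging data into the star schema
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()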

Data sources

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset you'll be working with are partitioned by year and month.
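Because both datasets are plain JSON files under S3 prefixes, the staging step can be a single Redshift COPY per dataset. Below is a minimal, hedged example of loading the log files into a staging table; the bucket, table, role, region, and endpoint values are placeholders, and the project may use a JSONPaths file instead of JSON 'auto' to map the log fields.

import psycopg2

# Placeholders: substitute the real bucket/prefix, IAM role ARN, region, and endpoint.
STAGING_EVENTS_COPY = """
    COPY staging_events
    FROM 's3://<your-log-bucket>/log_data'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    REGION '<bucket-region>'
    FORMAT AS JSON 'auto';
"""

conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=<db> user=<user> password=<password> port=5439"
)
with conn, conn.cursor() as cur:
    cur.execute(STAGING_EVENTS_COPY)   # Redshift ingests every file under the prefix in one pass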

Execution Instructions

  1. Create a credentials.cfg file with your AWS user KEY and SECRET. The structure of the file should be similar to the credentials_empty.cfg file found under the utils folder (see the sketch after this list). The AWS user should have permissions to create other users and assign policies.
  2. Execute create_enviroment.py to create a user with access to S3 and Redshift, create a Redshift cluster, and open the TCP port. If you wish to change any of the configurations of the cluster, you can modify dwh_original.cfg.
  3. Execute create_tables.py to drop tables if they already exist, and/or create new tables.
  4. Execute etl.py to run the ETL.
  5. Once you finish, you can run delete_environment.py (CAUTION: this will delete the Redshift cluster and all its data).
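The sketch below shows the kind of setup steps 1 and 2 imply: read the key and secret from credentials.cfg and use them to build the boto3 clients that create the IAM user, the Redshift cluster, and the security-group rule. The section/key names, region, and cluster parameters are assumptions for illustration; check credentials_empty.cfg and dwh_original.cfg for the values the scripts actually expect.

import configparser
import boto3

# Assumed layout -- see credentials_empty.cfg under utils for the real section/key names.
config = configparser.ConfigParser()
config.read("credentials.cfg")
key = config["AWS"]["KEY"]
secret = config["AWS"]["SECRET"]

# Clients used for the environment setup: IAM for the ETL user, Redshift for the
# cluster, EC2 to open the TCP port on the cluster's security group.
region = "us-west-2"   # assumption; use the region of your own account and buckets
iam = boto3.client("iam", region_name=region,
                   aws_access_key_id=key, aws_secret_access_key=secret)
redshift = boto3.client("redshift", region_name=region,
                        aws_access_key_id=key, aws_secret_access_key=secret)
ec2 = boto3.resource("ec2", region_name=region,
                     aws_access_key_id=key, aws_secret_access_key=secret)

# Example of the cluster-creation call (parameter values are placeholders,
# not the project's defaults -- those live in dwh_original.cfg):
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NumberOfNodes=4,
    NodeType="dc2.large",
    DBName="sparkify",
    MasterUsername="<user>",
    MasterUserPassword="<password>",
    IamRoles=["<redshift-iam-role-arn>"],
)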

Output tables

Fact Table

  • SONGPLAYS - records in log data associated with song plays, i.e. records with page NextSong - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent (see the DDL sketch below)

Dimension Tables

  • USERS - users in the app - user_id, first_name, last_name, gender, level
  • SONGS - songs in music database - song_id, title, artist_id, year, duration
  • ARTISTS - artists in music database - artist_id, name, location, latitude, longitude
  • TIME - timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
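
For reference, the SONGPLAYS fact table above maps to a CREATE TABLE along these lines. This is a hedged sketch: the column types, constraints, and distribution/sort keys are assumptions, and the authoritative DDL is the one defined in the project's SQL statements.

# Hedged sketch of the fact-table DDL; IDENTITY generates the surrogate key.
SONGPLAY_TABLE_CREATE = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""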

Sample queries

Top 10 Active users

SELECT b.user_id, u.first_name || u.last_name AS name_surname, b.activity
FROM (SELECT user_id, COUNT(user_id) AS activity
        FROM songplays
        GROUP BY user_id
        ORDER BY 2 DESC
        LIMIT 10) b
LEFT JOIN users u
ON u.user_id = b.user_id
ORDER BY 3 DESC;
user_id | name_surname      | activity
--------+-------------------+---------
49      | ChloeCuevas       | 772
80      | TeganLevine       | 740
97      | KateHarrell       | 595
15      | LilyKoch          | 495
44      | AleenaKirby       | 439
29      | JacquelineLynch   | 381
24      | LaylaGriffin      | 349
88      | MohammadRodriguez | 312
73      | JacobKlein        | 305
36      | MatthewJones      | 280

Most played song

SELECT s.title, a.listened_times
FROM (SELECT song_id, COUNT(song_id) AS listened_times
        FROM songplays
        GROUP BY song_id
        ORDER BY 2 DESC
        LIMIT 1) a
LEFT JOIN songs s
ON a.song_id = s.song_id;

title          | listened_times
---------------+---------------
You're The One | 37
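
Either query can be run against the cluster from Python once etl.py has finished. Below is a minimal, hedged example (the query shown is a variant of the most-played-song query above, and the connection parameters are placeholders for your own cluster's endpoint and credentials).

import psycopg2

TOP_SONG_QUERY = """
    SELECT s.title, COUNT(*) AS listened_times
    FROM songplays p
    JOIN songs s ON s.song_id = p.song_id
    GROUP BY s.title
    ORDER BY listened_times DESC
    LIMIT 1;
"""

# Placeholders: use the endpoint and credentials of your own cluster.
conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=<db> user=<user> password=<password> port=5439"
)
with conn, conn.cursor() as cur:
    cur.execute(TOP_SONG_QUERY)
    for row in cur.fetchall():
        print(row)
conn.close()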
