
Sparkify AWS Redshift ETL

The following project is part of the Udacity Data Engineer Nanodegree. It consists of building an ETL pipeline from data sources stored in JSON to an AWS Redshift database.

Project Description

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory with JSON metadata on the songs in the app.

The project consists of building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables.
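At a high level the pipeline follows the usual two-phase Redshift pattern: COPY the raw JSON from S3 into staging tables, then INSERT ... SELECT from the staging tables into the dimensional model. The sketch below illustrates that flow with psycopg2; the staging table names, config file name, and config keys are illustrative assumptions rather than the exact ones used by the scripts in this repository.

import configparser
import psycopg2

# Illustrative statement lists -- the real SQL lives in the project itself.
copy_table_queries = [
    "COPY staging_events FROM 's3://<log-data-path>' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto';",
    "COPY staging_songs  FROM 's3://<song-data-path>' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto';",
]
insert_table_queries = [
    "INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) SELECT ...;",
]

def main():
    # Assumed: the cluster endpoint and database credentials are read from a
    # config file (file name and section/key names here are placeholders).
    config = configparser.ConfigParser()
    config.read("dwh_original.cfg")

    conn = psycopg2.connect(
        host=config["CLUSTER"]["HOST"],
        dbname=config["CLUSTER"]["DB_NAME"],
        user=config["CLUSTER"]["DB_USER"],
        password=config["CLUSTER"]["DB_PASSWORD"],
        port=config["CLUSTER"]["DB_PORT"],
    )
    cur = conn.cursor()

    for query in copy_table_queries:     # stage the raw JSON from S3
        cur.execute(query)
        conn.commit()
    for query in insert_table_queries:   # transform staging data into the star schema
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()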

Data sources

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset you'll be working with are partitioned by year and month.
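Because both datasets are plain JSON files under S3 prefixes, the staging step can be a single Redshift COPY per dataset. Below is a minimal, hedged example of loading the log files into a staging table; the bucket, table, role, region, and endpoint values are placeholders, and the project may use a JSONPaths file instead of JSON 'auto' to map the log fields.

import psycopg2

# Placeholders: substitute the real bucket/prefix, IAM role ARN, region, and endpoint.
STAGING_EVENTS_COPY = """
    COPY staging_events
    FROM 's3://<your-log-bucket>/log_data'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    REGION '<bucket-region>'
    FORMAT AS JSON 'auto';
"""

conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=<db> user=<user> password=<password> port=5439"
)
with conn, conn.cursor() as cur:
    cur.execute(STAGING_EVENTS_COPY)   # Redshift ingests every file under the prefix in one pass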

Execution Instructions

  1. Create a credentials.cfg file with your AWS user KEY and SECRET. The structure of the file should be similar to the credentials_empty.cfg file found under the utils folder (see the sketch after this list). The AWS user should have permissions to create other users and assign policies.
  2. Execute create_enviroment.py to create a user with access to S3 and Redshift, create a Redshift cluster, and open the TCP port. If you wish to change any of the configurations of the cluster, you can modify dwh_original.cfg.
  3. Execute create_tables.py to drop tables if they already exist, and/or create new tables.
  4. Execute etl.py to run the ETL.
  5. Once you finish, you can run delete_environment.py (CAUTION: this will delete the Redshift cluster and all its data).
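The sketch below shows the kind of setup steps 1 and 2 imply: read the key and secret from credentials.cfg and use them to build the boto3 clients that create the IAM user, the Redshift cluster, and the security-group rule. The section/key names, region, and cluster parameters are assumptions for illustration; check credentials_empty.cfg and dwh_original.cfg for the values the scripts actually expect.

import configparser
import boto3

# Assumed layout -- see credentials_empty.cfg under utils for the real section/key names.
config = configparser.ConfigParser()
config.read("credentials.cfg")
key = config["AWS"]["KEY"]
secret = config["AWS"]["SECRET"]

# Clients used for the environment setup: IAM for the ETL user, Redshift for the
# cluster, EC2 to open the TCP port on the cluster's security group.
region = "us-west-2"   # assumption; use the region of your own account and buckets
iam = boto3.client("iam", region_name=region,
                   aws_access_key_id=key, aws_secret_access_key=secret)
redshift = boto3.client("redshift", region_name=region,
                        aws_access_key_id=key, aws_secret_access_key=secret)
ec2 = boto3.resource("ec2", region_name=region,
                     aws_access_key_id=key, aws_secret_access_key=secret)

# Example of the cluster-creation call (parameter values are placeholders,
# not the project's defaults -- those live in dwh_original.cfg):
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NumberOfNodes=4,
    NodeType="dc2.large",
    DBName="sparkify",
    MasterUsername="<user>",
    MasterUserPassword="<password>",
    IamRoles=["<redshift-iam-role-arn>"],
)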

Output tables

Fact Table

  • SONGPLAYS - records in log data associated with song plays, i.e. records with page NextSong - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent (see the DDL sketch below)

Dimension Tables

  • USERS - users in the app - user_id, first_name, last_name, gender, level
  • SONGS - songs in music database - song_id, title, artist_id, year, duration
  • ARTISTS - artists in music database - artist_id, name, location, latitude, longitude
  • TIME - timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
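
For reference, the SONGPLAYS fact table above maps to a CREATE TABLE along these lines. This is a hedged sketch: the column types, constraints, and distribution/sort keys are assumptions, and the authoritative DDL is the one defined in the project's SQL statements.

# Hedged sketch of the fact-table DDL; IDENTITY generates the surrogate key.
SONGPLAY_TABLE_CREATE = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""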

Sample queries

Top 10 Active users

SELECT b.user_id, u.first_name || u.last_name AS name_surname, b.activity
FROM (SELECT user_id, COUNT(user_id) AS activity
        FROM songplays
        GROUP BY user_id
        ORDER BY 2 DESC
        LIMIT 10) b
LEFT JOIN users u
ON u.user_id = b.user_id
ORDER BY 3 DESC;
user_id | name_surname      | activity
--------+-------------------+---------
49      | ChloeCuevas       | 772
80      | TeganLevine       | 740
97      | KateHarrell       | 595
15      | LilyKoch          | 495
44      | AleenaKirby       | 439
29      | JacquelineLynch   | 381
24      | LaylaGriffin      | 349
88      | MohammadRodriguez | 312
73      | JacobKlein        | 305
36      | MatthewJones      | 280

Most played song

SELECT s.title, a.listened_times
FROM (SELECT song_id, COUNT(song_id) AS listened_times
        FROM songplays
        GROUP BY song_id
        ORDER BY 2 DESC
        LIMIT 1) a
LEFT JOIN songs s
ON a.song_id = s.song_id;

title          | listened_times
---------------+---------------
You're The One | 37
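
Either query can be run against the cluster from Python once etl.py has finished. Below is a minimal, hedged example (the query shown is a variant of the most-played-song query above, and the connection parameters are placeholders for your own cluster's endpoint and credentials).

import psycopg2

TOP_SONG_QUERY = """
    SELECT s.title, COUNT(*) AS listened_times
    FROM songplays p
    JOIN songs s ON s.song_id = p.song_id
    GROUP BY s.title
    ORDER BY listened_times DESC
    LIMIT 1;
"""

# Placeholders: use the endpoint and credentials of your own cluster.
conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=<db> user=<user> password=<password> port=5439"
)
with conn, conn.cursor() as cur:
    cur.execute(TOP_SONG_QUERY)
    for row in cur.fetchall():
        print(row)
conn.close()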
