The following project is part of UDACITY - Data Engineer Nanodegree. It consist in creating an ETL from data sources stored in JSON to a AWS Redshift database.
A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
The project consist in building an ETL pipeline that extracts their data from S3, stages them in Redshift, and transforms data into a set of dimensional tables
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset you'll be working with are partitioned by year and month.
- Create a credentials.cfg with your AWS User KEY and SECRET. The structure of the file should be similar to the credentials_empty.cfg file found under utils folder. The AWS user should have permissions to create other users and assign policies)
- Execute create_enviroment.py to create a user to access S3 and Redshift, a Redshift cluster, and to open the TCP port. If you wish to change any of the confifurations of the cluster you can modify dwh_original.cfg
- Execute create_tables.py to drop tables if they already exist, and/or create new tables.
- Execute etl.py to run the etl.
- Once you finish, you can run delete_environment.py (CAUTION! It will delete the redshift cluster and all its data).
- SONGPLAYS - records in log data associated with song plays i.e. records with page NextSong - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- USERS - users in the app - user_id, first_name, last_name, gender, level
- SONGS - songs in music database - song_id, title, artist_id, year, duration
- ARTISTS - artists in music database - artist_id, name, location, latitude, longitude
- TIME - timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
Select b.user_id, u.first_name || u.last_name as name_Surname, b.activity
FROM (SELECT user_id, count(user_id) as activity
FROM songplays
GROUP BY user_id
ORDER BY 2 desc
LIMIT 10) b
LEFT JOIN users u
ON u.user_id = b.user_id
order by 3 desc
user_id | name_surname | activity |
---|---|---|
49 | ChloeCuevas | 772 |
80 | TeganLevine | 740 |
97 | KateHarrell | 595 |
15 | LilyKoch | 495 |
44 | AleenaKirby | 439 |
29 | JacquelineLynch | 381 |
24 | LaylaGriffin | 349 |
88 | MohammadRodriguez | 312 |
73 | JacobKlein | 305 |
36 | MatthewJones | 280 |
Select s.title, a.listend_times
FROM (SELECT song_id, count(song_id) as listend_times
FROM songplays p
GROUP BY song_id
ORDER BY 2 desc
LIMIT 1) a
LEFT JOIN songs s
ON a.song_id = s.song_id
title | listend_times |
---|---|
You're The One | 37 |