This is a cache of https://www.elastic.co/search-labs/blog/spotify-wrapped-part-02. It is a snapshot of the page at 2025-02-26T00:33:43.174+0000.
Spotify Wrapped part 2: Diving deeper into the data - Elasticsearch Labs

Spotify Wrapped part 2: Diving deeper into the data

We will dive deeper into your Spotify data than ever before and explore connections you didn't even know existed.

In the first part of this series, written by Iulia Feroli, we talked about how to get your Spotify Wrapped data and visualize it in Kibana. In part 2, we're diving deeper into the data to see what else we can find out. To do this we're going to leverage a bit of a different approach and use Spotify to Elasticsearch to index the data into Elasticsearch. This tool is a bit more advanced and requires a bit more setup, but it is worth it. The data is more structured and we can ask more complex questions.

What is different from the first approach?

In the first blog we used the Spotify export directly and didn't perform any normalisation tasks, or any other data processing. This time we will use the same data but we will perform some data processing to make the data more usable. This will allow us to answer much more complex questions such as:

  • What is the average duration of a song in my top 100?
  • What is the average popularity of a song in my top 100?
  • What is the median listening duration to a song?
  • What is my most skipped tracked?
  • When do I like to skip tracks?
  • Am I listening to a particular hour of the day more than others?
  • Am I listening to a particular day of the week more than others?
  • Is a month of particular interest?
  • What is the artist with the longest listening time?

Spotify wrapped is a fun experience every year showing you what you listened to this year. It does not give you year over the year changes and thus you might miss some artists that were once in your top 10, but now have vanished.

Data processing

There is a large difference in the way we process the data in the first and the second post. If you want to keep working with the data from the first post, you will need to account for some field name changes, as well as need to revert to ES|QL to do certain extractions like hour of day on the fly.

Nonetheless, you all should be able to follow this post. The data processing is done in the Spotify to Elasticsearch repository involves asking the Spotify API for the duration of the song, popularity and also renames and enhances some fields. For example the artist field in the Spotify export itself is just a String and does not represent features or multi-artist tracks

Dashboarding

I created a dashboard in Kibana to visualize the data. The dashboard is available here and you can import it into your Kibana instance. The dashboard is quite extensive and answers many of the above questions.

Let's get into some of the questions and how to answer them together!

What is the average duration of a song in my top 100?

To answer this question we can use Lens, or ES|QL. Let's explore all three options. Let's phrase this question correctly in an Elasticsearch manner. We want to find the top 100 songs and then calculate the average duration of all of those songs combined. In Elasticsearch terms that would be two aggregations:

  1. Figure out the top 100 songs
  2. Calculate the average duration of those 100 songs.

Lens

In Lens this is rather simple, create a new Lens, switch to a table and drag and drop the title field into the table. Then click on the title field and set the size to 100, as well as set accuracy mode. Then drag and drop the duration field into the table and use last value, because we really only need the last value of each of the songs duration. The same song will only have one duration. In the bottom of this last value aggregation is a dropdown for a summary row, select average and it will show it to you.

ES|QL

ES|QL is a pretty fresh language compared to DSL & aggregations, but it is very powerful and easy to use. To answer the same question in ES|QL you would write the following query:

Let me take you step through step of this ES|QL query:

  1. from spotify-history - This is the index pattern we are using.
  2. stats duration=max(duration), count=count() by title - This is the first aggregation, we are calculating the maximum duration of each song and the count of each song. We use max instead of last value as used in the Lens, that is because ES|QL right now does not have a first or last.
  3. sort count desc - We sort the songs by the count of each song, so the most listened to song is on top.
  4. limit 100 - We limit the result to the top 100 songs.
  5. stats Average duration of the songs=avg(duration) - We calculate the average duration of the songs.

Is a month of particular interest to me?

To answer this question we can use Lens with the help of runtime field and ES|QL. What do we notice straight away, that there is no field in the data that denotes the month directly, instead we need to calculate it from the @timestamp field. There are multiple ways to do this:

  1. Use a runtime field, to power the Lens
  2. ES|QL

I personally think that ES|QL is the neater and quicker solution.

That's it, nothing fancy needed to do, we can leverage the DATE_EXTRACT function to extract the month from the @timestamp field and then we can aggregate on it. Using the ES|QL visualisation we can drop that onto the dashboard.

What is my listening duration per artist per year?

The idea behind that is to see if an artist is just a one-time thing or if there is a reoccurrence. If I remember correctly, Spotify only shows you the top 5 artists in the yearly wrapped. Maybe your number 6 artist stays the same all the time, or they heavily change after the 10th position?

One of the simplest representation of this is a percentage bar chart. We can use Lens for this. Follow the steps along:

Drag and drop the listened_to_ms field. This field represents how long you listened to a song in milliseconds. Now per default Lens will create a median aggregation, we do not want that, alter that to a sum. In the top select percentage instead of stacked for the bar chart type. For the breakdown select artist and say top 10. In the Advanced dropdown don't forget to select accuracy mode. Now every color block represents how much you listened to this single artist. Depending on your timepicker the bars might represent values from days, to weeks, to months, to years. If you want a weekly breakdown, select the @timestamp and set the mininum interval to year. Now what we can tell in my case is that Fred Again.. is the artist I listened to most, nearly 12% of my total listening time was consumed by Fred Again... We also see that Fred Again.. dropped a bit in 2024, but Jamie XX grew largely. If we compare just the size of the bars. We can also tell that whilst Billie Eilish is constantly being played in 2024 the bar widthend. This means that I listened to Billie Eilish more in 2024 than in 2023.

What about the top tracks per artist per listening time versus overall listening time?

That's a mouthfull of a question. Let me try to explain what I want to say with that. Spotify tells you about the top song from a single artist, or your overall 5 top songs. Well, that's definitely interesting, but what about the breakdown of an artist? Is all my time consumed just by a single song that I play over and over again, or is that evenly distributed?

Create a new lens and select Treemap as type. For the metric, same as before: select sum and use listened_to_ms as the field. For the group by we need two values. The first one is artist and then add a second one with title. The intermediate result looks like this:

Let's change that to top 100 artists and deselect the other in the advanced dropdown, as well as enable accuracy mode. For title change that to top 10 and enable accuracy mode. The final result looks like this:

What does this tell us now exactly? Without looking at any time component, we can tell that over all my listening history with Spotify, I spent 5.67% listening to Fred Again... In particularly I spent 1.21% of that time, listening to Delilah (pull me out of this). It is interesting to see, if there is a single song that occupies an artist, or if there are other songs as well. The treemap itself is a nice form to represent such data distributions.

Do I listen on a particular hour and day?

Well, that we can answer super simple with a Lens visualisation leveraging the Heat Map. Create a new Lens, select Heat Map. For the Horizontal Axis select dayOfWeek field and set it to Top 7 instead of Top 3. For the Vertical Axis select the hourOfDay and for Cell Value just a simple Count of records. Now this will produce this panel:

There are a couple of annoying things around this Lens, that just disturb me when interpreting. Let's try and clean it up a bit. First of all, I don't care about the legend too much, use the symbol in the top with the triangle, square, circle and disable it.

Now the 2nd part that is annoying is the sorting of the days. It's Monday, Wednesday, Thursday, or anything else, depending on the values you have. The hourOfDay is correctly sorted. The way to sort the days is a funny hack and that is called to use Filters instead of Top Values. Click on dayOfWeek and select Filters, it should now look like this:

Now just start typing the days. One filter for each day. "dayOfWeek" : Monday and give it the label Monday and rinse and repeat.

One caveat in all of this though is, that Spotify provides the data in UTC+0 without any timezone information. Sure, they also provide the IP address and the country where you listened to and we could infere the timezone information from that, but that can be wonky and for countries like the U.S. that has multiple timezones, it can be too much of a hassle. This is important because Elasticsearch and Kibana have timezone support and by providing the correct timezone in the @timestamp field, Kibana would automatically adjust the time to your browser time.

It should look like this when finalized, and we can tell that I am a very active listener during working hours and less so on Saturdays and Sundays.

Conclusion

In this blog we dove a bit deeper into the intricacies that the Spotify data offers. We showed a few simple and quick ways to get some visualizations up and running. It is simply amazing to have this much control over your own listening history. Stay tuned for more follow up blog posts!

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.

Related content

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as your are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself