Playful Python Problems
Learning to use the CSV Module
I’ve been wanting to use the CSV Module for some projects at work, but have never had the time to learn it. When I also wanted to audit the release dates on some old Jazz albums in iTunes, I thought this might be a perfect opportunity to spend a couple hours reading, experimenting and putting the module to practical use.
iTunes has an ‘Export Playlist’ option, which creates a CSV file of song metadata so I’d have plenty of material to work with. The first line of the playlist file has the headers:
…while the remaining lines contain the data:
Armed with the weighty Python 3 Standard Library by Example by Doug Hellmann, I started in.
The CSV Module is easy enough: after importing it, you open your CSV file and then create an object from it’s contents. You have the option of reading the data in as a flat list, or as a dictionary. For what I wanted to do, a dictionary seemed ideal; so, I started with this (straight from Hellman’s example):
Uh-oh - I hit this right off:
I knew that ‘UTF-8’ was a unicode text thing, but didn’t know why this would have an error loading such a simple text file. I started with the Wikipedia article on UTF-8, which seemed straightforward enough. Further reading revealed that iTunes playlists are exported in UTF-16 which starts with a Byte Order Mark (BOM) that is incomprehensible in UTF-8. I also found furious debate over which encoding was ‘best’, and even a manifesto! At least I felt I was in good company with my confusion.
So, my playlist file had characters that were unreadable by the CSV module’s default UTF-8 encoding. After a quick search on Stack Overflow I found that I needed to use
encoding='utf-16' instead of the example’s ‘rt’ :
Better - no error, but the dictionary it created was less than useful:
Reading a bit further into the Standard Library by Example, I found that the CSV module assumes comma separated values (duh), so I needed to specify:
So each ‘entry’ is a dictionary object, with the keys defined by the initial line of the CSV file. As each dictionary uses the same set of keys, you can easily find any piece of data using something like
print(entry['Genre']). This gives us a nice list of Genres such as Alternative, Electronic, Pop &etc.
Great! We now have a way to print out all the songs in the Playlist, listing only the information that we’re interested in.
There’s a couple things here that are less than optimal:
1. Each Album is listed multiple times - once for each song
2. There aren’t any albums by Virginia Woolf, though there is one track of her speaking
I fixed the first problem by creating a new dictionary, then only adding the data from the CSV’s dictionary entries if it didn’t already exist. Like so:
The second problem is trickier, and actually has two parts to it.
Classical albums assume the ‘Artist’ is the performer, so we might want to see the Composer instead (so you can find it by ‘Bach’, rather than ‘Deutsches Filmorchester Babelsberg’).
Also, each song might list a different artist or combination of artists. The album metadata probably has the definitive version, but that info isn’t available in the playlist export from iTunes. My solution was to first create a list of all entries, then find the one used most frequently. As the differing names/combinations are true for the other categories as well, I wrote:
Now I just needed to print out the most frequent entry in each list.
I had no idea how to do that, so turned to Stack Overflow again. I learned that
max() returns the largest item in an iterable, so you can use:
… to find the most frequently used item in a list. I used the
most_common() method to clean up the dictionary as I printed it out:
This was super fun to work through. It seems data is always messy, so there are many twists and turns before you get to the clean results you’re looking for. The fun part of being a Python n00b is the delightful discoveries in even the most mundane exercises.
I’m using this more and more at work, and store most all my data in .csv format for easy retrieval.
All code is up on GitHub