Paul Vosper: Technical Artist

Playful Python Problems

Learning to use the CSV Module

I’ve been wanting to use the CSV Module for some projects at work, but have never had the time to learn it. When I also wanted to audit the release dates on some old Jazz albums in iTunes, I thought this might be a perfect opportunity to spend a couple hours reading, experimenting and putting the module to practical use.

iTunes has an ‘Export Playlist’ option, which creates a CSV file of song metadata so I’d have plenty of material to work with. The first line of the playlist file has the headers:

					    
> Name Artist Composer Album ...
					    
					  

…while the remaining lines contain the data:

					    
> Vladivostok	Robyn Miller		Little Potato (Original Soundtrack)	 ...
					    
					  

It’s tab separated values - not commas - but I figured that I could work with this.

Armed with the weighty Python 3 Standard Library by Example by Doug Hellmann, I started in.

The CSV Module is easy enough: after importing it, you open your CSV file and then create an object from it’s contents. You have the option of reading the data in as a flat list, or as a dictionary. For what I wanted to do, a dictionary seemed ideal; so, I started with this (straight from Hellman’s example):

					    
with open('playlist.txt', 'rt') as f:
	reader = csv.DictReader(f)
	for entry in reader:
		print(entry)
					    
					  

Uh-oh - I hit this right off:

					    
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
					    
					  

I knew that ‘UTF-8’ was a unicode text thing, but didn’t know why this would have an error loading such a simple text file. I started with the Wikipedia article on UTF-8, which seemed straightforward enough. Further reading revealed that iTunes playlists are exported in UTF-16 which starts with a Byte Order Mark (BOM) that is incomprehensible in UTF-8. I also found furious debate over which encoding was ‘best’, and even a manifesto! At least I felt I was in good company with my confusion.

So, my playlist file had characters that were unreadable by the CSV module’s default UTF-8 encoding. After a quick search on Stack Overflow I found that I needed to use encoding='utf-16' instead of the example’s ‘rt’ :

					    
with open('playlist.txt', encoding='utf-16') as f:
    reader = csv.DictReader(f)
    for entry in reader:
        print(entry)
					    
					  

Better - no error, but the dictionary it created was less than useful:

					    
OrderedDict([('Name\tArtist\tComposer\tAlbum\tGrouping\tWork\tMovement Number...
					    
					  

Reading a bit further into the Standard Library by Example, I found that the CSV module assumes comma separated values (duh), so I needed to specify:

					    
reader = csv.DictReader(f, dialect='excel-tab')

Much better results!

					    
OrderedDict([('Name', 'The Endless Battle of the Maudlin Ballade, Pt. 2 (Bonus Track)'), ('Artist', 'Adam Bryanbaum Wiltzie'), ...
					    
					  

So each ‘entry’ is a dictionary object, with the keys defined by the initial line of the CSV file. As each dictionary uses the same set of keys, you can easily find any piece of data using something like print(entry['Genre']). This gives us a nice list of Genres such as Alternative, Electronic, Pop &etc.

Great! We now have a way to print out all the songs in the Playlist, listing only the information that we’re interested in.

					    
print('{}: {} ({})'.format(entry['Album'], entry['Artist'], entry['Year']))

...
The Temple of I & I: Thievery Corporation (2017)
The Temple of I & I: Thievery Corporation (2017)
The Temple of I & I: Thievery Corporation (2017)
Three Worlds: Music from Woolf Works: Virginia Woolf (2017)
The World We Built: The Wild Reeds (2017)
The World We Built: The Wild Reeds (2017)
The World We Built: The Wild Reeds (2017)
...
					    
					  

There’s a couple things here that are less than optimal:

1. Each Album is listed multiple times - once for each song

2. There aren’t any albums by Virginia Woolf, though there is one track of her speaking

I fixed the first problem by creating a new dictionary, then only adding the data from the CSV’s dictionary entries if it didn’t already exist. Like so:

					    
if entry['Album'] in playlist_dictionary:
    pass
else:
    playlist_dictionary[entry['Album']] = (entry['Artist'], entry['Year'])
					    
					  

The second problem is trickier, and actually has two parts to it.

Classical albums assume the ‘Artist’ is the performer, so we might want to see the Composer instead (so you can find it by ‘Bach’, rather than ‘Deutsches Filmorchester Babelsberg’).

					    
album = entry['Album']
artist = entry['Artist']
performer = entry['Artist']
year = entry['Year']
if 'classical' in entry['Genre'].lower():
    artist = entry['Composer']
					    
					  

Also, each song might list a different artist or combination of artists. The album metadata probably has the definitive version, but that info isn’t available in the playlist export from iTunes. My solution was to first create a list of all entries, then find the one used most frequently. As the differing names/combinations are true for the other categories as well, I wrote:

					    
if album in playlist_dictionary:
    playlist_dictionary[album][0].append(artist)
    playlist_dictionary[album][1].append(performer)
    playlist_dictionary[album][2].append(year)
else:
    playlist_dictionary[album] = [[artist], [performer], [year]]
					    
					  

Now I just needed to print out the most frequent entry in each list.

I had no idea how to do that, so turned to Stack Overflow again. I learned that max() returns the largest item in an iterable, so you can use:

					    
def most_common(lst):
    return max(set(lst), key=lst.count)
					    
					  

… to find the most frequently used item in a list. I used the most_common() method to clean up the dictionary as I printed it out:

					    
for entry in playlist_dictionary:
    album = entry
    artist = most_common(playlist_dictionary[entry][0])
    performer = most_common(playlist_dictionary[entry][1])
    year = most_common(playlist_dictionary[entry][2])
    print('{} | {} [Performed by {}, {}]'.format(album, artist, performer, year))
					    
					  

This was super fun to work through. It seems data is always messy, so there are many twists and turns before you get to the clean results you’re looking for. The fun part of being a Python n00b is the delightful discoveries in even the most mundane exercises.

I’m using this more and more at work, and store most all my data in .csv format for easy retrieval.

All code is up on GitHub

18.02.06