R, Genderize.io and Azure ML combine to detect male/female members of parliament

I want to reproduce some published research on the gender mix of corporate boards of directors. I’ve seen many articles published with statistics that show women under-represented on boards, and that companies with gender-balanced boards do slightly better.

You can get data on who is on the board of a public company, but that tends to be only their names. My theory is that I can probably infer gender from name. You can easily imagine getting a baby name book and looking up whether “John” is in the boy section or not. Of course, as soon as you get to Hillary and Leslie we are in trouble. And that’s only starting with my limited global knowledge of names. So, I used an API called genderize.io to outsource those problems.

This blog post is about a test project to verify that genderize works, by using a dataset of British parliament members that has names and gender, so we have something to check the results against.

Aside: I appreciate that gender isn’t a binary quality and it isn’t just biology. I don’t know enough to include a discussion of those who self-identify as a gender other than male or female. I assume that the percentage of people who feel that way is unlikely to invalidate the conclusions.*

It’s also about my philosophy of doing data; it’s not programming. You need the results; you don’t need to create high-quality code that is a reusable framework.

In order to preserve some suspense, the Azure part comes in later :)

Getting the MP data

In the UK we have a great digital government effort. The gov.uk site is fantastic, but what is really impressive is the amount of data that supports it. I’ve long been a fan of TheyWorkForYou.com, which takes your postcode and lets you search the minutes of every parliamentary debate your representative has spoken in.

But, we need much simpler data: just the names and genders of the MPs. Turns out there is a great RESTful web service for MPs with a query language. But we just need this one link that returns the current MPs.

Actually, it returns XML. I’m a bit better with PowerShell than I am with R, so I just wrote a quick ‘n’ dirty PowerShell script to rip the XML and turn it into CSV, which R handles trivially.

XML looks like this:

Of course, we immediately get some funny business about dropping the honorifics like “Rt Hon”, “Dr”, “Prof”, “Sir”, etc. and parsing out the name, but we get there soon and PowerShell’s native XML support is excellent; tab-completion immediately makes getting it working very fast. I used a Regex that’s so awful it hurts:

$raw.Members.ChildNodes | select Member_Id, FullTitle, @{l="firstname"; e={$_.listas -imatch "(\w+)(,\s)(ms\s*)*(dr\s*)*(mrs\s*)*(mr\s*)*(sir\s*)*(\w+)" | Out-Null; $Matches[8]}}, Gender | Export-Csv -Force -NoTypeInformation C:\Temp\mpnames.csv -Encoding UTF8

 

We end up with a CSV file that has Member_Id, FullTitle, firstname, Gender.

You can see that we have to watch the encoding, using UTF-8 as we pass from one application to another, especially as we are cross-platform.

R library names are highlighted.

Getting the gender from the name with genderize.io

Then I import the CSV file produced by PowerShell into R. That’s pretty easy: just a call to the tidyverse function read_csv. I always do a bit of cleanup, in this case using the pipe operator %>% to send all the columns to rename_all to change their names to lowercase, as R is case-sensitive.

library(tidyverse)

mpnames <- read_csv("c:\\temp\\mpnames.csv") %>%
  rename_all(tolower)

Then I select the distinct names out using dplyr and then iterate through them.

I tested the genderize API using a browser and in PowerShell using curl, then wrote a quick ‘n’ dirty method in R to do the GET, parse the JSON and turn it into a 1 row data.frame.
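The helper itself is not shown in this post, so here is a minimal sketch of what such a function might look like, assuming the httr and jsonlite packages are installed. The names parse_genderize and get_single_name are made up for illustration and are not the original code; the URL is genderize's public endpoint.

```r
# Sketch only: parse_genderize() and get_single_name() are hypothetical names.
# Needs the httr and jsonlite packages to be installed.

# Turn one genderize-style JSON reply into a 1-row data.frame.
parse_genderize <- function(json_text) {
  reply <- jsonlite::fromJSON(json_text)
  # An unknown name comes back with a null gender; keep it as NA.
  or_na <- function(x) if (is.null(x)) NA else x
  data.frame(
    firstname   = reply$name,
    gender      = as.character(or_na(reply$gender)),
    probability = as.numeric(or_na(reply$probability)),
    count       = as.integer(or_na(reply$count)),
    stringsAsFactors = FALSE
  )
}

# Do the GET for one first name and hand the body to the parser.
get_single_name <- function(name) {
  resp <- httr::GET("https://api.genderize.io/", query = list(name = name))
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx replies
  parse_genderize(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Keeping the parsing separate from the GET means the parser can be tried out on a pasted JSON string without spending any of the daily API allowance.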

Then I iterate the names using plyr, which has a nice method for taking a list of things, doing a foreach method call with them and scooping up the results into one big table.

I had a bit of bother with the genderize web service, because occasionally it would throw an error. I don’t program R well enough yet to do the error handling, so I’d lose the whole set of results. That wouldn’t be a problem, except that the free version of the genderize API is rate limited to 1000 calls per day, so if I keep re-running I’ll soon run out. My crude hack was to run 100 at a time, then manually smash the batches together using dplyr’s bind_rows.

Yes it’s crude, but we are thinking about the results here, and don’t care about the programming!

genderDB  <- plyr::ldply(.data = (distinct_names %>% slice(1:100))$firstname,   .fun = getSingleName)
genderDB2 <- plyr::ldply(.data = (distinct_names %>% slice(101:200))$firstname, .fun = getSingleName)
genderDB3 <- plyr::ldply(.data = (distinct_names %>% slice(201:300))$firstname, .fun = getSingleName)
genderDB4 <- plyr::ldply(.data = (distinct_names %>% slice(301:400))$firstname, .fun = getSingleName)

gender_all <- bind_rows(genderDB, genderDB2, genderDB3, genderDB4)
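For anyone who would rather not batch by hand, one possible alternative is to wrap each call in tryCatch so that a single failure yields a placeholder row instead of killing the whole run. This is a sketch in base R, not what was run for this post; safe_lookup and lookup_all are hypothetical names, and it assumes the lookup function returns a 1-row data.frame with the columns firstname, gender, probability and count.

```r
# Sketch only: safe_lookup() and lookup_all() are hypothetical helpers.

# Call `lookup` for one name; on error return a placeholder row instead of stopping.
safe_lookup <- function(name, lookup) {
  tryCatch(
    lookup(name),
    error = function(e) {
      data.frame(firstname = name, gender = NA_character_,
                 probability = NA_real_, count = NA_integer_,
                 stringsAsFactors = FALSE)
    }
  )
}

# Look up every name and stack the 1-row results into one table.
lookup_all <- function(names, lookup) {
  rows <- lapply(names, safe_lookup, lookup = lookup)
  do.call(rbind, rows)
}
```

A call such as lookup_all(distinct_names$firstname, getSingleName) would then return one table in which failed names show up as NA rows, so only those need retrying later.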

 

Analysing the results of genderize name scoring

Genderize returns a result for each name that gives a probability (between 0 and 1) and a count of how many times the name appears in the genderize dataset:

{"name":"peter","gender":"male","probability":"0.99","count":796}

So if we want to be “certain” that a name has been accurately classified by gender, we’d look for a high probability and a large count. What’s a good number? Well you choose what “certain” means to you.

 

 

So we get a fair chunk of them that are not high confidence, either because the name has a low count or because the probability is lower than 0.99.

I create the plot with this code using ggplot:

# do a plot, but colour code the uncertain ones
ggplot(data = join %>%
         mutate(high_confidence = !(probability < 0.99 | count < 300))) +
  geom_bar(alpha = 0.6,
           mapping = aes(x = gender, fill = high_confidence))

 

Here we do a basic ggplot, which works by taking data and adding a type of plot to it (the geom_bar here). We also create an extra field on the data set by using the %>% pipe operator to send the data through a mutate call that adds the new calculated column high_confidence. The NAs are where genderize refused to give a result because it didn’t have the name in its database.

Comparison with the original data from the government gender field is good! If we lower the confidence threshold then a few errors creep in, as you’d expect.
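To put a number on that comparison, something like the following dplyr sketch could be used. The data frame here is a toy stand-in with made-up values, and the column names official_gender and predicted_gender are hypothetical; it also assumes both sources code gender the same way, which may need a recode in practice.

```r
library(dplyr)

# Toy stand-in for the joined table; column names and values are made up.
scored <- data.frame(
  firstname        = c("peter", "hilary", "leslie", "anne"),
  official_gender  = c("male", "male", "female", "female"),
  predicted_gender = c("male", "female", "male", "female"),
  probability      = c(0.99, 0.93, 0.60, 0.99),
  count            = c(796, 120, 250, 900),
  stringsAsFactors = FALSE
)

# Share of names where the prediction matches the official field,
# split by the same confidence rule used for the plot.
accuracy <- scored %>%
  mutate(
    high_confidence = probability >= 0.99 & count >= 300,
    correct         = predicted_gender == official_gender
  ) %>%
  group_by(high_confidence) %>%
  summarise(n = n(), share_correct = mean(correct))

print(accuracy)
```

Changing the two thresholds and re-running shows how quickly errors appear as the definition of “certain” is relaxed.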

Azure and what to do about the low-confidence results

So in this situation when we have some low confidence results, we can either filter them out, or accept them knowing that a few will be wrong (but a significant few?) or we can find another data source to try and patch the problem. Sometimes you might be tempted to manually process a few, but this may not get better results.

Something caught my eye while I was looking at the cases where genderize was wrong or uncertain. Checking on the MP called “Hilary”, who is a man, I happened across his picture. I also found that the government API offers official portraits!

A quick Google search brought me to this Azure Cognitive Services page, which offers many image analysis APIs. I gave it a try with this face:


Hilary Benn MP

And got back successful identification as a man’s face. I signed up for an API key using my MSDN account and put together this PowerShell call as proof of concept. This took about 20 minutes.

 

$memberId = 172
$key = "00baf713c2eb4d48b118bdca2daf2d0e"
$imageUrl = "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/$memberId/"

$headers = @{}
$headers.Add("Content-Type", "application/json")
$headers.Add("Ocp-Apim-Subscription-Key", $key)

$body = '{"url":"' + $imageUrl + '"}'

# curl is Windows PowerShell's alias for Invoke-WebRequest
$response = curl -UseDefaultCredentials -Method Post -Uri "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Faces&language=en" -Headers $headers -Body $body

$json = $response.Content | ConvertFrom-Json

write-host "gender is $($json.faces[0].gender)"

 

Pretty simple: it uses a POST to ship the URL of the image; alternatively, you can encode the image and upload it.

There is a great page for testing calls to the API.

Getting it to work in R was a little more tedious. I tried to use the native httr library for making the http call, but I just couldn’t do it. It’s here that R’s syntax is a pain. As it lacks the object.method syntax, everything disappears inside brackets. To cut a long story short I think that the problem was encoding the body, but I couldn’t see what was happening. Also, the httr library seems to work at the TCP socket level so you can’t spy on the connection with Fiddler, which took me a while to figure out. So I fell back on previous approaches which use the MS COM stack. Again, cut and paste, people. It works.
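For reference, here is a sketch of how the same POST might be written with httr, on the assumption that body encoding really was the sticking point. It is untested against the live service, which may no longer be offered in this form, and it is not the approach that was used for the results in this post. The function names are hypothetical, and key and endpoint are placeholders to be supplied by the caller.

```r
# Sketch only: build_face_body() and detect_face_gender() are hypothetical names.
# Needs the httr and jsonlite packages to be installed.

# JSON body for the analyze call: {"url": "<portrait url>"}
build_face_body <- function(member_id) {
  image_url <- paste0(
    "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/",
    member_id, "/"
  )
  as.character(jsonlite::toJSON(list(url = image_url), auto_unbox = TRUE))
}

detect_face_gender <- function(member_id, key, endpoint) {
  resp <- httr::POST(
    endpoint,
    query = list(visualFeatures = "Faces", language = "en"),
    httr::add_headers(`Ocp-Apim-Subscription-Key` = key),
    httr::content_type_json(),
    body = build_face_body(member_id),  # a character body is sent as-is
    httr::verbose(data_out = TRUE)      # echo the request that is actually sent
  )
  httr::stop_for_status(resp)
  faces <- httr::content(resp, as = "parsed")$faces
  if (length(faces) == 0) NA_character_ else faces[[1]]$gender
}
```

Building the JSON string explicitly and switching on httr::verbose makes it possible to see exactly what goes over the wire without an external tool. Routing the traffic through Fiddler would presumably need an explicit httr::use_proxy setting.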

The final result: the high-confidence results from genderize, combined with the image lookup, gave a 100% success rate over the 600 MPs.
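One way to express that combination in dplyr is sketched below, under the assumption that the name-based answer is kept when it is high confidence and the face-based answer is used otherwise. The rows are toy data with made-up values and every column name is hypothetical.

```r
library(dplyr)

# Toy rows with made-up values; all column names here are hypothetical.
results <- data.frame(
  firstname       = c("peter", "hilary", "anne"),
  name_gender     = c("male", "female", "female"),  # from the name lookup
  high_confidence = c(TRUE, FALSE, TRUE),
  face_gender     = c(NA, "Male", NA),              # only fetched when needed
  stringsAsFactors = FALSE
)

combined <- results %>%
  mutate(
    # Treat a missing confidence flag (name unknown) as "not confident".
    use_name     = coalesce(high_confidence, FALSE),
    # Lower-case the face result in case the two sources differ in capitalisation.
    final_gender = if_else(use_name, name_gender, tolower(face_gender)),
    source       = if_else(use_name, "name", "face")
  )
```

Keeping a source column alongside the final answer makes it easy to check later how many classifications depended on the image lookup.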

Conclusion: patching data together gets better results

Using multiple sources of data improves results, but the additional processing is expensive, both in computation time and in the time needed to research and code alternatives. That said, the cloud ML services now available are quick to wire in and give access to the cutting edge in a few seconds. If I’d had to think about doing image recognition myself the project would have ended there, even though I know it is academically possible.

Also, hacking it just works when you care about the result.

 

 

 

*Other aside: thorny stuff, talking about gender. Is it OK to study gender like this? Well first, I think it’s a problem if leadership of countries/companies isn’t balanced. That’s a personal assumption and a bias. Second, I would go further and study race, religion and nationality balance if I could do it in a way that respected the anonymity of those being studied… But we couldn’t do that without seriously stalking people via social media or other means; and that feels like a line has been crossed to uncover something personal about them. So why doesn’t that apply to gender? For a percentage of the population their self-identified gender is as private as their religious beliefs. Well, you could argue that being in the leadership position of a public company is one that requires you to be accountable; and part of that accountability costs you a measure of your anonymity. You could also argue that gender is in the public domain for a vast majority of the population. Where the line is crossed is something that we haven’t figured out in a world that is awash with personal data. I think I’ve stayed on the correct side of the line.

Also, what happens if I create evidence that gender-balanced boards do worse for companies? I think that actually, any evidence that gender affects outcomes is interesting (and worrying) but would probably require much deeper research.
