R, Genderize.io and Azure ML combine to detect male/female members of parliament

I want to reproduce some published research on the gender mix of corporate boards of directors. I’ve seen many articles published with statistics that show women under-represented on boards, and that companies with gender-balanced boards do slightly better.

You can get data on who is on the board of a public company, but that tends to be only their names. My theory is that I can probably infer gender from the name. You can easily imagine getting a baby-name book and looking up whether "John" is in the boys' section or not. Of course, as soon as you get to Hillary and Leslie we are in trouble, and that's before we get past my limited knowledge of names around the world. So I used an API called genderize.io to outsource those problems.

This blog post is about a test project to verify that genderize works, by using a dataset of British parliament members that has names and gender, so we have something to check the results against.

Aside: I appreciate that gender isn't a binary quality and it isn't just biology. I don't know enough to include a discussion of those who self-identify as a gender other than male or female. I think/assume that the percentage of people who identify that way is small enough that it is unlikely to invalidate the conclusions.*

This post is also about my philosophy of doing data: it's not programming. You need the results; you don't need to create high-quality code that forms a reusable framework.

In order to preserve some suspense, the Azure part comes in later 🙂

Getting the MP data

In the UK we have a great digital government effort. The gov.uk site is fantastic, but what is really impressive is the amount of data supporting it. I've long been a fan of TheyWorkForYou.com, which takes your postcode and lets you search the minutes of every parliamentary debate your representative has spoken in.

But we need much simpler data: just the names and genders of the MPs. It turns out there is a great RESTful web service for MPs, complete with a query language, but we only need the one link that returns the current MPs.

Actually, it returns XML. I'm a bit better with PowerShell than I am with R, so I just wrote a quick 'n' dirty PowerShell script to rip the XML and turn it into CSV, which R handles trivially.

The XML has one Member element per MP, with fields including Member_Id, FullTitle, a listas name in "Surname, Title Firstname" form, and Gender.

Of course, we immediately get some funny business about dropping honorifics like "Rt Hon", "Dr", "Prof", "Sir", etc. and parsing out the name, but we get there soon enough. PowerShell's native XML support is excellent, and tab-completion makes getting it working very fast. I used a regex that's so awful it hurts:

$raw.Members.ChildNodes | select Member_Id, FullTitle, @{l="firstname"; e={$_.listas -imatch "(\w+)(,\s)(ms\s*)*(dr\s*)*(mrs\s*)*(mr\s*)*(sir\s*)*(\w+)" | Out-Null; $Matches[8]}}, Gender | Export-Csv -Force -NoTypeInformation C:\Temp\mpnames.csv -Encoding UTF8

 

We end up with a CSV file that has Member_Id, FullTitle, firstname, Gender.

You can see that we have to watch the encoding, using UTF-8 as we pass data from one application to another, especially as we are crossing platforms.


Getting the gender from the name with genderize.io

Then I import the CSV file produced by PowerShell into R. Pretty easy: just a call to the tidyverse function read_csv. I always do a bit of cleanup; in this case I use the pipe operator %>% to send all the columns to rename_all, changing their names to lowercase, as R is case-sensitive.

mpnames <- read_csv("c:\\temp\\mpnames.csv") %>%
  rename_all(tolower)

Then I select the distinct names using dplyr and iterate through them.
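That step is a one-liner along these lines (a sketch; the original code isn't shown above):

# A guess at the distinct-names step: one row per unique first name,
# so each name only costs one genderize call.
distinct_names <- mpnames %>% distinct(firstname)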

I tested the genderize API in a browser and in PowerShell using curl, then wrote a quick 'n' dirty function in R to do the GET, parse the JSON and turn it into a one-row data.frame.
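A minimal sketch of such a helper, assuming httr for the GET and jsonlite for the JSON parsing (the original getSingleName isn't reproduced here):

library(httr)
library(jsonlite)

# Sketch of a getSingleName helper: GET genderize.io for one name and
# return a one-row data.frame of name, gender, probability and count.
getSingleName <- function(name) {
  resp   <- GET("https://api.genderize.io", query = list(name = name))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  data.frame(
    name        = name,
    gender      = ifelse(is.null(parsed$gender), NA, parsed$gender),
    probability = ifelse(is.null(parsed$probability), NA, as.numeric(parsed$probability)),
    count       = ifelse(is.null(parsed$count), NA, parsed$count),
    stringsAsFactors = FALSE
  )
}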

Then I iterate over the names using plyr, which has a nice function for taking a list of things, calling a function on each one and scooping the results up into one big table.

I had a bit of bother with the genderize web service, because occasionally it would throw an error, and I don't program R well enough yet to do the error handling, so I'd lose the whole set of results. That wouldn't be a problem, except that the free tier of the genderize API is rate-limited to 1000 calls per day, so if I keep re-running I'll soon run out. My crude hack was to run 100 names at a time and then manually smash the results together using dplyr's bind_rows.

Yes it’s crude, but we are thinking about the results here, and don’t care about the programming!

genderDB <- plyr::ldply(.data = (distinct_names %>% slice(1:100))$firstname, .fun = getSingleName)

genderDB2 <- plyr::ldply(.data = (distinct_names %>% slice(101:200))$firstname, .fun = getSingleName)

genderDB3 <- plyr::ldply(.data = (distinct_names %>% slice(201:300))$firstname, .fun = getSingleName)

genderDB4 <- plyr::ldply(.data = (distinct_names %>% slice(301:400))$firstname, .fun = getSingleName)

gender_all = bind_rows(genderDB, genderDB2, genderDB3, genderDB4)

 

Analysing the results of genderize name scoring

Genderize returns a result for each name, giving a probability and a count of how often that name appears in the genderize dataset:

{"name":"peter","gender":"male","probability":"0.99","count":796}

So if we want to be “certain” that a name has been accurately classified by gender, we’d look for a high probability and a large count. What’s a good number? Well you choose what “certain” means to you.

 

 

So a fair chunk of them are not high confidence, either because the name has a low count or because the probability of correctness is below 0.99.
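The join data frame used in the plot below isn't built in the code above; presumably it is the genderize results joined back onto the MP list, roughly like this:

# A guess at how `join` was built: attach the genderize scores to each MP by
# first name; the official gender column picks up a _gov suffix.
join <- mpnames %>%
  left_join(gender_all, by = c("firstname" = "name"), suffix = c("_gov", ""))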

I create the plot with this code using ggplot:

# do a plot, but colour-code the uncertain ones
ggplot(data = join %>%
         mutate(high_confidence = !(probability < 0.99 | count < 300))) +
  geom_bar(alpha = 0.6,
           mapping = aes(x = gender, fill = high_confidence))

 

Here we do a basic ggplot, which works by taking data and sending it to a type of plot (geom_bar here). We also create an extra field on the dataset by using the %>% pipe operator to send the data through a mutate call that adds the calculated column high_confidence. The NAs are where genderize refused to give a result because it didn't have the name in its database.

Comparison with the gender field in the original government data is good! If we drop the confidence threshold then a few errors creep in, as you'd expect.
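A rough way to eyeball that comparison (the column names follow the hypothetical join sketched above):

# Cross-tabulate the official gender against genderize's answer, split by
# whether the genderize score passed the confidence cut-off.
join %>%
  mutate(high_confidence = !(probability < 0.99 | count < 300)) %>%
  count(gender_gov, gender, high_confidence)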

Azure and what to do about the low-confidence results

So when we have some low-confidence results, we can either filter them out, accept them knowing that a few will be wrong (but how significant a few?), or find another data source to try to patch the problem. Sometimes you might be tempted to process a few by hand, but that may not give better results.

Something caught my eye while I was looking at the cases where genderize was wrong or uncertain. Checking on the MP called "Hilary", who is a man, I happened across his picture, and I found that the government API also offers official portraits!

A quick Google brought me to this Azure Cognitive Services page, which offers many image-analysis APIs. I gave it a try with this face:


Hilary Benn MP

And I got back a successful identification as a man's face. I signed up for an API key using my MSDN account and put together this PowerShell call as a proof of concept. It took about 20 minutes.

 

$memberId = 172
$key = "00baf713c2eb4d48b118bdca2daf2d0e"
$imageUrl = "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/$memberId/"

$headers = @{}
$headers.Add("Content-Type", "application/json")
$headers.Add("Ocp-Apim-Subscription-Key", $key)

$body = '{"url":"' + $imageUrl + '"}'

$response = curl -UseDefaultCredentials -Method Post -Headers $headers -Body $body `
    -Uri "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Faces&language=en"

$json = $response.Content | ConvertFrom-Json

write-host "gender is $($json.faces[0].gender)"

 

Pretty simple: it uses a POST to ship the URL of the image, or you can encode the image and upload it directly.

There is a great page for testing calls to the API.

Getting it to work in R was a little more tedious. I tried to use the native httr library for making the HTTP call, but I just couldn't get it to work. It's here that R's syntax is a pain: lacking an object.method syntax, everything disappears inside brackets. To cut a long story short, I think the problem was the encoding of the body, but I couldn't see what was happening. Also, the httr library seems to work down at the TCP socket level, so you can't spy on the connection with Fiddler, which took me a while to figure out. So I fell back on previous approaches that use the MS COM stack. Again: cut and paste, people. It works.
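For reference, a minimal sketch of how the httr route would typically look, with encode = "json" handling the body encoding (a sketch only, not the COM-based code actually used):

library(httr)

# POST the image URL to the Computer Vision analyze endpoint and read the
# gender of the first face found. Substitute your own subscription key.
member_id <- 172
image_url <- paste0(
  "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/",
  member_id, "/")

resp <- POST(
  "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze",
  query = list(visualFeatures = "Faces", language = "en"),
  add_headers("Ocp-Apim-Subscription-Key" = "<your key>"),
  body = list(url = image_url),
  encode = "json"
)

content(resp)$faces[[1]]$gender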

The final result was that the high-quality genderize results, combined with the image lookup, gave 100% success over the 600 MPs.

Conclusion: patching data together gets better results

Using multiple sources of data improves results, but the additional processing is expensive, both in computation time and in the time needed to research and code alternatives. However, the cloud ML services now available are quick to wire in and give you access to the cutting edge in a few seconds. If I'd had to think about doing the image recognition myself, the project would have ended there, even though I know it is academically possible.

Also, hacking it just works when you care about the result.

 

 

 

*Other aside: thorny stuff, talking about gender. Is it OK to study gender like this? Well, first, I think it's a problem if the leadership of countries and companies isn't balanced. That's a personal assumption and a bias. Second, I would go further and study race, religion and nationality balance if I could do it in a way that respected the anonymity of those being studied… But we couldn't do that without seriously stalking people via social media or other means, and that feels like crossing a line to uncover something personal about them. So why doesn't that apply to gender? For a percentage of the population, their self-identified gender is as private as their religious beliefs. Well, you could argue that a leadership position in a public company is one that requires you to be accountable, and part of that accountability costs you a measure of your anonymity. You could also argue that gender is in the public domain for the vast majority of the population. Where the line is crossed is something we haven't figured out in a world that is awash with personal data. I think I've stayed on the correct side of the line.

Also, what happens if I create evidence that gender-balanced boards do worse for companies? I think that actually, any evidence that gender affects outcomes is interesting (and worrying) but would probably require much deeper research.

Quick loops and testing command performance

I heard about someone testing the speed of a PowerShell call using Measure-Command. I combined it with a quick loop and the cleverness of piping into Measure-Object.

1..10 |% { Measure-Command -Expression {sleep 1}} | Measure-Object -Property TotalMilliseconds -Average -Maximum -Minimum

 

So the quick loop is 1..10, which produces the integers 1 to 10. We then pipe that into |% (the ForEach-Object alias), and then into Measure-Command, which produces output like this:

PS C:\batchfiles> Measure-Command -Expression {write-host "hello"}

hello

 

Days : 0

Hours : 0

Minutes : 0

Seconds : 0

Milliseconds : 19

Ticks : 190751

TotalDays : 2.2077662037037E-07

TotalHours : 5.29863888888889E-06

TotalMinutes : 0.000317918333333333

TotalSeconds : 0.0190751

TotalMilliseconds : 19.0751

 

And then the pipeline combines the multiple calls into a "table" of objects (well, every object has the same properties, which means the shell can print it like a table):

PS C:\batchfiles> 1..10 |% {Measure-Command -Expression {write-host "hello"}} | Ft

hello


hello

hello

Days Hours Minutes Seconds Milliseconds
---- ----- ------- ------- ------------
   0     0       0       0           10
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            4
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0

 

And then the rest is just the magic of Measure-Object.

 


 

RStudio grabbing data from Excel

Getting started with R can be the trickiest part, and the part where you lose out most to click 'n' drag solutions like Excel and Tableau.

So how can we get started faster and level that playing field? Well I must unlearn what I have learned about programming.


wise green data analyst

 

Well, I just found this little "Import Dataset" button in RStudio which gets you up and running faster.

 

This is another case where I need to change my mindset: R is not programming, it’s just doing the work of analysis. We don’t need to write a program, we need to do the analysis in this productivity environment that uses code. So don’t write a program that imports and cleans data from a file… click that button and just import the data to the environment and get started!

Clicking it gives a few options (CSV, Stata, etc.) but of course the main one for us is Excel.

It lets you look at the columns and tabs in the sheet, skip rows that don't contain data, and then you just click Import and you have the data in your environment, ready to go. Or, just as easily, you can copy the code it generates to load the file.
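Behind the scenes the dialog writes readxl code, roughly like this (the file name, sheet and skip count here are just made-up examples):

library(readxl)

# Roughly what the Import Dataset dialog generates; adjust the path, sheet
# and skip values to your own workbook.
board_data <- read_excel("C:/Temp/board_data.xlsx", sheet = "Sheet1", skip = 2)
View(board_data)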

Don’t code, analyse using code. Use the productivity features of the data workshop and you must unlearn what you have learned.

 

New laptop with chocolatey

I'd heard of Chocolatey before, but when I was rebuilding my laptop this time I was nudged at just the right time*

I keep a list of everything I install onto my laptop so I can put it back together again when I lose my machine.

But this time, when I was putting back tools like WinMerge, I did it like this:

choco install winmerge

at the PowerShell prompt.

It doesn't work for everything, but it made a few steps a lot quicker. Perhaps for teams with standard tools this would help. SCCM is another way, but these packages are maintained by someone else! So some risks apply.

R and global variables, unlearning programming habits for data analysis

I’ve said many a time that doing data analysis in R is not programming, and yet I have the habits of a lifetime to undo.

This is not about encapsulation and abstraction, it’s about getting a result. Less about unlearning programming and more that I need to learn the idioms of data analysis, and the way to use data analysis tools like RStudio.

I'm working on quite an involved piece of R code that I've been tinkering with and adding to for a few days, and it's getting a little bit big. I mean, it's only a few hundred lines, but already (as with PowerShell) without the structure of modules, classes and namespaces it's getting messy.

So big, in fact, that I've started to separate sections with big comment banners.
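Something along these lines (an illustrative banner, not one lifted from the actual script):

###############################################################################
##  SECTION 2: TIDY THE IMPORTED DATA                                        ##
###############################################################################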

When you have a comment banner, you have a problem.

 

Global variables are not evil

This is such a piece of developer folklore, like "GOTO considered harmful", that it's hard to give up.

My program follows the classic data analysis model from R for Data Science: import, then tidy, then a cycle of transform, visualise and model, and finally communicate.

I've just realised what I should be doing: I should have a few R script files, each doing a different bit, and they can communicate via global variables!

No! Wait, come back!

 

RStudio is not an IDE!

RStudio isn't for programming. It's a data science whiteboard, for playing with ideas. You keep the data in the environment while you are working, like a clipboard. The variables aren't even global to the script! They live outside the script, and they even survive when you shut RStudio down and start it again…

Use the global. I think that RStudio projects are perfect for embracing the global variable.
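A sketch of what I mean (the file names are invented): a handful of scripts, each doing one stage, passing their results through the global environment:

# Each script leaves its output in the global environment for the next to use.
source("01_import.R")     # creates `raw_data`
source("02_tidy.R")       # reads `raw_data`, creates `tidy_data`
source("03_visualise.R")  # reads `tidy_data`, draws the plots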

 

Text Mining in R

So I heard from someone that they are using R to mine text, to look for sentiment in statements about the market.

I thought I'd give it a try, but instead using the Project Gutenberg text of Jane Eyre.

I used the tm (text mining) package because I found it first, and it got me started, though I haven't done any of the real work of analysing text (like looking for correlations between words).

But still, got me started.

#text mining

library(tm)

library(wordcloud2)

library(tidyverse)

 

docs <- Corpus(DirSource(pattern = "text_source*", ignore.case = TRUE, encoding = "UTF-8"))

 

# don’t use this, it seems to break everything

# inspect(docs)

 

 

# clean the docs

docs <- tm_map(docs, removePunctuation)

# stopwords is a very slow step, avoid running it in demo
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("will", "now", "one", "said", "like", "little"))
docs <- tm_map(docs, removeNumbers)

# wrap tolower in content_transformer so tm keeps the corpus structure intact
docs <- tm_map(docs, content_transformer(tolower))

 

 

 

dtm <- DocumentTermMatrix(docs)

freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing = TRUE)

 

tops <- freq[head(ord, 1000)]

 

 

wf <- tibble(word = labels(tops), count = tops) %>% filter(count < 500)

 

wordcloud2(data = wf)