Data about packages

Finding packages, extensions and apps for everything from text editors to Spotify to programming environments* is 50% of the battle for productivity.

What better way to decide than with some data?

Here is some code for getting download data on R packages from CRAN:

install.packages("devtools")
library(devtools)
devtools::install_github("metacran/cranlogs")

cranlogs::cran_downloads(when = "last-month", packages = c("ggplot2"))
cranlogs::cran_downloads(when = "last-month", packages = c("FactoMineR"))
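To compare packages side by side, a minimal sketch along these lines works (assuming cranlogs and ggplot2 are installed):

library(cranlogs)
library(ggplot2)

# daily downloads for both packages over the last month
downloads <- cran_downloads(when = "last-month",
                            packages = c("ggplot2", "FactoMineR"))

# one line per package
ggplot(downloads, aes(x = date, y = count, colour = package)) +
  geom_line()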

 

or you could look at the reviews as well:

https://www.crantastic.org/popcon

 

 

*Of course, it’s all about vertically sharded value chains and creating an ecosystem that allows disruptors to flourish. Or something. Fintech goes here. And blockchain.


Ungroup in PowerShell

PowerShell often forms part of my data-hacking arsenal. When I get a text file that's so messy it's not fit to be a CSV yet, I PoSh on it.

So, you’ve imported some data into Powershell and you’ve got a list of stuff, but there are some keys that have more than one row.

You can look for the duplicates with:

$myData | group key

The output of the Group-Object operation is something like a dictionary, with a summary count, so you can easily filter for the duplicates:

$myData | group key |? {$_.count -gt 1}

But then you want the original rows back, not this dictionary.

Do this:

$myData | group key |? {$_.count -gt 1} | select -ExpandProperty group

This also works on any nested object that is being returned in a property in a select.

There is also an “ungroup” operation in the R package dplyr.
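For the record, a rough dplyr sketch of the same duplicate-hunting pipeline (assuming a data frame myData with a column key):

library(dplyr)

myData %>%
  group_by(key) %>%
  filter(n() > 1) %>%   # keep only keys that appear more than once
  ungroup()             # back to plain rows, like -ExpandProperty group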

Enterprise tech meetup

I went to a tech meetup with the thrilling name “London Enterprise Tech Meetup”, along with Luke Boucher. Thanks for the invite, Luke.


who says tech has a diversity problem?

The audience was mostly as smart as the speakers, so there was some good debate, some good points, and some – as one speaker put it – “mini presentations disguised as questions” :). My mini takeaways:

  • People doing similar things are easiest to understand, and they have similar problems… the downside, of course, is that listening to them is a bit of an echo chamber: you hear your own problems reflected back.
  • It provided a datapoint on what “big” is: a few comments suggested that above a billion data points even data platforms struggle and you need something really bespoke.
  • Julia as a programming / data science system has some interesting features that might be worth a look.
  • Hadoop got a real bashing, which was news to me.
  • Natural language processing can be well integrated into human-led research:
    • The example given by Pontus Stenetorp was drug interactions.
    • Say there are 8,000 types of drug, each with a detailed document describing how it works.
    • There are unknown interactions between these drugs that have not been discovered.
    • Human doctors would have to read and understand all 8,000 to be effective, or create their own mini databases with their expert knowledge.
    • NLP systems can extract words/concepts within this limited context and look for correlations.
    • We don’t expect the system to understand drugs; it just needs to read how each drug works (i.e., which organs or chemical pathways it uses) and look for connections.
    • Then it can present the human doctors with a much shorter list of documents to read.

So was it worth it? It’s not about new ideas and changing your mind; it’s more about synthesizing your current knowledge with new elements rather than supplanting what you know. There were a few new elements, so yes, worthwhile. Other than Pontus Stenetorp’s, the presentations were predictably awful, but I can google faster than they can speak. It’s just an indicator: here’s an interesting thing, look it up.

As a networking event I didn’t find it effective, but I didn’t stay long enough to make that happen. I think to get the best out of it I’d have to work a lot harder at saying hello and dragging it out of them.

 

 

 

 

 

R and geographic maps

Let’s start with code and a picture:

library(maps)
map("county")

 


If you have a bit of population data from the US census, you can do a quick population map easily enough:

(no code because mine has the oddities of my data structures in it)
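A generic sketch of the idea, though, assuming a hypothetical data frame county_pop with a region column using the maps package’s “state,county” naming and a population column:

library(maps)

# the county names, in the same order the polygons are drawn
county_names <- map("county", plot = FALSE, fill = TRUE)$names

# bucket each county's population into five colour bands
pop     <- county_pop$population[match(county_names, county_pop$region)]
buckets <- cut(pop, breaks = 5)
palette <- heat.colors(5)

map("county", fill = TRUE, col = palette[buckets])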

What’s nice is that these maps aren’t special: they are simply a vast array of points that define the boundaries, plotted by brute force.
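You can poke at the raw structure yourself, something like this:

library(maps)

county_outline <- map("county", plot = FALSE)
str(county_outline)       # a list of x, y, range and names
head(county_outline$x)    # longitudes, with NAs separating the polygons
head(county_outline$y)    # latitudes

# a brute-force plot of the same boundary points
plot(county_outline$x, county_outline$y, type = "l")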

Improving data by patching with other data sources

I also had some data on locations and wanted to link them to a county. In 3% of cases there was no county, but we did have a latitude and longitude. A few googles indicated that the lat/long was correct even for the blanks (right on the roof, actually…). So let’s patch the data together with a bit of R to fill in the blanks.

What do you mean “in”?

Some cases of a point being inside an area are obvious:

 

Less obvious cases involve lakes and offshore islands:

Things really do get this bad in geography, with the wonderful concept of exclaves, or an island in a lake on an island in a lake on an island.

This probably means there is a whole lot to consider when you detect whether a point is inside a polygon, such as ensuring you are using the same projection and reference points for your lat/longs.

Library to the rescue!

Of course there is a library to do this, but libraries tend to assume you already know what you are doing!

So I googled a lot until I found the approaches that were closest and simplest:

https://gis.stackexchange.com/questions/133625/checking-if-points-fall-within-polygon-shapefile

https://stackoverflow.com/questions/26062280/converting-a-map-object-to-a-spatialpolygon-object?noredirect=1&lq=1

but basically:

  1. Found a library (maptools) that converts the data structure from the maps library into a list of polygons
    1. (polygons are not first-class citizens in the R language, but they have a stable implementation supported by many libraries)
  2. Found the glue code to stitch together an arbitrary list of lat/longs into a structure that the SpatialPointsDataFrame constructor can accept
  3. Then we simply use the over method, which does all the hard work

When I checked my end results, things looked good! Comparing against a test set where I DID know the county in the original data, my lat/long lookup was correct in 99% of cases. The 1% left over was revealed on google maps to be within 100m of a county boundary, which feels explicable… but bad enough that I don’t want to use this patching method unless I have to.

So I think I improved my data from being 3% bad to about 0.03% bad (the 3% of rows with no county, multiplied by the lookup’s 1% error rate).

Code looks like this but YMMV:

(as this was hacked together from code copied off the internet, it contains some things that are not in the tidyverse style I’d normally recommend)

# https://stackoverflow.com/questions/8751497/latitude-longitude-coordinates-to-state-code-in-r
# tiny change from the original code to do it for county as well

# these libraries are outside the tidyverse but do the heavy lifting
library(maps)       # county outlines
library(maptools)   # map2SpatialPolygons
library(sp)         # SpatialPointsDataFrame, CRS and over

latlong2state <- function(pointsDF) {
  # Prepare a SpatialPolygons object with one SpatialPolygon per county
  # (the maps package names them "state,county")
  geo_states <- map("county", fill = TRUE, col = "transparent", plot = FALSE)
  #IDs <- sapply(strsplit(geo_states$names, ":"), function(x) x[1])
  states_sp <- map2SpatialPolygons(geo_states, IDs = geo_states$names,
                                   proj4string = CRS("+proj=longlat +datum=WGS84"))

  # Convert pointsDF to a SpatialPointsDataFrame object
  pointsSP <- SpatialPointsDataFrame(data.frame(pointsDF$longitude, pointsDF$latitude),
                                     data = pointsDF,
                                     proj4string = CRS("+proj=longlat +datum=WGS84"))

  # Use 'over' to get _indices_ of the Polygons object containing each point
  indices <- over(pointsSP, states_sp)

  # Split the "state,county" IDs back out into the two lookup columns
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)
  pointsDF$looked_up_bare_state_name  <- sapply(strsplit(stateNames[indices], ","), function(x) x[1])
  pointsDF$looked_up_bare_county_name <- sapply(strsplit(stateNames[indices], ","), function(x) x[2])

  return(pointsDF)
}

 

# check it with a test set
test_county_checker <- latlong2state(stores %>%
  select(year, store_number, store_name, latitude, longitude,
         original_bare_state_name, original_bare_county_name) %>%
  slice(1:5000))

test_county_checker %>%
  filter(original_bare_county_name != looked_up_bare_county_name,
         original_bare_state_name != looked_up_bare_state_name)

R, Genderize.io and Azure ML combine to detect male/female members of parliament

I want to reproduce some published research on the gender mix of corporate boards of directors. I’ve seen many articles published with statistics that show women under-represented on boards, and that companies with gender-balanced boards do slightly better.

You can get data on who is on the board of a public company, but that tends to be only their names. My theory is that I can probably infer gender from name. You can easily imagine getting a baby name book and looking up whether “John” is in the boy section or not. Of course, as soon as you get to Hillary and Leslie we are in trouble. And that’s only starting with my limited global knowledge of names. So, I used an API called genderize.io to outsource those problems.

This blog post is about a test project to verify that genderize works, by using a dataset of British parliament members that has names and gender, so we have something to check the results against.

Aside: I appreciate that gender isn’t a binary quality and it isn’t just biology. I don’t know enough to include a discussion of those that self-identify as a gender other than male or female. I think/assume that the % of people who feel that way is unlikely to invalidate conclusions.*

This is also about my philosophy of doing data: it’s not programming. You need the results; you don’t need to create high-quality, reusable framework code.

In order to preserve some suspense, the Azure part comes in later :)

Getting the MP data

In the UK we have a great digital government effort. The gov.uk site is fantastic, but what is really impressive is the amount of data supporting it. I’ve long been a fan of TheyWorkForYou.com, which takes your postcode and lets you search the minutes of every parliamentary debate your representative has spoken in.

Here, though, we need much simpler data: just the names and genders of the MPs. It turns out there is a great RESTful web service for MPs, with a query language, but we just need the one link that returns the current MPs.

Actually, it returns XML. I’m a bit better with PowerShell than I am with R, so I just wrote a quick ‘n’ dirty PowerShell script to rip the XML and turn it into CSV, which R handles trivially.

The XML has one Member element per MP, with fields including Member_Id, FullTitle, ListAs and Gender (the ones we use below).

Of course, we immediately get some funny business about dropping honorifics like “Rt Hon”, “Dr”, “Prof”, “Sir”, etc. and parsing out the first name, but we get there soon enough. PowerShell’s native XML support is excellent: tab completion makes getting it working very fast. I used a regex that’s so awful it hurts:

$raw.Members.ChildNodes | select Member_Id, FullTitle, @{l="firstname"; e={$_.listas -imatch "(\w+)(,\s)(ms\s*)*(dr\s*)*(mrs\s*)*(mr\s*)*(sir\s*)*(\w+)" | Out-Null; $Matches[8]}}, Gender | Export-Csv -Force -NoTypeInformation C:\Temp\mpnames.csv -Encoding UTF8

 

We end up with a CSV file that has Member_Id, FullTitle, firstname, Gender.

You can see that we have to watch the encoding, using UTF-8 as we pass data from one application to another, especially as we are crossing platforms.
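If you’d rather do the honorific stripping in R instead of PowerShell, a rough stringr sketch (with a made-up example name) might look like this:

library(stringr)

# e.g. "Smith, Dr John" -> "John"  (hypothetical example)
strip_honorifics <- function(listas) {
  after_comma <- str_trim(str_extract(listas, "(?<=,).*"))
  cleaned <- str_remove_all(
    after_comma,
    regex("\\b(Rt Hon|Dr|Prof|Sir|Mr|Mrs|Ms)\\b\\.?\\s*", ignore_case = TRUE)
  )
  word(str_trim(cleaned), 1)   # first remaining word is the first name
}

strip_honorifics("Smith, Dr John")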


Getting the gender from the name with genderize.io

Then I import the CSV file produced by PowerShell into R. Pretty easy: just a call to the tidyverse function read_csv. I always do a bit of cleanup, in this case using the pipe operator %>% to send all the columns to rename_all to change their names to lowercase, as R is case-sensitive.

mpnames <- read_csv("c:\\temp\\mpnames.csv") %>%
  rename_all(tolower)

Then I select the distinct names out using dplyr and then iterate through them.

I tested the genderize API using a browser and in PowerShell using curl, then wrote a quick ‘n’ dirty method in R to do the GET, parse the JSON and turn it into a 1 row data.frame.
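The function itself is nothing special; a sketch of roughly its shape, using httr and jsonlite (not necessarily exactly what I ran), looks like this:

library(httr)
library(jsonlite)

getSingleName <- function(name) {
  # GET https://api.genderize.io?name=<name> and flatten the JSON reply
  response <- GET("https://api.genderize.io", query = list(name = name))
  parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  data.frame(
    name        = name,
    gender      = if (is.null(parsed$gender)) NA_character_ else parsed$gender,
    probability = if (is.null(parsed$probability)) NA_real_ else as.numeric(parsed$probability),
    count       = if (is.null(parsed$count)) 0L else as.integer(parsed$count),
    stringsAsFactors = FALSE
  )
}

getSingleName("peter")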

Then I iterate the names using plyr, which has a nice function (ldply) for taking a list of things, calling a function on each and scooping up the results into one big table.

I had a bit of bother with the genderize web service, because occasionally it would throw an error, and I don’t program R well enough yet to do the error handling, so I’d lose the whole set of results. That wouldn’t be a problem, except that the free tier of the genderize API is rate limited to 1,000 calls per day, so if I keep re-running I’ll soon run out. My crude hack was to run 100 names at a time and then manually smash the results together using dplyr’s bind_rows.

Yes it’s crude, but we are thinking about the results here, and don’t care about the programming!

genderDB <- plyr::ldply(.data = (distinct_names %>% slice(1:100))$firstname, .fun = getSingleName)

genderDB2 <- plyr::ldply(.data = (distinct_names %>% slice(101:200))$firstname, .fun = getSingleName)

genderDB3 <- plyr::ldply(.data = (distinct_names %>% slice(201:300))$firstname, .fun = getSingleName)

genderDB4 <- plyr::ldply(.data = (distinct_names %>% slice(301:400))$firstname, .fun = getSingleName)

gender_all = bind_rows(genderDB, genderDB2, genderDB3, genderDB4)

 

Analysing the results of genderize name scoring

Genderize returns a result for each name with a probability and a count of how many times that name appears in the genderize dataset:

{"name":"peter","gender":"male","probability":"0.99","count":796}

So if we want to be “certain” that a name has been accurately classified by gender, we’d look for a high probability and a large count. What’s a good number? Well you choose what “certain” means to you.

 

 

So a fair chunk of the names are not high confidence, either because they have a low count or a probability of correctness below 0.99.

I create the plot with this code using ggplot:

# do a plot, but colour code the uncertain ones
ggplot(data = join %>%
         mutate(high_confidence = !(probability < 0.99 | count < 300))) +
  geom_bar(alpha = 0.6,
           mapping = aes(x = gender, fill = high_confidence))

 

Here we do a basic ggplot, which works by taking data and sending it to a type of plot (geom_bar here). We also create an extra field on the dataset by using the %>% pipe operator to send the data through a mutate call that adds the calculated column high_confidence. The NA values are where genderize refused to give a result because it didn’t have the name in its database.

Comparison with the official gender field in the government data is good! If we drop the confidence threshold then a few errors creep in, as you’d expect.
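The check itself is just a cross-tabulation, something like this (official_gender and genderize_gender are placeholder column names, since my join isn’t shown here):

library(dplyr)

join %>%
  mutate(high_confidence = !(probability < 0.99 | count < 300)) %>%
  count(official_gender, genderize_gender, high_confidence)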

Azure and what to do about the low-confidence results

So when we have some low-confidence results, we can filter them out, accept them knowing that a few will be wrong (but how significant a few?), or find another data source to try to patch the problem. Sometimes you might be tempted to process a few manually, but that may not get better results.

Something caught my eye while I was looking at the cases where genderize was wrong or uncertain. Checking on the MP called Hilary, who is a man, I happened across his picture, and I found that the government API also offers official portraits!

A quick google brought me to this Azure Cognitive services page, which offers many image analysis APIs. I gave it a try with this face:


Hilary Benn MP

And got back successful identification as a man’s face. I signed up for an API key using my MSDN account and put together this PowerShell call as proof of concept. This took about 20 minutes.

 

$memberId = 172
$key = "00baf713c2eb4d48b118bdca2daf2d0e"
$imageUrl = "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/$memberId/"

$headers = @{}
$headers.Add("Content-Type", "application/json")
$headers.Add("Ocp-Apim-Subscription-Key", $key)

$body = '{"url":"' + $imageUrl + '"}'

$response = curl -UseDefaultCredentials -Method Post -Uri "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Faces&language=en" -Headers $headers -Body $body

$json = $response.Content | ConvertFrom-Json

write-host "gender is $($json.faces[0].gender)"

 

Pretty simple: it uses a POST to ship the URL of the image, or you can encode the image and upload it directly.

There is a great page for testing calls to the API.

Getting it to work in R was a little more tedious. I tried to use the native httr library for making the HTTP call, but I just couldn’t get it working. It’s here that R’s syntax is a pain: as it lacks the object.method syntax, everything disappears inside brackets. To cut a long story short, I think the problem was encoding the body, but I couldn’t see what was happening. Also, the httr library seems to work at the TCP socket level, so you can’t spy on the connection with Fiddler, which took me a while to figure out. So I fell back on previous approaches which use the MS COM stack. Again, cut and paste, people. It works.
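For reference, the httr call I was aiming for looks roughly like this; treat it as a sketch rather than the code I ended up using, and I suspect encode = "json" is what handles the body encoding that tripped me up:

library(httr)

api_key   <- "00baf713c2eb4d48b118bdca2daf2d0e"
image_url <- "http://data.parliament.uk/membersdataplatform/services/images/MemberPhoto/172/"

response <- POST(
  "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze",
  query = list(visualFeatures = "Faces", language = "en"),
  add_headers("Ocp-Apim-Subscription-Key" = api_key),
  body = list(url = image_url),
  encode = "json"
)

# same shape of reply as the PowerShell call above
content(response)$faces[[1]]$gender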

The final result was that the high-quality genderize results, combined with the image lookup for the rest, gave 100% success over the 600 MPs.

Conclusion: patching data together gets better results

Using multiple sources of data improves results; however, the additional processing is expensive, both in computation time and in the time needed to research and code alternatives. The cloud ML services that are available are quick to wire in and give access to the cutting edge in a few seconds. If I’d had to build image recognition myself, the project would have ended there, even though I know it is academically possible.

Also, hacking it just works when you care about the result.

 

 

 

*Other aside: thorny stuff, talking about gender. Is it OK to study gender like this? Well, first, I think it’s a problem if the leadership of countries/companies isn’t balanced. That’s a personal assumption and a bias. Second, I would go further and study race, religion and nationality balance if I could do it in a way that respected the anonymity of those being studied… But we couldn’t do that without seriously stalking people via social media or other means, and that feels like crossing a line to uncover something personal about them. So why doesn’t that apply to gender? For a percentage of the population their self-identified gender is as private as their religious beliefs. Well, you could argue that being in the leadership position of a public company is one that requires you to be accountable, and part of that accountability costs you a measure of your anonymity. You could also argue that gender is in the public domain for the vast majority of the population. Where the line is crossed is something we haven’t figured out in a world that is awash with personal data. I think I’ve stayed on the correct side of the line.

Also, what happens if I create evidence that gender-balanced boards do worse for companies? I think that actually, any evidence that gender affects outcomes is interesting (and worrying) but would probably require much deeper research.

Quick loops and testing command performance

I heard about someone testing the speed of a PowerShell call using Measure-Command. I combined it with a quick loop and the cleverness of the pipeline into Measure-Object.

1..10 |% { Measure-Command -Expression {sleep 1}} | Measure-Object -Property TotalMilliseconds -Average -Maximum -Minimum

 

So the quick loop is 1..10, which produces the integers 1 to 10. We then pipe that into |% (which is a foreach), and then into Measure-Command, which produces output like this:

PS C:\batchfiles> Measure-Command -Expression {write-host "hello"}

hello

 

Days : 0

Hours : 0

Minutes : 0

Seconds : 0

Milliseconds : 19

Ticks : 190751

TotalDays : 2.2077662037037E-07

TotalHours : 5.29863888888889E-06

TotalMinutes : 0.000317918333333333

TotalSeconds : 0.0190751

TotalMilliseconds : 19.0751

 

And then the pipeline combines the multiple calls into a “table” of objects (well, every object has the same properties, which means the shell can print them like a table):

PS C:\batchfiles> 1..10 |% {Measure-Command -Expression {write-host "hello"}} | ft

hello


hello

hello

Days Hours Minutes Seconds Milliseconds
---- ----- ------- ------- ------------
   0     0       0       0           10
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            4
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0
   0     0       0       0            0

 

And then the rest is just the magic of Measure-Object.
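For completeness, roughly the R equivalent of the same loop, using base R (a sketch):

# ten timed runs of a one-second sleep, then summary statistics
timings <- sapply(1:10, function(i) system.time(Sys.sleep(1))["elapsed"])
summary(timings)   # min / mean / max, like Measure-Object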

 


 

RStudio grabbing data from Excel

Getting started with R can be the trickiest part, and the part where you lose out most to click ‘n’ drag solutions like Excel and Tableau.

So how can we get started faster and level that playing field? Well I must unlearn what I have learned about programming.


wise green data analyst

 

Well, I just found this little “Import Dataset” button in RStudio, which gets you up and running faster.

 

This is another case where I need to change my mindset: R is not programming, it’s just doing the work of analysis. We don’t need to write a program; we need to do the analysis in a productivity environment that happens to use code. So don’t write a program that imports and cleans data from a file… click that button, import the data into the environment, and get started!

Clicking it gives a few options (CSV, Stata, etc.), but of course the main one for us is Excel.

It lets you look at the columns and tabs in the sheet and skip rows that don’t contain data; you just click go and you have the data in your environment, ready to use. Or you can copy the code that loads the file, just as easily.
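The generated code is essentially a readxl call, roughly like this (file and sheet names here are placeholders):

library(readxl)

stores <- read_excel("C:/Temp/stores.xlsx", sheet = "Sheet1", skip = 2)
View(stores)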

Don’t code, analyse using code. Use the productivity features of the data workshop and you must unlearn what you have learned.