R quick sample: downloading a file that you couldn't use in Excel

In this example, I take a data file that is annoyingly large (1.2 GB) and do a quick bit of exploration to start to see what is in it.

The data set is one from the UK government on prescriptions from doctors; it claims to be every prescription from every doctor in the UK.

To get useful insights from the data, I think you'd need to:

  • automate downloading many files; this 1.2 GB file is only one month's data (see the sketch right after this list)
  • parse the name of the item being prescribed, e.g., to pick out brand names and compare different dosages (see the snippet at the end of this post)
  • start to look for trends in ordering, to see the depth of the order book of the companies
  • or to see the rise of generic non-branded drugs
  • combine with geodata to see where it’s happening
  • combine with census data to look for health concerns in different areas
  • .. come on! there must be more!
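
Since the first bullet (automating the downloads) is the obvious next step, here's a minimal sketch of how I might do it with base R's download.file. The August 2017 URL comes from the script below; the June and July folder names and file codes are my guess that the same naming pattern holds, so treat it as untested.

# assumed month-folder / file-code pairs, based on the August 2017 URL pattern
base_url <- "http://datagov.ic.nhs.uk/presentation"
months   <- c("2017_06_June"   = "201706",
              "2017_07_July"   = "201707",
              "2017_08_August" = "201708")

for (folder in names(months)) {
  code <- months[[folder]]
  url  <- paste0(base_url, "/", folder, "/T", code, "PDPI+BNFT.CSV")
  dest <- paste0("C:/Downloads/T", code, "PDPI+BNFT.CSV")
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")  # base R is enough for a plain http URL
  }
}

With the files on disk, the rest of this post works on just the August one.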

 

library(tidyverse)

# how to download a file? google it!
# from here: https://stackoverflow.com/questions/23028760/download-a-file-from-https-using-download-file
library(RCurl)

# https://data.gov.uk/dataset/prescribing-by-gp-practice-presentation-level
#
# For each practice in England, including GP Practices, the following information is
# presented at presentation level for each medicine, dressing and appliance
# (by presentation name):
#   - the total number of items prescribed and dispensed
#   - the total net ingredient cost
#   - the total actual cost
#   - the total quantity
#
# http://datagov.ic.nhs.uk/presentation/2017_08_August/T201708PDPI+BNFT.CSV

prescription <- read_csv("C:/Downloads/T201708PDPI+BNFT.CSV")

prescription <- prescription %>%
  rename(actual_cost = `ACT COST`) %>%
  rename_all(tolower) %>%
  mutate(actual_cost = as.numeric(actual_cost))

prescription %>%
  ggplot(mapping = aes(x = actual_cost)) + geom_histogram()

prescription %>%
  summarise(mean = mean(actual_cost),
            median = quantile(actual_cost, 0.5),
            upper = quantile(actual_cost, 0.75),
            percent_90 = quantile(actual_cost, 0.9),
            percent_99 = quantile(actual_cost, 0.99),
            percent_999 = quantile(actual_cost, 0.999))

prescription %>%
  filter(actual_cost < 5000) %>%
  ggplot(mapping = aes(x = actual_cost)) + geom_histogram()

prescription %>%
  filter(actual_cost < 1000) %>%
  ggplot(mapping = aes(x = actual_cost)) + geom_histogram()

prescription %>%
  filter(actual_cost < 10) %>%
  ggplot(mapping = aes(x = actual_cost)) + geom_histogram()
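
Going back to the bullet about parsing the name of the item being prescribed, here's a first, hedged stab at it. I'm assuming the item name column in these files is BNF NAME, which the rename_all(tolower) above turns into `bnf name`; the regex just pulls out a dose in mg where the name contains one.

prescription %>%
  mutate(dose_mg = str_extract(`bnf name`, "[0-9]+(?=mg)")) %>%  # e.g. "500" from "...Tab 500mg"
  filter(!is.na(dose_mg)) %>%
  count(dose_mg, sort = TRUE)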


Fuzzy match gets useful!

So I did another post on fuzzy matching, but this is one where I really used it!

Someone sent me the name of a computer, but they sent the wrong one: I couldn't ping it. Clearly they'd made a typo. Rather than wait another day (they are in another time zone), I thought I'd find it myself with a fuzzy match.

So I started by exporting the whole list of computers from AD, which was easy from PowerShell.

Get-ADComputer -Server domain -Filter * | Export-Csv -NoTypeInformation c:\temp\computers.csv

Then I read it into R, using my old script as a starting point, and it took me only a couple of minutes to find it:

## fuzzy match on a computer name
library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)

# export the list of all computers from AD using PowerShell:
# Get-ADComputer -Server domain -Filter * | Export-Csv -NoTypeInformation c:\temp\computers.csv

computers <- read_csv("c:/temp/computers.csv") %>%
  rename_all(tolower)

to_find <- tribble(~name, "svr1-admn-po01")

computers %>%
  stringdist_left_join(to_find,
                       distance_col = "distance",
                       max_dist = 0.1,
                       method = "jw") %>%
  arrange(-distance) %>%
  select(name.x, name.y)

The results came out in a table; once I'd adjusted the max_dist parameter down enough, it was easy to eyeball the match.

Fuzzy matching in R for addresses

It’s a common problem: manually entered data has duplicate records, but not exact duplicates. It’s not exactly a classic problem of computer science but it’s common and feels generic.

Examples with overlapping problems:

  • Plagiarism, all student work at university is now passed through plagiarism databases
  • Matching records on a name (e.g., bank accounts with immigration records)
  • Addresses of customers
  • Comparing phone numbers or email addresses
  • Make / model specifications from multiple vendors
  • Comparing strings of DNA

Those examples run from very unstructured data to quite structured data. With the structured data we can probably do a lot to clean up mis-typings and so on, because there are valid lists of entities to check against.

A typical strategy is to use something that measures how different two values are. The classic is the Levenshtein distance, which counts the number of "edits" (insertions, deletions and substitutions) required to turn one string into another; e.g. Sam compared to Samm has an edit distance of 1 (one insertion), and Sam compared to Smm also has an edit distance of 1 (one substitution of a for m).
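
The stringdist package (used below) will compute these for you, so you can sanity-check the toy examples; "lv" and "jw" are the method names it uses for Levenshtein and Jaro-Winkler:

library(stringdist)

stringdist("Sam", "Samm", method = "lv")  # 1: one insertion
stringdist("Sam", "Smm", method = "lv")   # 1: one substitution

# Jaro-Winkler is on a 0-1 scale instead: 0 means identical, 1 means nothing in common
stringdist("Sam", "Samm", method = "jw")
stringdist("Sam", "Smm", method = "jw")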

In our case we are interested in addresses, which sit about halfway along the spectrum from unstructured to structured. We might consider effects like missing lines (i.e., some people don't include the county when writing the address, some do), and there are some tricks you can play with cleaning up city names and even street names to ensure edit distances are minimised (a sketch of that kind of cleanup is below). There are some more compsci-friendly techniques like using hashes of k-shingles, but frankly I'm out of my depth there.
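
As a small illustration of that cleanup, here's the sort of normalisation I mean, done before computing any distances. The function and the RD-to-ROAD substitution are placeholders for whatever your data actually needs:

library(tidyverse)

# upper-case, strip punctuation, expand one common abbreviation, squash whitespace
clean_address <- function(x) {
  x %>%
    str_to_upper() %>%
    str_replace_all("[[:punct:]]", " ") %>%
    str_replace_all("\\bRD\\b", "ROAD") %>%
    str_squish()
}

clean_address("12, High rd.,  Leeds")  # "12 HIGH ROAD LEEDS"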

Where does R come in?

Well, this is where having a decent library that you trust can help. I know that in the past, when people did this, they wrote and tested their own implementation of the Levenshtein metric. But with R, someone has already built a library for it, and included all the other stuff you haven't thought of. In this case, we want to match addresses using the Jaro-Winkler distance, another metric, developed by statisticians at the US Census Bureau for… record linkage, which is exactly this kind of matching! We can also find the R library fuzzyjoin, which works with tidyverse tables to do joins based on string distance metrics rather than exact key matches.

Then we need some safe address data to play with. Well, good ol' gov.uk comes to the rescue with this data set of all doctors' surgeries in the UK. I pull in two sets from different dates.

So here’s some code that uses those libraries to pull in the data and then do a fuzzy join. I also adjust the fuzziness of the match to see how it affects how “joiny” it is.

I'm using 100 rows from each address list because it runs faster; if we joined everything to everything, we'd get 10,000 candidate pairs. If we had perfect joins only, we'd get 100 or so (given all the doctors are still there in the same place!). It's surprising how slow it is, actually: even with just 9,000 records it is unbearable to wait for the 81 million comparisons to run. I guess the compscis and their shingling are maybe more useful than I thought!

So we can adjust the match threshold and see where the edge is for getting "useful" mismatches; that is, we discard places that are completely unalike or merely in the same city, but we pick up things that are nearly (or rather, suspiciously) the same!

Looks like in this case, the edge is about 0.2 and I think that those ones are worthy of human checking… I don’t think that I’d fire off a court summons or deliver an order based on this approach alone!

 

# fuzzy match sample using address data from data.gov.
# https://data.gov.uk/dataset/prescribing-by-gp-practice-presentation-level

library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)

cn <- c("date_part",
        "surgery_id",
        "address_line_1",
        "address_line_2",
        "address_line_3",
        "address_town",
        "address_county",
        "address_postcode")

may_data  <- read_csv("T201705ADDR+BNFT.CSV", col_names = cn)
june_data <- read_csv("T201706ADDR+BNFT.CSV", col_names = cn)

join_id <- full_join(may_data, june_data, by = c("surgery_id"="surgery_id"))




mismatched_id <- join_id %>%
  filter(is.na(date_part.x)) %>%
  bind_rows(join_id %>% filter(is.na(date_part.y)))


squashAddress <- function(data) {
  data %>%
    gather(address_part, address_part_value, -surgery_id, -date_part) %>%
    group_by(date_part, surgery_id) %>%
    mutate(address_full = paste(address_part_value, collapse = ",")) %>%
    select(date_part, surgery_id, address_full) %>%
    slice(1) %>%
    ungroup()
}




may_data_squashed <- squashAddress(may_data)
june_data_squashed <- squashAddress(june_data)

memory.limit(memory.limit() * 10)

# take a slice of the data for speed
trimData <- function(data) {
  data %>% arrange(surgery_id) %>% slice(1:100)
}




# using the jaro-winkler string distance
# but only take a subset of the data, just to speed things up
doTheJoin <- function(threshold) {
  trimData(may_data_squashed) %>%
    stringdist_left_join(
      trimData(june_data_squashed),
      by = c(address_full = "address_full"),
      distance_col = "distance",
      max_dist = threshold,
      method = "jw"
    )
}

#try different values for the JW threshold

# if you set the threshold = 1 then it matches everything!
everything_matches_everything_else <- doTheJoin(1)
# 0.4 gets "wrong" matches, things that are totally different addresses
total_mess_random_joins <- doTheJoin(0.4)
# 0.2 you get different surgeries at the same place
sensible_matches <- doTheJoin(0.2)
View(sensible_matches %>% filter(surgery_id.x != surgery_id.y))




# now we test what happens to the join results 
# as we vary the threshold
# we can see a fairly narrow range where JW distance varies 
# from highly selective to totally useless
threshold_list <- c(seq(0.2, 0.5, by = 0.02), seq(0.0, 1, by = 0.1)) %>% unique()

joinability <- purrr::map_dfr(threshold_list, .f = function(x) {
    doTheJoin(x) %>%
      dplyr::summarise(count_of_joined_rows = n(),
                       mean_distance = mean(distance, na.rm = TRUE))
  },
  .id = "threshold") %>%
  mutate(threshold = as.integer(threshold)) %>%
  inner_join(as_tibble(threshold_list) %>% rowid_to_column(var = "threshold"))




# then since this is R, we plot
joinability %>%
  select(-threshold) %>%
  dplyr::rename(threshold_value = value) %>%
  gather(measure_type, measure_value, -threshold_value) %>%
  arrange(threshold_value) %>%
  ggplot(mapping = aes(x = threshold_value, y = measure_value)) +
  geom_point() +
  geom_line() +
  facet_grid(measure_type ~ ., scales = "free_y")

VS Code and git and me

I think that VS Code is now my go-to editor. It looks like it's going to displace Notepad++ as my tool of choice.

I think that git is now my source control system of choice.

And they go together quite well.

 

Why VS Code:

  • After a few years away from programming, Visual Studio seems rather bloated. You might be used to it from using it every day, but there are now too many buttons, and I've realised that I never really knew what most of those buttons did anyway. I suspect that someone who grew up on C++ programming might know more.
  • It’s fast to load and I can have more than one open
  • Integrated shell, which I use often. As your code tools expand out to include command-line builds and command-line test runners, the integrated shell pays off, even though it doesn't seem like it would make a difference compared to running in another window. You can have bash and cmd; I use PowerShell.
  • PowerShell looks better in it, and I think that VS Code will become the PowerShell editor of choice and pull away from the PowerShell ISE. I like the debugging, although some people have told me they find it flaky.
  • There's probably an extension for what you want. I've not tried writing my own, but I've got them for R, Python, Ruby, Julia, etc.
  • Git interface is nice

What I don't like: a few baffling features like multi-cursor editing, and so many extensions that there are dozens and dozens that only scratch personal itches, but that's life on a marketplace.

Why Git:

Really I'm talking about TFS git, which isn't exactly plain git; it's more like GitHub, with issues and pull requests and so on. I probably agree with people more knowledgeable than me that Mercurial is a cleaner and better design for source control. And a wiser person than me said "Mercurial is probably better, but git has already won", and I think that's a bit sad but true #betamax

  • Pull requests: a better form of code review, and it's flexible and built in
  • Builds on branches are free; there's no reason that MS couldn't have done this for TFS version control, but they didn't, and IMHO it's a game changer
  • (minor point) Markdown documentation that lives with the code, renders in the browser and is editable there… that means you have the option of the documentation for the code living in the repo, and not in a wiki that is disconnected from it.
    • I think that wiki version control (and even, bleurgh, SharePoint version control) is mostly good enough, but I wonder if we'd ever use real version control for contracts or something like that.

What I don't like: it is mildly harder than centralised version control. For a full dev team it's better for the reasons above; for a non-dev team it's a little annoying sometimes. For instance: I committed to my local repo on master, but I can't push to master, so I need to move my commits to a local branch before I can push to a remote branch so someone else can merge them back to master? Having to understand this is annoying.

Good enough for Billy Jack Buzzard is good enough for me.

Diversity in tech jobs

I think that everyone knows that tech is less than diverse. Let's not get too heavy on the discussion of why I think that's bad. Let's just focus on opening up a new talent pool, which means that instead of choosing the best of 10 applicants for a position, you get to choose the best of 15.

But wait! Aren't the other 5 worse than the 10 you already looked at? Isn't it just more bad applications? Well, certainly, taking on a very diverse talent pool might require you to pass on immediately relevant skills in order to get staff with keen minds, passion and aptitude. But would we recognise latent talent?

Probably not.

Anyway. The whole point of this was to call out Martin Fowler's tweet about a non-profit called Mined Minds, which is setting up coding boot camps to retrain the huge over-capacity of miners in the US for coding, where there is a long-term shortage. Those people are probably not addressing any kind of gender diversity gap:


Billy Jack Buzzard tells it like it is.

What's interesting is how Mined Minds filters people for its courses. Yes, mining is a job where some people are smart peeps who just want a local job that pays good danger money, and yes, there are technical elements. I'm sure there are plenty of mechanical minds out there that are used to "debugging" technology of all kinds, even the heavy, greasy kind. But I am also fairly confident that there are a lot of people cutting coal who haven't ever used the part of their brains that does arithmetic, which is probably a blocker.

So how do they sort out the people who are more likely to be successful in this new career? I suspect it's something to do with this, which is linked on their application page: http://www.simcoachgames.com/booeys

Would you recognise a good gaming score as evidence of coding ability?

 

Data about packages

Finding packages and extensions and apps for everything from text editors to Spotify to programming environments* is 50% of the battle for productivity.

What better way to decide than with some data?

Here is some code for getting download data from CRAN on R packages:

install.packages("devtools")
library(devtools)
devtools::install_github("metacran/cranlogs")

cranlogs::cran_downloads(when = "last-month", packages = c("ggplot2"))
cranlogs::cran_downloads(when = "last-month", packages = c("FactoMineR"))
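
To compare packages properly rather than eyeballing two separate printouts, you can fetch both in one call and plot them. This assumes, as the cranlogs documentation describes, that cran_downloads returns a data frame with date, count and package columns:

library(cranlogs)
library(ggplot2)

downloads <- cran_downloads(when = "last-month",
                            packages = c("ggplot2", "FactoMineR"))

# one line per package, shared y-axis so the difference in scale is visible
ggplot(downloads, aes(x = date, y = count, colour = package)) +
  geom_line()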

 

or you could look at the reviews as well:

https://www.crantastic.org/popcon

 

 

*Of course, it’s all about vertically sharded value chains and creating an ecosystem that allows disruptors to flourish. Or something. Fintech goes here. And blockchain.

Ungroup in PowerShell

PowerShell often forms part of my data-hacking arsenal. When I get a text file that's so messy it's not fit to be CSV yet, I PoSh on it.

So, you've imported some data into PowerShell and you've got a list of stuff, but there are some keys that have more than one row.

You can look for the duplicates with:

$myData | group key

The output of Group-Object is a collection of group objects, each with a Count, a Name and the grouped rows, so you can easily filter them like this:

$myData | group key |? {$_.count -gt 1}

But then you want the original rows back, not these group objects…

Do this:

$myData | group key |? {$_.count -gt 1} | select -ExpandProperty group

This -ExpandProperty trick also works on any other nested object that comes back in a property from a select.

There is also an “ungroup” operation in the R package dplyr.
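
For comparison, the same find-the-duplicates-then-flatten pattern in dplyr looks roughly like this; my_data and its key column are placeholders:

library(dplyr)

my_data %>%
  group_by(key) %>%      # group on the key, like `group key`
  filter(n() > 1) %>%    # keep only keys that appear more than once
  ungroup()              # drop the grouping so later verbs see plain rows again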