Gitting better at git

I use git for my work with RStudio but only in a very crude “click this and then click that” way. You know the ritual: stage-commit-pull-push and pray that nothing goes awry.

Of course, anyone wise knows that you have to know git on the command line. Not to be a guru, but to be effective.

I’ve been working with this book today, and it’s excellent. I bought my own copy.

After about an hour with the book, I made a commit from the command line – in the embedded PowerShell terminal inside RStudio – and didn’t I feel like a grown-up?
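For the record, the whole click-this-click-that ritual is only a handful of commands. Here is a sketch in a throwaway repo (the file name and identity are made up for the demo), minus the pull/push since there is no remote:

```shell
# Practise the ritual in a throwaway repo (path and file are made up)
cd "$(mktemp -d)"
git init -q .

echo 'x <- 1' > analysis.R                 # pretend this is today's work
git status --short                         # what has changed?
git add analysis.R                         # stage it
git -c user.name=me -c user.email=me@example.com \
    commit -q -m "Add analysis script"     # commit it
git log --oneline                          # one tidy commit
# with a remote configured, you'd finish with: git pull --rebase && git push
```

Once this sequence is muscle memory, the GUI buttons stop being magic.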

What’s good about it?

  • lean and brief; well written
  • gets you working within 5 minutes
  • dispels fear by repeated practice

Caveat: the book goes deeper than my knowledge, so I can’t tell whether it contains the deepest advice. Also – for a windows dev – I’ve been on the command line a fair bit already.

I’m recommending this book to everyone who goes near git.


Waiter! Conference for 1, please!

So, I heard from a few people that they’d been to some conferences, and I realised that I’ve not been to any kind of external training for more than 5 years. I’ve been lucky enough to change jobs and need to learn some new stuff on-the-job, but no conferences.

So I decided to do a staycation-style conference and do it all at my desk. 

I asked my manager and got agreement that this stood in for out-of-office conferences. I wanted to get that “saturation” effect that you get from a conference where you spend all your time thinking about new ideas, so that you are actually working on those new ideas even when you aren’t watching the content. 

Below are notes on the talks etc. I’ve put my favourite talks close to the top, but YMMV.

What I did:

  • looked for chunks of content that are quite recent, rather than just watching random YouTube videos
  • spent as much of each day as I could doing it, to saturate my mind, which was trickier than I expected
  • paused the videos occasionally to go off and google things, make notes, etc.
  • sometimes I felt I needed to know more to get the best out of a session, so I did some mini videos/reading beforehand
  • watched most of the videos at 1.5x or 2x speed
    • this makes you a bit crazy after a while, and normal speed speech seems veeerrrrrrryyyyyy slloooooowww with huge pauses

Motherlodes of content

Main topics

The main topics I got into:

  • R
  • R + XXXX; where XXXX is a data science tech like Tensorflow, Spark etc.
  • bringing R to an organization (what I learned here is we are following a classic path…)
  • Nu-architecture
  • Docker / Kubernetes
  • Observability / devops++
  • Continuous deployment / release

No, I did not do any blockchain talks.

What I would do differently

Overall, I think it worked. I would do it again, but “turn it up to 11” and block off even more time. I also didn’t do any QCon talks, as the 2018 ones weren’t published yet. That’s a foolish thing to let block me, I know.

My notes

I’m Pwned. You’re Pwned. We’re All Pwned

  • Pwned Passwords has 320 million compromised passwords.
  • Shodan: a Google for IoT, which turns up many unlocked devices

Pros: fun overview of internet security

Cons: not much implementable information

Building a Raspberry Pi Kubernetes Cluster and running .NET Core

Great, just for the dizzying, vertigo-inducing stack of technologies:

Compiling a serverless function, in a Docker build, on a windows machine, targeting an ARM processor on a linux machine, so it can be pushed to a kubernetes cluster running an OpenFaaS serverless function.

Pro: really fun talk

Cons: hobbyist and educational rather than about real work, which isn’t my business

Machine Learning with R and TensorFlow (Rstudio Conf)

Pros: Great overview of tensorflow and R, great links onward for more info

Cons: if you already know TensorFlow it is less exciting

GitOps – Using Git as your source of truth for build, deploy and observability

  • Trying to encode infra as declarative config, rather than imperative “do this, do that” scripts that build a server
  • … then source-control that config
  • Then building on docker and kubernetes to implement it: compare reality to the config in real time and fix the drift
  • Essentially, instead of deployment being a push from a build, it’s a pull from the production system
  • Include the monitoring, security etc. in the config.
  • Describes the world of controls audits as being “full of 3rd party tools that don’t do half the things they say… it’s a world filled with psychopathic bullshit”
  • “a system that is observable should also be controllable”
  • Keeping production secrets in source control, and how to keep that safe
  • Monitoring is for the key metrics that you already know are important and need to see in a quick overview; observability is for everything else, particularly investigating problems.
  • With a complex distributed system, you should design some observable criterion about the system before you make the config change in production. You can’t “test” these changes, because you don’t really have a test system that responds the same way as production.
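To make “declarative config in git” concrete, here is a minimal sketch of the kind of manifest you would keep in the repo – a Kubernetes Deployment, with made-up names and image – which a GitOps agent then continuously compares against the running cluster and corrects:

```yaml
# Hypothetical app; the name, image and port are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shop-api
spec:
  replicas: 3              # desired state: three copies, always
  selector:
    matchLabels:
      app: shop-api
  template:
    metadata:
      labels:
        app: shop-api
    spec:
      containers:
        - name: shop-api
          image: registry.example.com/shop-api:1.4.2  # pinned tag = auditable history in git
          ports:
            - containerPort: 8080
```

The point is that this file, not a deploy script, is the source of truth: change the image tag in git, and the cluster gets pulled into line.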

Pros: good end to end talk on kubernetes / continuous deploy, good alternate view of production controls

Cons: very far from where most people are

Hadley Wickham: Managing many models with R

Hadley uses the gapminder dataset to show how to fit many models at the same time, using the purrr package.

Pros: good talk on modelling in R, very good quick summary of how to use lm and purrr

Cons: gapminder dataset is a bit distracting.


Testing in production

  • Want to be able to deploy a change to production within 5 minutes.
  • Increasing speed and accepting increasing risks
  • Fast rollback
  • Testing in production and deploying from trunk all the time means a rigorous way of making changes that are small enough to commit to master/trunk but not broken
  • And other things about using feature switches to do “dark launches”
  • Being able to see deployments on the monitoring
  • Casually stated that “of course you can’t do this with things that make payments”… but it’s not obvious why you can’t; after all, we’ve already done the testing that it “works on my machine”, so it is logically correct. Maybe the concern is that you can’t mock out a payments engine in production, or maybe that doesn’t differentially improve testing quality.
  • Other elements: mob programming
  • Monitoring driven development: for small changes in performance
  • 15 pairs each deploy about twice a day

Pros: quite similar to what we already do, so lots to like, explains how to be awesome at incremental improvements on existing functionality

Cons: confirmation bias, doesn’t offer a lot that is really really new, testers won’t like it, doesn’t explain how to scale this up to do breaking changes other than using feature toggles


Observability: it’s not just an ops thing

  • Not about seeing that “on average” most queries are completing in 5s; no one cares about the average, they want to know why their query isn’t working
  • Exploring data: we want sub-second response for 95th percentile, we don’t want to break someone’s flow while they are investigating
  • More on feature flags, and deploy before you release.. And then adding this feature flag to the observability data
  • Monitoring driven development, where you define the metric you expect to move before you make the change
  • Using sampling as a way of keeping a long history without keeping all the data.

Pros: exciting talk about really hard problems, advocating a close dev/ops working relationship

Cons: slightly chaotic delivery, questionable direct relevance to anyone below Facebook scale, and no one asked her how they made the transition, or whether they were born on that side of the world of complexity

What is programming anyway?

  • Discussion of how we can teach programming to non-programmers, including children
  • Is programming like natural language? Or more like maths?
  • Metaphors matter because the more sure that people are that ability is innate rather than trained, the less women participate in it
  • The language metaphor helps because everyone can do it but only after practice, and you need to maintain that practice.

Pros: interesting if you want to broaden the appeal of coding (i.e., get people doing data science!), create diverse programming jobs, make people believe that code is the solution

Cons: slow to start, not really about work

Manning: Docker in Motion

Pros: solid motivation and intro to docker

Cons: free content ends before we learn enough, but maybe the full course is great


Docker in 5 minutes

Pros: gives a bit of the history, very fast, fun

Cons: old

The children’s illustrated guide to Kubernetes:

Very short; introduces the vocabulary.


Introduction to microservices, Docker and Kubernetes

  • not a conference talk, but a home-rolled one
  • A demo of getting a docker container running, and then sending it into kubernetes. 
  • Start at the demo point and watch at 2x. 🙂

Pro: decent demo, all the deets

Con: slow to start, irrelevant attempt at explaining microservices, not better than other explanations, books, etc.


NDC: Identity server for ASP.NET core 2

5 verbs of authentication: SignIn, SignOut, Forbid, Authenticate (take a credential and turn it into a claims principal), Challenge

Pros: detail on new features of identity server and how it works with authentication providers

Cons: you need to know how IDS works and integrates into everything, hard to get excited about if you aren’t deeply familiar with ASP.NET core v1

Kubernetes for sysadmins

  • Allowing kubernetes to mount a filesystem that is raw and not on the host machine or the node… so you can detach the running process and re-attach another one. Would that really work for a database?
  • But of course that assumes that the storage is fault tolerant
  • Actually a pretty good demo of a scaled out web app running in kubernetes

Pros: good speaker, one of top faces of kubernetes, good demo of bringing up an app in kubernetes

Cons: linux focus

Sports data viz in R

  • Suggests using ggvis in shiny when plotting large datasets, because of render time

Pros: good introduction to the different options, e.g. d3 and plotly

Cons: other than the comparison of JS to R, not much more

Large scale machine learning

  • Showing RStudio on the Google Cloud ML demo
  • And deep learning on GCML, and how you train lots of models on that.

Pro: short, nice demos

Con: not implementable for us

Deploying tensorflow models

  • About turning tensorflow network models into services
  • Which you can do with an r package.
  • Or you can deploy the model with RStudio Connect on-prem
  • Or you can encode the keras model into javascript and run it standalone in a web page

Pros: strong demo

Cons: relevance for us

Building spark ML pipelines with Sparklyr

Pros: strong on demo, plenty of example code.

Cons: short on motivation, doesn’t say why Spark.


Language acquisition in Minecraft with reinforcement learning


Pros: totally different talk, totally different learning method, interesting minecraft links!

Cons: talk isn’t great.


Push button publishing in rstudio connect

  • Some interesting thoughts about using R as a first-class member of the overall corporate dev ecosystem, and the stages you might go through up until that point.
  • Fantastic sales pitch on RStudio Connect

Pros: short, good demos

Cons: doesn’t admit our developer-centric controls model


Parameterized R markdown

Practical demonstration of how to do this in R Markdown, and how to use it on an RStudio Connect server

Pros: very practical report for R programmers, short, just a few minutes

Cons: maybe you knew already from reading about Rmarkdown

Drilldown data discovery with Shiny

Pro: nice demo of an interactive shiny app, also a nice link from an R analysis to a Google Docs data set.

Con: quite specific about the UI stuff, maybe unsuitable for an org that has full-time developers.


The R admin is RAD

Pros: good ambition on introducing R, great demo on shiny

Cons: asks you to have faith that it’s good, no concrete answers obviously.


R panel discussion

  • How to scale up data science team and embed R
  • Interesting comments around not worrying about how to productionize and change-control these data science efforts: the people doing data science need to be able to work freely without worrying about that *yet*; if they have to worry about it, they won’t create
  • “..crazy things are going to happen, people are going to take a million by a million matrix and multiply it by another million by a million matrix…”
  • “we value innovation more than stability”
  • Preventing people from getting attached to a physical environment – like a weak variant of chaos monkey – where you move onto new servers to enforce independence from infrastructure.
  • “Scientific debt” for firms
  • Validating open source tools
  • “don’t confuse change management with transition management, change management is about ensuring people have new tools and have access to those tools and skills provisioning hardware, transition management is this hard thing where  you are changing people’s identity they were previously an expert in the thing that they did and now they are going to have to be new at this thing. And in their minds they were an expert in this thing and they were this person… and identity and people and their feelings.”


2nd R panel

  • Tidyverse discussion
  • biggest insight is that there’s an effort to get stats models implemented in a tidy way in a paid-for effort
  • No real merits over and above following these people on twitter. Sorry.




NDC: Implementing Authorization for Applications & APIs

Demo-based talk on a sister project of identity server, for managing the authorization policies that result from modelling real business processes.

Pros: practical examples, relevant for us as we use identity server

Cons: only interesting if you plan to use policy server


Compositional UIs – the Microservices Last Mile

  • Good blast through “what is a microservice”
  • What that looks like in real life in corporations

Pros: he’s a good speaker, interesting topic for architects, big problems that need big thoughts.

Cons: a bit slow to start – it took 20 mins to get to Conway’s law – and it’s about big things that only apply to big teams writing conjoined web apps facing large numbers of users: big epic problems that he then proposes some actual code to solve, when he’s really talking about problems of organisational dysfunction.


Deploying Windows Container based apps using Kubernetes

  • Interesting side point: windows, linux and ARM devices now have one workflow for all of these
  • SQL Server installs in a container.
  • Dev environments could be in a container, the same as the staging/prod environments
  • Grafana vs. Splunk?

Pros: great coverage of docker, good introduction, decent steps towards kubernetes and using it in a mixed windows-and-linux environment

Cons: lots of chat, no demos.


Hack your career

I summarise: get a blog, get GitHub, get Twitter, do some work in public, get known for being a blogger, and speak at meetups and conferences; then, when you get made redundant, turn that window of money-with-no-work into success, {repeat as needed…}, and you can work from home and live in a mansion.

Pros: inspirational, living the tech dream, some realistic messages

Cons: lacks enough specifics to be really useful, doesn’t make enough of the sacrifices and compromises that would surely be needed


Diverse roles in tech will lead to diversity…?

Welcome to international women’s day.

It’s not controversial to say that technology jobs aren’t filled by men and women equally. I think that there is an opportunity for more diverse working methods and more diverse thinking producing better results. I can’t prove this, I think it is true because most of my business-facing development work would benefit from multiple points of view.

Great; so how would we attract diverse talent… And more importantly, how do we ensure that we get the benefits of that diversity? Because – in my opinion – acquiring diverse talents and then forcing them into the same suits and ties, the same modes of interaction is asking for a disappointment.

So how do we allow diverse talent the chance to flourish in technology? There’s only one way to write code, right?

Well, I think that R and other data analysis languages are a chance to create some diverse roles that not only are filled by diverse backgrounds… but get woven into current roles. That type of programming is accessible because it stays close to what you already know and doesn’t require you to change tribe. So it’s not like you need to be an ex-stats PhD to use it, you don’t need to be an ex-developer…  You don’t need to be an ex-anything, you can be an accountant who uses it now, you can be a business analyst who mangles HR systems or anything else, all you need is to be motivated and believe that you can.

That accessibility actually carries through into the language, the way you use it to solve problems, the kind of problems you want to solve with it, and the kind of outputs you get (rich data, models, charts, diagrams, etc.). All diverse, all open and all accessible and attractive to newcomers.

I think that this has happened before: home computers democratised computing and took it out of university maths departments; web tech put design right in the middle of the developer job, with coding tech like HTML, CSS and even javascript. A web design shop now has many diverse roles – plain coders, hybrid designer/coders and visual designers – all working together. That just didn’t exist before 1995. Games have probably moved the same way, though you could challenge me on the demographics there.

So, it might not directly be an issue for women, but it could be a factor – just look at the number of #RLadies out there…

And now for an R pun: I think we should do something with suff-R-gette?

PowerShell: log all your parameters quickly

In later versions of PowerShell, rather than logging each parameter passed to a function by hand, you can do this:

function testEmail{

    param($streetAddress, $displayname, $jobtitle, $deskphone, $mobile)

    # $PSBoundParameters is an automatic variable: a dictionary of every parameter actually passed
    write-host $PSBoundParameters
}

and that puts out this:

[displayname, Saptarshi Sengupta] [jobtitle, Tester] [streetAddress, Wimbledon Bridge House, 1 Hartfield Road

London, SW19 3RU] [deskphone, +44-20-7042-2219] [mobile, +44-788-131-2765]

Nicely formatted, and it adapts as you change the parameter list.

Learning programming with the Software Carpentry project

I’m interested at the moment in people who are learning programming in R. Because it’s R, the learners typically come from a background without the huge amount of tacit information that is part of the programmer’s body of knowledge*

I’ve been aware of a project called “software carpentry” ** for a while. It’s intended for researchers – particularly graduate students working in research – to equip them with the fundamentals of working with code and data to carry out research, but not to try to make them programmers. Having been through that myself, without anyone telling me how to store data, or read it in, or what languages I should typically use, I can say that having to discover all this for yourself is a waste of time.

In that spirit, they have also produced this book on learning how to teach.

The core message being: you can – and should – learn how to teach, not try to figure it out alone.

Also, you should expect to work with master teachers coaching you as you do the actual teaching, and with a group of peers who are also learning, because this will shorten the road to being a master teacher (some people otherwise never make it).

It’s here:

Of course, because it’s all about type-A academic people, there are lots of references to studies and data to support it, but it also has real exercises and things to try.

This is familiar ground to those who have spent time thinking about how to learn, but seeing the expression “teaching as a performance art” in print made me very happy; that’s something that I’ve long believed from close observation of teachers in training.


*I referred to these people as “lay programmers” which has a rather sniffy air to it, like “lay” meaning “not good enough”. I guess it’s jargon from one programmer to another that says “these people don’t have the Body Of Knowledge (BoK) therefore you will expect to see these pathologies in what is produced” (exactly which pathologies are part of the BoK of course..). But yes, it is a bit sniffy. Sorry.


** you could argue software carpentry has the same sniffy tone. I don’t. We could have a whole other discussion about the two sides of the construction industry.

R quick sample: downloading a file that you couldn’t use in Excel

In this example, I combine a data file that is annoyingly large (1.2 Gb) with a quick bit of exploration to start to see what is in it.

The data set is one from the UK government on prescriptions from doctors, claiming to be every prescription from every doctor in the UK.

To get useful insights from the data, I think you’d need to:

  • automate downloading many files, this 1.2 Gb is only 1 month’s data
  • parse the name of the item being prescribed, e.g., brand names, compare different dosages
  • start to look for trends in ordering, to see the depth of the order book of the companies
  • or to see the rise of generic non-branded drugs
  • combine with geodata to see where it’s happening
  • combine with census data to look for health concerns in different areas
  • .. come on! there must be more!

# how to download a file? google it!

# from here:

library(tidyverse) # for read_csv, the dplyr verbs and ggplot2

# For each practice in England, including GP Practices, the following information
# is presented at presentation level for each medicine, dressing and appliance
# (by presentation name):
# - the total number of items prescribed and dispensed
# - the total net ingredient cost
# - the total actual cost
# - the total quantity
prescription <- read_csv("C:/Downloads/T201708PDPI+BNFT.CSV")
prescription <- prescription %>%
 rename(actual_cost = `ACT COST`) %>%
 rename_all(tolower) %>%
 mutate(actual_cost = as.numeric(actual_cost))
prescription %>%
 ggplot(mapping = aes(x = actual_cost)) + geom_histogram()

prescription %>%
 summarise(mean = mean(actual_cost),
 median = quantile(actual_cost, 0.5),
 upper = quantile(actual_cost, 0.75),
 percent_90 = quantile(actual_cost, 0.9),
 percent_99 = quantile(actual_cost, 0.99),
 percent_999 = quantile(actual_cost, 0.999))

prescription %>%
 filter(actual_cost < 5000) %>%
 ggplot(mapping = aes(x=actual_cost)) + geom_histogram()
prescription %>%
 filter(actual_cost < 1000) %>%
 ggplot(mapping = aes(x=actual_cost)) + geom_histogram()

prescription %>%
 filter(actual_cost < 10) %>%
 ggplot(mapping = aes(x=actual_cost)) + geom_histogram()

Fuzzy match gets useful!

So I did another post on fuzzy matching, but this is one where I really used it!

Someone sent me the name of a computer, but they sent the wrong one – I couldn’t ping it. Clearly they’d made a typo. Rather than wait another day – they are in another time zone – I thought I’d find it myself with a fuzzy match.

So I started by exporting the whole list of computers from AD, which was easy from PowerShell.

get-adcomputer -server domain -filter * | export-csv -NoTypeInformation c:\temp\computers.csv

Then I read it into R, using my old script as a starting point, and it took me only a couple of minutes to find it:

## fuzzy match on a computer name

library(tidyverse)
library(fuzzyjoin) # for stringdist_left_join

# export the list of all computers by doing PoSh
# get-adcomputer -server domain -filter * | export-csv -NoTypeInformation c:\temp\computers.csv

c <- read_csv("c:/temp/computers.csv") %>%
 rename_all(tolower) # so the AD "Name" column matches to_find's "name"

to_find <- tribble(~name, "svr1-admn-po01")

c %>% stringdist_left_join(to_find,
 by = "name",
 distance_col = "distance",
 max_dist = 0.1,
 method = "jw") %>%
 arrange(-distance) %>%
 select(name.x, name.y)

Results came out in a table; once I’d adjusted the max_dist parameter down enough, it was easy to eyeball the match.