New laptop with Chocolatey

I’d heard of Chocolatey before, but when I was rebuilding my laptop this time I was nudged towards it at just the right time.

I keep a list of everything I install onto my laptop so I can put it back together again when I lose my machine.

But this time, when I was putting back tools like WinMerge, I did it like this:

choco install winmerge

at the PowerShell prompt.

It doesn’t work for everything, but it made a few steps a lot quicker. Perhaps for teams with standard tools this would help. SCCM is another way, but Chocolatey packages are maintained by someone else, so some risks apply.


R and global variables, unlearning programming habits for data analysis

I’ve said many a time that doing data analysis in R is not programming, and yet I have the habits of a lifetime to undo.

This is not about encapsulation and abstraction, it’s about getting a result. It’s less about unlearning programming and more about learning the idioms of data analysis, and the way to use data analysis tools like RStudio.

I’m working on a quite involved piece of R code that I’ve been tinkering with and adding to for a few days, and it’s getting a little bit big. It’s only a few hundred lines, but already (as with PowerShell) without the structure of modules, classes and namespaces it’s getting messy.

So big, in fact, that I’ve started to separate sections with big comment banners like this:
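Something along these lines (an illustrative banner, not the exact one from my script):

# =============================================================
# STEP 2: TIDY THE DATA
# =============================================================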

When you have a comment banner, you have a problem.

 

Global variables are not evil

That global variables are evil is such a piece of folklore amongst developers, like “GOTO considered harmful”, that it’s hard to give up.

My program follows the classic data analysis model from R for Data Science: import, tidy, then a loop of transform, visualise and model, and finally communicate.

I’ve just realised what I should be doing: I should have a few R script files, each of which does a different bit, and they can communicate via global variables!

No! Wait, come back!

 

RStudio is not an IDE!

RStudio isn’t for programming. It’s a data science whiteboard, for playing with ideas. You keep the data in the environment while you are working, like a clipboard. The variables aren’t even global to the script! They live outside the script, and they even survive when you shut RStudio down and start it again…

Use the global. I think RStudio projects are perfect for embracing the global variable.
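A minimal sketch of the pattern, assuming one RStudio project with a script per stage (the file names, data file and variables are made up for illustration):

# 01_import.R
raw_sales <- readr::read_csv("data/sales.csv")   # leaves raw_sales in the global environment

# 02_tidy.R -- relies on raw_sales already being in the global environment
library(dplyr)
sales <- raw_sales %>%
  filter(!is.na(amount)) %>%
  mutate(month = format(order_date, "%Y-%m"))

# 03_model.R -- relies on sales left behind by the previous script
fit <- lm(amount ~ month, data = sales)
summary(fit)

Run them in order from the RStudio editor, or source() each one; every script just picks up whatever the previous one left in the global environment, and the project keeps that environment around between sessions.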

 

Text Mining in R

So I heard from someone that they are using R to mine text, to look for sentiment in statements about the market.

I thought I’d give it a try, but using the Project Gutenberg text of Jane Eyre instead.

I used the tm (text mining) package because I found it first, though I haven’t done any of the real work of analysing text (like looking for correlations between words).

But still, it got me started.

#text mining
library(tm)
library(wordcloud2)
library(tidyverse)

docs <- Corpus(DirSource(pattern = "text_source*", ignore.case = TRUE, encoding = "UTF-8"))

# don't use this, it seems to break everything
# inspect(docs)

# clean the docs
docs <- tm_map(docs, removePunctuation)
# stopwords is a very slow step, avoid running it in a demo
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("will", "now", "one", "said", "like", "little"))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)

# build the document-term matrix and count how often each word appears
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing = TRUE)

# keep the 1000 most frequent terms
tops <- freq[head(ord, 1000)]

# drop the very common words so they don't dominate the cloud
wf <- tibble(word = labels(tops), count = tops) %>% filter(count < 500)

wordcloud2(data = wf)

 


Power BI working with R scripts in RStudio

Just found a nice feature in Power BI.

You can use RStudio as the editor for the script behind an R visual, and Power BI creates a dataset that can be loaded into RStudio outside of the Power BI context!

This is great, because building a dataset in Power BI is really easy: just a few ticks and it pulls only the columns you want, from any of the sources Power BI loads, and RStudio has the better experience for working with R code and plots.
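Once the dataset is over in RStudio you are just writing ordinary R against a data frame called dataset, which is the name the R visual script uses inside Power BI. A minimal sketch, assuming the dataset has columns called date and sales (hypothetical column names):

library(ggplot2)

# `dataset` is the data frame Power BI passes to an R visual
ggplot(dataset, aes(x = date, y = sales)) +
  geom_line()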

Just one bad thing: you have to cut and paste to get back into Power BI! Yuck.

Read more here

Message architectures are different?

We have a number of teams that are considering – or already moving towards – message-based architectures. We made some mistakes coupling applications and databases; will we make mistakes with messaging that we don’t know about yet?

What is message-based?

I’m sure you can find some great definitions online.

Systems (driven by people or other systems) interact by putting messages into a store that offers some guarantees about storing each message exactly once. The messages can carry metadata that labels them in such a way that systems know something about the content of the message. We might have a queue of messages that preserves order and labels messages with a “topic”, or we might have multiple queues, each for a different use.
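To make the metadata idea concrete, here is a minimal sketch of a message envelope as JSON, built in R with the jsonlite package (the field names and topic are made up for illustration, not taken from any particular broker):

library(jsonlite)

# a hypothetical envelope: body carries the payload, the rest is metadata
# the queue can use for routing, ordering and de-duplication
msg <- list(
  topic     = "orders.created",
  messageId = "0001",
  timestamp = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
  body      = list(orderId = 42, amount = 99.95)
)

toJSON(msg, auto_unbox = TRUE, pretty = TRUE)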

For me the big questions are:

  • Who are the actors?
  • When do actors act?

Who are the actors? Many-to-many interactions explode the client-server picture

From the first time-sharing mainframes, we have had central “servers” that control a resource, with many clients calling in. With messaging that model might still be true, but isn’t always. Now the message queue is the central resource, but the things sending and pulling messages don’t have to be one to one. It also means that integration between systems that you built and systems that you bought changes, sometimes for the better.

Also: you now have no idea who’s doing your work, but on the plus side you don’t need to care who’s doing the work, so when it changes no harm done.

When do actors act? Timing is everything

The timing of a synchronous call to the “server” is a simple thing (well, not really, it’s actually hella complex, but network sockets fix most of it). You call, it responds. With messaging, that is no longer true. Threading is implicit now, and it’s everywhere. If you previously got away with not considering timing, or ever used thread.sleep to fix something, those days are gone.

There are two things about messaging systems that make a difference to the timing: the message transport and the message distribution. You might be able to have guaranteed delivery and a guaranteed order for your messages, but if you do you are probably giving up some other things. Also, messaging systems only get interesting when you have a message distributor that lets actors “subscribe”, get push notifications, or distribute work to multiple clients.

Where’s my schema?

We used to understand changes to a database schema, and how to make those changes. When we have a new database schema we have to migrate data. We know which changes in business requirements are easy to absorb, and which are hard. We know how to make a database schema that can absorb some changes. We’ve learnt how to source control our database schemas, or even test the database.

With messaging a lot of that is gone, but we still have structure. The different ways of listing messages by topic and across different queues offer a lot of ways of structuring the data on the queues. That means you have a lot of options and you can change things easily. But! That’s scary as well. Things that change can break. How do we source control all that? And who does the work? Developers, or the infrastructure/ops staff who are admins on the machine?

What’s interesting is that you don’t need to pick one model; when messages are used you can replicate them into multiple structures independently. Of course, more formats means more work.

Read the book

This book is great. If you are working with messages you should read it. It shows what you can do when you stop thinking about messaging as a way of doing RPC and really put the message queues at the heart of your architecture.

Enterprise Integration Patterns

Here’s the site of the book, which is slightly less easy to read. For me, the best part is this overview of application integration options.