Text Mining in R

So I heard from someone that they are using R to mine text, to look for sentiment in statements about the market.

I thought I’d give it a try, but using the Project Gutenberg text of Jane Eyre instead.

I used the tm (text mining) package because I found it first. It got me started, though I haven’t yet done any of the real work of analysing text (like looking for correlations between words).

But still, got me started.

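# text mining

library(tm)          # Corpus, DirSource, tm_map, DocumentTermMatrix
library(dplyr)       # %>% and filter
library(tibble)      # tibble
library(wordcloud2)  # wordcloud2

# read every file in the working directory whose name matches the pattern
docs <- Corpus(DirSource(pattern = "text_source*", ignore.case = TRUE, encoding = "UTF-8"))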


# don’t use this, it seems to break everything

# inspect(docs)



# clean the docs

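# lower-case first so that capitalised stopwords ("The", "And") are matched below
docs <- tm_map(docs, content_transformer(tolower))

# stopwords is a very slow step, avoid running it in demo
# (run before removePunctuation so that entries like "don't" still match)
docs <- tm_map(docs, removeWords, stopwords("english"))

docs <- tm_map(docs, removeWords, c("will", "now", "one", "said", "like", "little"))

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, removeNumbers)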




dtm <- DocumentTermMatrix(docs)

freq <- colSums(as.matrix(dtm))

ord <- order(freq, decreasing = TRUE)

# the 1000 most frequent terms
tops <- freq[head(ord, 1000)]

# keep only the words that occur fewer than 500 times
wf <- tibble(word = names(tops), count = tops) %>% filter(count < 500)


wordcloud2(data = wf)
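
The "real work" I mentioned, looking for correlations between words, can be sketched in base R without any packages. This is only a toy under stated assumptions: the four sentences are made up rather than taken from the book, and in practice each "document" would be something like a chapter. I believe tm also has a findAssocs() function that does this kind of thing on a DocumentTermMatrix, but I haven't tried it here.

```r
# Toy documents (made up for illustration; not text from Jane Eyre).
toy_docs <- c(
  "the red room was cold and the red curtains were drawn",
  "the garden was cold and wet",
  "the red room frightened me",
  "the garden was bright and warm"
)

# Split each document into lower-case words.
tokens <- strsplit(tolower(toy_docs), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per word.
toy_dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# Correlation of every word's counts with the counts for "red".
assoc <- cor(toy_dtm)[, "red"]

# Words that tend to appear in the same documents as "red" come first.
print(round(sort(assoc, decreasing = TRUE), 2))
```

Words that turn up in the same documents as "red" get a positive correlation, and words that turn up in the other documents get a negative one. With only four documents the numbers mean very little, but the shape of the calculation is the same on a real corpus.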


Power BI working with R scripts in RStudio

Just found a nice feature in Power BI.

You can use RStudio as an editor for the script behind an R visual, and Power BI creates a dataset that can be loaded in RStudio outside of the Power BI context!

This is great, because building a dataset in Power BI is really easy: just a few ticks and it pulls only the columns you want, from any of the sources Power BI can load. And RStudio gives you a good experience for working with R code, plots and so on.

Just one bad thing: you have to cut and paste to get back to Power BI! Yuck.

Read more here

Message architectures are different?

We have a number of teams that are considering, or already moving towards, message-based architectures. We made some mistakes coupling applications and databases; will we make mistakes with messaging that we don’t know about yet?

What is message-based?

I’m sure you can find some great definitions online.

Systems (driven by people or other systems) interact by putting messages into a store that offers some guarantees about storing messages exactly once. The messages can have metadata that labels them in such a way that systems know something about the content of the message. We might have a single queue that preserves order and labels messages with a “topic”, or we might have multiple queues, each for a different use.

For me the big questions are:

  • Who are the actors?
  • When do actors act?

Who are the actors? Many-to-many interactions explode the client-server picture

From the first time-sharing mainframes, we have had central “servers” that control a resource, with many clients calling in. With messaging that model might still be true, but isn’t always. Now the message queue is the central resource, but the things sending and pulling messages don’t have to be one to one. It also means that integration between systems that you built and systems that you bought changes, sometimes for the better.

Also: you now have no idea who’s doing your work, but on the plus side you don’t need to care who’s doing the work, so when it changes no harm done.

When actors act? Timing is everything

The timing of a synchronous call to the “server” is a simple thing (well not really, it’s actually hella complex but networking sockets fix most of it). You call, it responds. With messaging, that is no longer true. Threading is implicit now and is everywhere. If you previously got away with not considering timing, or ever used thread.sleep to fix something, then those days are gone.

There are two things about messaging systems that make a difference to the timing: the message transport and the message distribution. You might be able to have guaranteed delivery and a guaranteed order for your messages, but if you do then you are probably giving up some other things. Also, messaging systems only get interesting when you have a message distributor that lets actors “subscribe”, sends push notifications, or distributes to multiple clients.
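
To make the “subscribe” and “distribute to multiple clients” idea concrete, here is a toy in-memory broker. It is a sketch only: every name in it (new_broker, subscribe, publish, history) is made up for illustration and is not the API of any real messaging product. It also delivers synchronously, inside publish, which is exactly the simplification a real broker does not give you.

```r
# A toy broker: stores every message and hands it to each subscriber of its topic.
# All names here are hypothetical, invented for this sketch.
new_broker <- function() {
  subscribers <- list()  # topic name -> list of handler functions
  messages <- list()     # every message, in arrival order

  subscribe <- function(topic, handler) {
    subscribers[[topic]] <<- c(subscribers[[topic]], list(handler))
    invisible(NULL)
  }

  publish <- function(topic, body) {
    msg <- list(topic = topic, body = body)
    messages[[length(messages) + 1]] <<- msg
    # Fan out: every subscriber of the topic sees the same message.
    for (handler in subscribers[[topic]]) handler(msg)
    invisible(length(messages))
  }

  list(subscribe = subscribe, publish = publish, history = function() messages)
}

broker <- new_broker()
billing_seen <- character()
audit_seen <- character()

# Two independent actors subscribe to the same topic.
broker$subscribe("orders", function(msg) billing_seen <<- c(billing_seen, msg$body))
broker$subscribe("orders", function(msg) audit_seen <<- c(audit_seen, msg$body))

broker$publish("orders", "order-1")
broker$publish("payments", "payment-1")  # stored, but nobody is subscribed
```

The sender never names billing or audit; it only names the topic. That is the many-to-many picture, and it is also why you stop knowing who is doing your work.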

Where’s my schema?

You used to understand changes to database schema, and how to make those changes. When we have a new database schema we have to migrate data. We know what changes in business requirements are easy to absorb, and what is hard. We know how to make a database schema that can absorb some changes. We’ve learnt how to source control our database schemas, or even test the database.

With messaging a lot of that is gone, but we still have a structure. The different ways of listing messages by topic and in different queues offer a lot of ways of structuring data on the queues. That means that you have a lot of options and you can change things easily. But! That’s scary as well. Things that change can break. How do we source control all that? And who does the work? Developers, or infrastructure/ops staff who are admins on the machine?

What’s interesting is that you don’t need to pick one model; when messages are used you can replicate into multiple structures independently. Of course, more formats is more work.
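
A small sketch of what “replicate into multiple structures independently” might look like. The messages and field names (customer, amount) are invented for the example, and a plain R list stands in for the queue; a real system would read from a broker.

```r
# The same stream of messages (made-up data), as a consumer might receive it.
order_messages <- list(
  list(topic = "orders", customer = "acme",   amount = 120),
  list(topic = "orders", customer = "globex", amount = 80),
  list(topic = "orders", customer = "acme",   amount = 30)
)

# Structure 1: a running total per customer.
amounts   <- vapply(order_messages, function(m) m$amount, numeric(1))
customers <- vapply(order_messages, function(m) m$customer, character(1))
totals <- tapply(amounts, customers, sum)

# Structure 2: a flat table, one row per message, for reporting.
report <- do.call(rbind, lapply(order_messages, as.data.frame))

print(totals)
print(report)
```

Neither consumer knows the other exists, so either structure can be rebuilt or thrown away without touching the other. The cost is the one already mentioned: two formats to keep working when the message changes.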

Read the book

This book is great. If you are working with messages you should read it. It shows what you can do when you stop thinking about messaging as a way of doing RPC and really put the message queues at the heart of your architecture.

Enterprise integration patterns

Here’s the site of the book, which is slightly less easy to read. For me, the best part is this overview of application integration options.


Enabling the cloud

TL;DR: I worry that our on-premise built-from-scratch applications are reaching their limits. More of the same won’t help us keep up with the best of breed. Possibly the sweet spot is open-source solutions running on the cloud; we get exceptional technology with someone else handling the awkward bits. Maybe that could transform how we build with greater agility, scale and lower costs. But what are the enablers?


Caveat: contains opinions. Some of this is culled from others’ opinions and whispers, but the blame for the whole thing is mine alone. I know I haven’t thought of everything; I’m just looking for upside to balance .


Why the future is cloud?

Let’s recap for those at the back:

  • Lower costs
  • Much greater flexibility
  • Freedom from the patch cycle
  • Best-of-breed applications


We could just go along with this and get some VMs in the cloud. That alone would be decent. It would probably reduce costs a lot, and the freedom from the patch cycle would be a substantial step. Let’s not forget that on a modern stack we are talking about

  • Patching the .NET framework
  • Patching SQL Server
  • Patching the OS
  • Patching VMware and the VMware management system
  • Patching the SAN and blade controllers
  • Patching the firmware (rarely)


And don’t forget that this patch cycle is part of the vendor cycle, one of their tools to steer us along. When they stop offering patches and start saying “that’s fixed in the latest version, you need to talk to the sales team” you know where this is headed.



What is clear is that cloud installations are moving faster than on-premise solutions; even Microsoft is starting to deliver features in cloud solutions which will not be available on premise (e.g., SharePoint, business intelligence). Concentrating on a single code line means that development starts to go even faster.


But if we move beyond using VMs on cloud hardware, what are we moving to? Won’t we risk the same level of vendor lock-in that we have in hardware?


Open source systems are the key to cloud

With closed-source cloud, we are just buying a pig in a poke, locking ourselves into a vendor, no better than being locked into a hardware vendor. The “only” advantage would be lower costs, but what about when we factor in the cost to migrate? And then to migrate away to another vendor?


However, maybe an open-source cloud-hosted solution offers a “3rd way”. When we get an open-source hosted solution, we retain the option of switching vendors because many vendors will offer a similar service. We can switch with a low level of disruption, even if it isn’t for free. Or, in extreme circumstances, we can come back on-premise with a self-hosted solution while we rethink and keep the lights on.


Open source is a mess?

Open source is hard to handle. Support for proprietary software sucks, but support for open source is worse (don’t tell me the community supports it unless your Stack Overflow rep is over 5k). Often with open source there is a complex set of dependencies, or a hardware requirement. However, if we can get it provisioned on cloud, we get a good installation of a best-of-breed system for pennies on the dollar compared to what we can do in house. Yes, compared to on-premise open source, we lose the ability to run custom builds, but that is a rare case. Don’t forget we miss out on the patch cycle!


Open source is best?

Open-source is not one speed. Different systems and products are moving differently, but that is also true for proprietary systems! The open source “sandstorm” is still moving quickly and predictably, even if the “storm front” isn’t a straight line.


Removing the blocks to cloud

So open source on cloud is the target, how do we get there?


Key enablers:

  • Freedom from co-located users and machines
  • Freedom from a whole machine
  • Security
  • Data sensitivity
  • Legal, social and cultural obstacles


Co-located machines

We need to break this habit, for sure. We will never have all users in the same location as the server, and we already don’t! The future will surely make this more common. The minimal change is to use managed data centres containing our own physical kit. Even that will put data and people in different locations, which makes it much more likely that servers will be communicating across locations.


Whole machines

IMHO we need to start getting off whole machines and onto something smaller. Why? Because there’s no point in having a whole VM in the cloud; that defeats multi-tenant cloud, which wants to span hardware. VM migrations across physical machines in a cluster are a pain, causing “stop the world” pauses in processing and “split brain” and all sorts. We need a lighter container, and they are maturing very rapidly. I use “container” loosely/incorrectly; I mean a thing that is not a server with a full OS. It might be a headless server or whatever, I’m no expert. I think that the discipline enforced by not having a UI will be worth it, flushing out hidden config into version-controlled text files. Sadly not all the best containers are Windows only.


Being able to put an application in a lightweight container might also mean that we are able to migrate more smoothly to the cloud, or operate a hybrid across multiple providers or partly on premise.


It may also make operating open-source systems a little easier; rather than “installing” the software, you just download the container with the application already inside it.



Security

To enable cloud we need to fix security for data in flight. Within our firewalled boundary we have not given that much concern, though we do secure endpoints well. To take it to the next level, we may also need freedom from the domain and the full trust that it gives us. Machines that operate outside the domain may not be able to call on it for authentication. Or we could move our authentication to a cloud service, either one we own or something else. That might mean that we are handling security in a platform-agnostic way, which further frees us to consider other operating systems and other types of container.


Node.js doesn’t run on bare metal just yet does it? But there are JVMs that do!


Data sensitivity

Can we handle these issues with obfuscation and encryption? Obviously whatever we can encrypt, someone else can decrypt, especially in places where the NSA has got into the hardware; there is no defence against that, whether it is legal or not! But really, are the NSA interested? And are we any safer running on our own hardware, which is inevitably less well protected than a first-class hosting facility? Our physical and logical security is decent, and certainly fit for purpose given the risks and threats we come up against.


We can certainly address these problems with test data sets, or extremely old data which are suitable for testing. Some teams only trust real data for diagnosing bugs. Each team might use similar patterns to obfuscate, but each schema would need addressing one table at a time. Divide and conquer is probably a technique that would help us improve security even in an all on-premise design, separating high and low sensitivity elements.


Legal, social and cultural issues

The big one. We have to ensure that for the future of our firm we are not making a wrong turn here. There are true risks that are potentially very serious. We must ensure that if political and legal circumstances change, our decisions are reversible. We also need to sell the upsides of this, which are vague but I think that they are very real, and decisive. The risks feel more tangible, but if we can’t adapt to cloud we inevitably miss out as the rest of the world gets comfortable with it, and providers offer more and more in the cloud and less on-premise.


Baby steps from on premise to multi-tenant hosted cloud solutions

IMHO, the ultimate is SaaS, not PaaS or IaaS. We don’t want a VM running on someone else’s server, though that’s better than doing it in-house. Of course many concerns apply: ensuring the SaaS is an open platform that can be integrated and modified, and that can run our custom code where we need it to. We also need to ensure that what we are building is aligned to the needs of the business, or cuts across an aspect that every part of the business needs.


Maybe we can move through like this:


  • (currently) Co-located users and servers, sites connected by expensive WAN
  • Co-located servers, users in another site, connected by WAN, data centre contains our hardware
  • Co-located server, running on someone else’s hardware, guaranteed provision rates (i.e., physical cores per VM), users connected by someone else’s WAN
  • Co-located headless server, running on multi-tenant hardware, users connected by someone else’s WAN
  • Containers running on multi-tenant elastic cloud, we don’t know where, connected by someone else’s WAN
  • Containers running on many clouds (debatable), connected by internet


How long will this take? Being cynical I could say that it will take one generation of managers. When the staff that are joining now are running the firm, the journey to the dark side will be complete. More prosaically: 5 years.



Lots of work?

Being realistic, this is a lot of effort. I believe it is the future and that the best solutions will involve at least part of this model.


Conway’s law revisited

You may have heard of Conway’s law. It is a hypothesis that when an organisation makes a piece of software, the architecture of the software mirrors the org structure.

However, I just looked at the original paper rather than the Wikipedia article, and I found it worth a read. He presents not only the main theory on structure, but also other observations about how we break a large problem down and why we break it down. There are also some great points on how managers avoiding risk/blame makes these kinds of consequences inevitable. Trend number 1: people aren’t getting any smarter. Basically, whichever comes first, the software or the organisation, this kind of thing is almost inevitable.

You have to spend a little time getting over the constant military-industrial complex references (you can just feel the cold war hanging over you as you read it). Remember this is 1968, when people scarcely even recognised programming as an activity. I quote:

    “The term interface, which is becoming popular among systems people, refers to the inter-subsystem communication path or branch represented by a line in Fig. 1. Alternatively, the interface is the plug or flange by which the path coming out of one node couples to the path coming out of another node.”

It’s been a while since I heard people explain what an interface is. Flange.

De-pivot to pivot

If you’ve ever wanted to plot some data that isn’t pivoted in quite the right way, but you can’t figure out how to transform it…

Hadley Wickham (who writes a lot of great R libraries) wrote this article called Tidy Data. It’s the source of many great 5 dollar words. Even if you don’t use R, it’s worth a read.

One I particularly like is what you do when you have data that isn’t pivoted in the right way, but you need to de-pivot it in order to pivot again.

This data:

Year  Joiners  Leavers  Transfer in  Transfer out
2012       10        4  …            …
2013       15        7  …            …
2014        6        6  …            …
2015       10        8  …            …

Is already in a summary form. If I were working in SQL I might do:

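-- headcount is a placeholder table name
Select year, Joiners as value, 'joiners' as type from headcount
union all
Select year, Leavers as value, 'leavers' as type from headcount
union all
Select year, TransferIn as value, 'TransferIn' as type from headcount
union all
Select year, TransferOut as value, 'TransferOut' as type from headcount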


Wickham’s word for this is “melt” and he has a function for it in the reshape package.

In R (I save my data as CSV, yuck!):

library(reshape)

before <- read.csv("C:\\temp\\head-before.csv")

melt(before, id = c("Year"))


and the result:

  Year variable value
1 2012  Joiners  10.0
2 2013  Joiners  15.0
3 2014  Joiners   6.0
4 2015  Joiners  10.0
5 2012  Leavers   4.0
6 2013  Leavers   7.0
7 2014  Leavers   6.0
8 2015  Leavers   8.0




Damn hard to do in Excel.
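
For completeness, here is the “pivot again” half, as a rough sketch in base R so it runs without the reshape package. The numbers are the Joiners and Leavers values from the melted output above; the transfer columns are left out. The melt step is done by hand here with rep and unlist, which is my stand-in for melt, not how the package does it.

```r
# Values copied from the melted output shown above; transfer columns left out.
before <- data.frame(
  Year    = 2012:2015,
  Joiners = c(10, 15, 6, 10),
  Leavers = c(4, 7, 6, 8)
)

# De-pivot ("melt"): one row per Year/variable pair.
measures <- c("Joiners", "Leavers")
long <- data.frame(
  Year     = rep(before$Year, times = length(measures)),
  variable = rep(measures, each = nrow(before)),
  value    = unlist(before[measures], use.names = FALSE)
)

# Pivot again, this time with the years across the top.
wide <- xtabs(value ~ variable + Year, data = long)
print(wide)
```

Once the data is in the long form, choosing a different pivot is just a different formula in xtabs, which is the whole point of melting first.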