Ungroup in PowerShell

PowerShell often forms part of my data-hacking arsenal. When I get a text file that’s so messy it’s not fit to be CSV yet, I PoSh on it.

So, you’ve imported some data into PowerShell and you’ve got a list of stuff, but there are some keys that have more than one row.

You can look for the duplicates with:

$myData | group key

The output of the Group-Object operation is something like a dictionary, with a summary count for each key, so you can easily filter for duplicates:

$myData | group key |? {$_.count -gt 1}

But then you want the original rows back, not this dictionary.

Do this:

$myData | group key |? {$_.count -gt 1} | select -ExpandProperty group

This also works for any nested object that is returned in a property of a select.

There is also an “ungroup” operation in the R package dplyr.
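
For comparison, a rough dplyr equivalent of the pipeline above might look like this. It is only a sketch; my_data and key stand in for your own data frame and column:

library(dplyr)

# Roughly the PowerShell pipeline above, in dplyr: group by key,
# keep only keys that have more than one row, then ungroup to get
# back to plain rows. (my_data and key are stand-ins, not real data.)
my_data %>%
  group_by(key) %>%
  filter(n() > 1) %>%
  ungroup()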

Enterprise tech meetup

I went to a tech meetup with the thrilling name “London Enterprise Tech Meetup”.


[photo: who says tech has a diversity problem?]

The audience was mostly as smart as the speakers, so there was some good debate, some good points, and some (as one speaker put it) “mini presentations disguised as questions”. My mini takeaways:

  • People doing similar things are easiest to understand, and they have similar problems. The downside, of course, is that listening to them is a bit of an echo chamber: you hear your own problems reflected back
  • It provided a data point on what “big” is; a few comments suggested that above a billion data points even the data platforms struggle and you need something really bespoke.
  • Julia as a programming / data science system has some interesting stuff that might be worth a look
  • Hadoop got a real bashing, which was news to me
  • Natural language processing can be integrated well into human-led research:
    • The example given by Pontus Stenetorp was drug interactions.
    • Say there are 8000 types of drug, each with a detailed document describing how it works
    • There are unknown interactions between these drugs that have not been discovered
    • Human doctors would have to read and understand all 8000 to be effective, or create their own mini databases with their expert knowledge.
    • NLP systems can extract words/concepts within this limited context and look for correlations
    • We don’t expect the system to understand drugs, it just needs to read how the drug works (i.e., which organs or chemical pathways it uses) and look for connections
    • The system then presents the human doctors with a much shorter list of documents to read

So was it worth it? It’s not about new ideas and changing your mind. It’s more about synthesizing your current knowledge with new elements, rather than supplanting what you know. There were a few new elements, so yes, worthwhile. Other than Pontus Stenetorp’s, the presentations were predictably awful, but I can google faster than they can speak. It’s just an indicator: here’s an interesting thing, look it up.

As a networking event I didn’t find it effective, but I didn’t stay long enough to make that happen. I think to get the best out of it I’d have to work a lot harder at saying hello and dragging it out of them.

R and geographic maps

Let’s start with code and a picture:

library(maps)

map("county")

[picture: plot of US county boundaries]

If you have a bit of population data from the US census, you can do a quick population map easily enough:

(My own code has the oddities of my data structures baked into it, so here’s a generic sketch of the idea instead.)
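
This is an untested sketch: it assumes a hypothetical CSV with one row per county, a county column in the maps naming style (e.g. "texas,harris"), and a population column.

library(maps)

# Hypothetical input: one row per county, with a "county" column like
# "texas,harris" and a "population" column (real census data will differ).
pop <- read.csv("county_population.csv")

# Bucket the populations and pick a grey shade per bucket
pop$bucket <- cut(pop$population, breaks = 5, labels = FALSE)
shades <- grey(seq(0.9, 0.2, length.out = 5))

# match.map lines up each polygon in the "county" database with a row of pop
idx <- match.map("county", pop$county)

# Fill each county polygon with the shade for its population bucket
map("county", fill = TRUE, col = shades[pop$bucket[idx]])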

What’s nice is that these maps aren’t special: they are simply a vast array of points that define the boundaries, plotted by brute force!
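
You can see this for yourself by asking map() for the data instead of the plot, something like:

m <- map("county", plot = FALSE)
str(m)         # a list: x/y coordinate vectors (NA-separated polygons),
               # a bounding box, and a vector of "state,county" names
head(m$names)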

Improving data by patching with other data sources

I also had some data on locations, and wanted to link them to a county. In 3% of cases there was no county, but we did have a latitude and longitude. A few googles indicated that the lat/long was correct even for the blanks (right on the roof, actually…). So let’s patch the data together with a bit of R to fill in the blanks.

What do you mean “in”?

Obvious cases of a point being inside an area are obvious:

[picture: a point clearly inside a region]

Less obvious cases are things like lakes and offshore islands.

Things really do get this bad in geography, with the wonderful concept of exclaves, or an island in a lake on an island in a lake on an island.

This probably means there are a whole lot of things to consider when you detect whether a point is inside a polygon, like making sure you are using the same projection and reference points for your lat/longs.

Library to the rescue!

Of course there are libraries to do this, but they tend to assume that you know what you are doing!

So I googled a lot until I found the examples that were closest to my problem and simplest:

https://gis.stackexchange.com/questions/133625/checking-if-points-fall-within-polygon-shapefile

https://stackoverflow.com/questions/26062280/converting-a-map-object-to-a-spatialpolygon-object?noredirect=1&lq=1

but basically:

  1. Found a library that converts the data structure from the maps library into a list of polygons
    1. (polygons are not a first-class citizen in the R language, but they have a stable implementation supported by many libraries)
  2. Found the glue code to stitch an arbitrary list of lat/longs into a structure that the SpatialPointsDataFrame constructor can accept
  3. Then we simply use the over method, which does all the hard work

When I checked my end results, things looked good! Comparing against a test set where I DID know the county in the original data, my lat/long lookup was correct in 99% of cases. The 1% left over turned out, on Google Maps, to be within 100m of a county boundary, which feels explicable… but bad enough that I don’t want to use this patching method unless I have to.

So I think I improved my data from being 3% bad to 0.03% bad (the 1% error rate only applies to the 3% of rows I had to patch).

Code looks like this but YMMV:

(as this was hacked together from a copy off the internet, it has some things in it that are not part of the tidyverse, which is my recommended style)

# https://stackoverflow.com/questions/8751497/latitude-longitude-coordinates-to-state-code-in-r
# tiny change from the original code to do it for county as well

library(maps)      # county outlines
library(maptools)  # map2SpatialPolygons
library(sp)        # CRS, SpatialPointsDataFrame, over

latlong2state <- function(pointsDF) {
  # Prepare a SpatialPolygons object with one entry per county
  geo_states <- map("county", fill=TRUE, col="transparent", plot=FALSE)

  # Strip the ":piece" suffix from multi-part counties so each county gets one ID
  IDs <- sapply(strsplit(geo_states$names, ":"), function(x) x[1])
  states_sp <- map2SpatialPolygons(geo_states, IDs=IDs,
                                   proj4string=CRS("+proj=longlat +datum=WGS84"))

  # Convert pointsDF to a SpatialPointsDataFrame in the same projection
  pointsSP <- SpatialPointsDataFrame(data.frame(pointsDF$longitude, pointsDF$latitude),
                                     data=pointsDF,
                                     proj4string=CRS("+proj=longlat +datum=WGS84"))

  # Use 'over' to get _indices_ of the Polygons object containing each point
  indices <- over(pointsSP, states_sp)

  # The polygon IDs are "state,county"; split them into the two lookup columns
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)
  pointsDF$looked_up_bare_state_name <- sapply(strsplit(stateNames[indices], ","), function(x) x[1])
  pointsDF$looked_up_bare_county_name <- sapply(strsplit(stateNames[indices], ","), function(x) x[2])
  return(pointsDF)
}

 

# check it with a test set where the original data already has state & county
library(dplyr)   # for %>%, select, slice, filter

test_county_checker <- latlong2state(stores %>%
  select(year, store_number, store_name, latitude, longitude,
         original_bare_state_name, original_bare_county_name) %>%
  slice(1:5000))

# rows where both the looked-up state and the looked-up county disagree
# with what the original data said
test_county_checker %>% filter(original_bare_county_name != looked_up_bare_county_name,
                               original_bare_state_name != looked_up_bare_state_name)