Questions Nobody Asked..

My hat contains a hidden message

2024-03-17T16:49:00+00:00

I shaved my head.

My hair’s been getting thinner for a while now, so I thought I’d try going all in.

I’m still getting used to it.

But one thing I did expect was that my head would feel colder now. So I needed a hat.

I looked over my saved crochet patterns and found this one - Widcombe C2C crochet hat

It uses a crochet technique called ‘corner-to-corner’ (c2c) - effectively the fabric is made up of squares, worked in diagonal rows.

The design as written is okay, but a little generic. I wanted something more meaningful.

The c2c effectively gives us a grid, and in the original design the middle band is a repeating pattern of 6x10 square blocks.

So the question was, what could I do with this?

Well a row of 6 squares gives us 6 bits, allowing us to represent 64 values - more than enough to encode characters of the (latin) alphabet.

Meanwhile we have 10 rows, and it just so happens that my name contains 10 letters - CHRIS OATES

In fact, we only need 5 bits to represent the alphabet, so we have one spare. I considered leaving it blank, or using it to represent uppercase/lowercase. But none of those looked very good, so I just used repeating blocks of 5x10.
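As a rough sketch of the idea, each letter becomes one row of 5 squares. The post doesn't say which letter-to-number mapping the hat uses or which end of the row holds the high bit, so A=1 ... Z=26 and most-significant-bit-first are assumptions here, and `letter_to_row` is a made-up helper name.

```python
# Assumption: A=1 ... Z=26, most significant bit on the left.
# The hat itself may use a different mapping or bit order.
def letter_to_row(letter: str) -> str:
    value = ord(letter.upper()) - ord("A") + 1
    return format(value, "05b")


name = "CHRISOATES"
rows = [letter_to_row(ch) for ch in name]

for ch, row in zip(name, rows):
    # '#' for a 1 square, '.' for a 0 square
    print(ch, row.replace("1", "#").replace("0", "."))
```

Ten letters give ten rows of five squares, which is where the 5x10 block comes from.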

The complete design looks like this

And here’s the completed hat

Following the pattern diagonally was… fun.

But at least my head isn’t cold anymore.

Chris.

Hidden messages and an optimal ternary scarf

2024-01-20T18:22:00+00:00

A while ago, I found a pattern for a tunisian crochet scarf that I kinda liked. The design was straightforward - just three solid blocks; I picked 1 blue, 1 grey, 1 cream.

I got maybe 20 rows into it, then got bored. I ended up frogging it.

But now I have these three balls of super-soft aran yarn, and what to do with them?

I wondered if I could make the pattern more interesting by cycling - blue, grey, cream, blue, grey, … - but that’s not much more exciting.

I had the thought that I could completely randomise the colours - for each row, pick one of the three colours at random (roll a dice, even).

I like the idea of encoding information in arts and crafts. In my temperature blanket I encoded temperatures as different colours, and months as white and coloured rings representing binary numbers.

And here I have 3 colours - a base 3 encoding - and, conveniently, three base-3 bits (trits?) can represent 27 values - enough for all 26 letters, plus a whitespace character.

A simple plan

So suppose we want to make a scarf.

We have three colours, which we might map as

0 -> cream
1 -> blue
2 -> grey

The most simple mapping of characters to numbers is to make A=1, B=2, C=3, etc. saving 0 for the whitespace character.

As mentioned, we’re representing 27 characters so we need 3 rows per character.

If we take my name, CHRIS OATES, and translate it to ternary as above, we get

010 022 200 100 201 000 120 001 202 012 201

Or in colours

Not too bad. But doesn’t it feel a little… unbalanced?

Ignoring the space character (000) we have 14 cream, 7 blue, and 9 grey.

Cream (zeroes) are way over-represented.

Every possible 3 bit value is represented in the simple encoding, so in a random sequence of letters, we would expect all 3 colours to appear with roughly equal frequency… except, English isn’t a random sequence of letters.
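The simple encoding and the colour tally can be checked with a few lines of Python. This is only a sketch of the scheme described above (space=0, A=1, ..., Z=26, three ternary digits per character); the function names are mine.

```python
from collections import Counter


def to_ternary(value: int, width: int = 3) -> str:
    """Write value in base 3, padded to a fixed width."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 3)
        digits.append(str(digit))
    return "".join(reversed(digits))


def encode_simple(text: str) -> list[str]:
    """space -> 0, A -> 1, ..., Z -> 26, each as a 3 digit ternary code."""
    codes = []
    for ch in text.upper():
        value = 0 if ch == " " else ord(ch) - ord("A") + 1
        codes.append(to_ternary(value))
    return codes


codes = encode_simple("CHRIS OATES")
print(" ".join(codes))

# Tally the digits, skipping the space block as the text does.
counts = Counter("".join(code for code in codes if code != "000"))
print(counts["0"], counts["1"], counts["2"])  # cream, blue, grey
```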

Popularity contest

Some letters appear in the English language a lot more than others.

So maybe we want to account for this in our encoding to get a more even balance of colours.

For example, if we assigned E as 222 then 2 would end up very much over-represented. It would be better to assign E a value with all three digits, like 012

The 3 bit ternary numbers can be split into 4 broad groups

6 are one of each digit, e.g. 012
3 are all the same digit (of which 000 is our space)
6 are two the same, split up, e.g. 010
12 are two the same together; 6 left e.g. 001, and 6 right e.g. 100
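The group sizes are easy to double check by enumerating all 27 codes. A small sketch, with group labels of my own choosing:

```python
from collections import Counter
from itertools import product


def classify(code: str) -> str:
    first, _, last = code
    distinct = len(set(code))
    if distinct == 3:
        return "one of each"
    if distinct == 1:
        return "all same"
    # Exactly two digits match: either they sit at the ends (split up)
    # or they are next to each other (together).
    return "two same, split" if first == last else "two same, together"


codes = ["".join(digits) for digits in product("012", repeat=3)]
groups = Counter(classify(code) for code in codes)

for name, count in groups.items():
    print(f"{count:2d}  {name}")
```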

So the obvious thing is to assign the 6 one-of-each codes to the six most common letters (E, T, A, S, …), and likewise the 2 all-the-same codes to the two least common letters (Q, Z)

The six two-split codes can be assigned to the 7th-12th most common letters, on the basis that colours clumped together are less pleasing.

How the actual codes within those groups are assigned is largely arbitrary (more on that later).

Here’s one possible encoding ¹

e -> 012   t -> 120   i -> 201
a -> 021   n -> 102   m -> 210

s -> 010   u -> 121   r -> 202
w -> 020   d -> 101   k -> 212

g -> 011   o -> 122   h -> 200
v -> 022   f -> 100   l -> 211

p -> 001   j -> 110   b -> 221
x -> 002   c -> 112   y -> 220

z -> 111   q -> 222

And here’s my name again, using this encoding

112 200 202 201 010 000 122 021 120 012 010

and in colour

Hmm… this looks even less balanced than before.

However, this time we have 11 cream, 9 blue, and 10 grey, not counting the space block. Almost perfectly balanced.

So what’s the issue? The problem this time is that colours aren’t well distributed. We have lots of repeated digits (10, including the white space block), and the blues are biased towards the right.
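Counting the repeats by hand is error prone, so here is a small sketch that counts adjacent equal digits across the whole sequence. The helper name `adjacent_repeats` is just for illustration.

```python
def adjacent_repeats(codes: str) -> int:
    """Count neighbouring digits that are the same, ignoring the spaces between codes."""
    digits = codes.replace(" ", "")
    return sum(a == b for a, b in zip(digits, digits[1:]))


frequency_name = "112 200 202 201 010 000 122 021 120 012 010"
print(adjacent_repeats(frequency_name))  # 10
```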

Going wider

I said in the previous section that how the codes are assigned to letters within each group is arbitrary.

We can do better than that.

Suppose we look at pairs of letters - can we assign codes so as to maintain ‘evenness’ across pairs? And will that make the encoding more even across whole texts? ²

For example, in the above we assigned Q=222. Now Q is (almost) always followed by a U, so it would be foolish to assign e.g. U=220 as we would then get a run of five 2s in a row 222 220

It would be better to give U a code with no 2s, say 101 -> 222 101

We should also take into account the white space character. I want to keep white space pinned as 000, so letters which tend to appear at the end of words should not end with 0s, and letters which tend to appear at the start of words should not start with 0s.

For example, if we assigned Y=100 and D=001, we might suddenly discover a run of 7 zeros - 100 000 001

A more perfect encoding

To figure out the ‘best’ encoding, we need a way to quantify or ‘score’ each possible encoding.

We did this implicitly, above, for the 3 bit codes - i.e. the all-different codes are higher scoring than the all-same codes because they have a higher variety of digits/colours.

Likewise, the two-same-split codes are higher value than 2-same-together codes because the colours are more spread out.

We just need to extend that logic to pairs of 3 bit codes (or equivalently, 6 bit codes) and come up with an empirical ‘score’ function, like

score(012, 102) = 1
score(111, 111) = 0

For scoring the ‘spread’ of digits, we can

look at each pair of bits
add 1 if different, else add 0 if same
divide by the total number of pairs (5)

e.g.

220 021 -> 22, 20, 00, 02, 21 -> 0 + 1 + 0 + 1 + 1 = 3 -> 0.6

For ‘balance’ we can use entropy with a base of 3

- n0/6 * log3(n0/6) - n1/6 * log3(n1/6) - n2/6 * log3(n2/6)

where n0 is the number of 0 digits, etc. ³

For example, in the worst case scenario, where all the digits are the same, we get

-6/6 * log3(6/6) - 0 - 0 = -log3(1) = 0

and in the best case, with two of each digit, we get

3 * ( -2/6 * log3(2/6) ) = (3 * -1/3) * log3(1/3) = -1 * -1 = 1

So now we can smash (multiply) the balance score together with the spread score to get a combined score for a given pair of codes, e.g.

s(121, 102) = (4/5) * (log3(6)/6 + log3(2)/2 + log3(3)/3)
            = 0.8 * 0.921
            = 0.736
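Putting the two parts together, the pair score could be sketched like this. It follows the spread and balance definitions above; the function names are my own, not necessarily those in the linked code.

```python
from collections import Counter
from math import log


def spread(first: str, second: str) -> float:
    """Fraction of adjacent digit pairs that differ (5 pairs for 6 digits)."""
    digits = first + second
    pairs = list(zip(digits, digits[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)


def balance(first: str, second: str) -> float:
    """Base 3 entropy of the digit counts; 0 = all the same, 1 = two of each."""
    digits = first + second
    total = len(digits)
    return sum(-(n / total) * log(n / total, 3) for n in Counter(digits).values())


def score(first: str, second: str) -> float:
    return spread(first, second) * balance(first, second)


print(spread("220", "021"))           # 0.6
print(round(score("121", "102"), 3))  # 0.736
```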

Now we have to take into account how those codes are assigned to actual letter pairs. That is, it’s better to assign high scoring codes to common pairs (EA) than to uncommon pairs (ZW)

Similar to how we can calculate the frequency of individual letters in the English language - by counting occurrences in a text - we can also calculate the frequency of letter pairs in English.

We can then use these frequencies as a weighting, by multiplying the letter pair frequency by the score of the assigned codes.

So if we come up with a possible mapping, we can calculate the score for each possible letter pair and take the sum, giving us the total score for that mapping

total score = sum [ f(c_i, c_j) * s(c_i, c_j) ]

For example, in the A=1 encoding, we would calculate

total score = f(a, a) * s(001, 001) + f(a, b) * s(001, 002) + ... + f(z, z) * s(222, 222)

This is our metric for comparing and finding the best encoding.
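Here is a minimal sketch of that weighted sum. How the real code cleans up the text isn't described in this post, so collapsing everything that isn't A-Z into a single space is an assumption, and the pangram is only a stand-in for a proper corpus.

```python
from collections import Counter
from math import log

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# The A=1 mapping: space -> 000, A -> 001, ..., Z -> 222
SIMPLE = {ch: f"{i // 9}{i // 3 % 3}{i % 3}" for i, ch in enumerate(ALPHABET)}


def pair_score(first: str, second: str) -> float:
    digits = first + second
    spread = sum(a != b for a, b in zip(digits, digits[1:])) / 5
    balance = sum(-(n / 6) * log(n / 6, 3) for n in Counter(digits).values())
    return spread * balance


def bigram_frequencies(text: str) -> dict:
    # Assumption: anything that isn't A-Z becomes a space, and runs of
    # spaces collapse into one.
    cleaned = "".join(ch if ch in ALPHABET else " " for ch in text.upper())
    cleaned = " ".join(cleaned.split())
    counts = Counter(zip(cleaned, cleaned[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def total_score(mapping: dict, freqs: dict) -> float:
    return sum(f * pair_score(mapping[a], mapping[b]) for (a, b), f in freqs.items())


freqs = bigram_frequencies("The quick brown fox jumps over the lazy dog.")
print(round(total_score(SIMPLE, freqs), 5))
```

Since the frequencies sum to 1 and each pair score is between 0 and 1, the total is also between 0 and 1.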

The best enough

What now?

Here’s the tricky part - we have a method of scoring any given encoding, but how do we find the ‘best’ one?

The problem is, there are 26! (factorial) possible ways of assigning codes to characters - 4 x 10^26 - that’s more than something something in the universe! 🤯

Suffice to say, finding an optimal solution ⁴ is not viable.

Instead, we can look at a heuristic solution.

The approach that worked the best for me was

1. generate a random starting mapping
2. for each character in the mapping, find the swap which produces the best score
3. repeat 2 until the score doesn’t increase anymore
4. repeat 1+2 multiple times and pick the best scoring result
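One possible reading of those steps, as a sketch rather than the linked code: a random-restart hill climb where each pass tries, for every letter, all swaps with the other letters and keeps the best one if it improves the score. The names, the stand-in sample text and the restart count are all assumptions.

```python
import random
from collections import Counter
from functools import lru_cache
from math import log

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODES = [f"{v // 9}{v // 3 % 3}{v % 3}" for v in range(1, 27)]  # 001 .. 222


@lru_cache(maxsize=None)
def pair_score(first, second):
    digits = first + second
    spread = sum(a != b for a, b in zip(digits, digits[1:])) / 5
    balance = sum(-(n / 6) * log(n / 6, 3) for n in Counter(digits).values())
    return spread * balance


def bigram_frequencies(text):
    # Assumption: non-letters collapse into single spaces.
    cleaned = " ".join("".join(c if c in LETTERS else " " for c in text.upper()).split())
    counts = Counter(zip(cleaned, cleaned[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def total_score(mapping, freqs):
    return sum(f * pair_score(mapping[a], mapping[b]) for (a, b), f in freqs.items())


def random_mapping(rng):
    codes = CODES[:]
    rng.shuffle(codes)
    mapping = dict(zip(LETTERS, codes))
    mapping[" "] = "000"  # the space stays pinned
    return mapping


def swapped(mapping, a, b):
    new = dict(mapping)
    new[a], new[b] = mapping[b], mapping[a]
    return new


def climb(mapping, freqs):
    best = total_score(mapping, freqs)
    improved = True
    while improved:  # keep going until a full pass changes nothing
        improved = False
        for a in LETTERS:  # best swap for each letter in turn
            candidates = [swapped(mapping, a, b) for b in LETTERS if b != a]
            top = max(candidates, key=lambda m: total_score(m, freqs))
            top_score = total_score(top, freqs)
            if top_score > best + 1e-12:
                mapping, best, improved = top, top_score, True
    return mapping, best


def search(freqs, restarts=3, seed=0):
    rng = random.Random(seed)
    runs = (climb(random_mapping(rng), freqs) for _ in range(restarts))
    return max(runs, key=lambda run: run[1])


freqs = bigram_frequencies("the quick brown fox jumps over the lazy dog")
mapping, best = search(freqs)
print(round(best, 5))
```

With a tiny sample like this the result says nothing about English; it only shows the shape of the search.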

You can find my code here. I won’t go into detail about the code in this post, as it is already quite long. I have provided code comments.

For my experiments, I grabbed a copy of Frankenstein off Project Gutenberg to generate the bigram frequencies. This gives us 426_160 letter pairs

As the baseline, the scores for the mappings we already discussed are

A=1 -> 0.51581
letter frequency-based -> 0.58679 ⁵

I tried multiple runs and the best score I got was 0.66869

Here’s what that mapping looks like

a -> 210   b -> 122   c -> 022
d -> 121   e -> 021   f -> 221
g -> 011   h -> 012   i -> 101
j -> 100   k -> 001   l -> 020
m -> 220   n -> 202   o -> 120
p -> 110   q -> 222   r -> 201
s -> 212   t -> 102   u -> 010
v -> 002   w -> 211   x -> 200
y -> 112   z -> 111

I tried running it a few times, and it consistently turned up this mapping as the best, which suggests it’s the optimum. Or perhaps it’s just the best solution that this method can generate; maybe a different optimisation method could generate an even better solution. ⁶

Here’s what my name looks like with the optimised mapping

022 012 201 101 212 000 120 210 102 021 212

and here it is in colours

Doesn’t that look so much more balanced?

Here we have 11 cream, 10 blue, and 12 grey, this time including the space block since it was part of the optimisation. It’s almost perfectly balanced, and this time it also has a much better spread of colours; we have only 5 repeated digits (of which 2 are in the space block)

Spinning out

This idea can be extended to different bases. For example, with 4 colours and 3 rows per character you get 64 possibilities - enough for upper and lower case, 10 digits, 1 space, and 1 left over for a period (or exclamation mark!)

The principle remains the same, you just have more character combinations to deal with.

Similarly, I’ve been talking about a scarf, but this would work just as well for a blanket. For, what is a blanket if not a really wide scarf?

Alternatively, we could construct a blanket of squares - similar to my temperature blanket - with one block per character comprised of 3 bands of colour representing the 3 bits.

This can make for a much more striking design

Tho in the squares case, the criteria for what is optimal are slightly different.

For example, in the above the second and third to last squares are 202 012 which has no repeated digits

But with squares

the last digits ⁷ - the outer rings of 2s - are adjacent.

And that’s only thinking one-dimensionally. For a blanket, we’re going to have a grid, with the text split across multiple rows. How do we account for that?

And notice that in a square the outer ring would use a lot more yarn than the inner ring. So maybe we’d like to ensure that each digit is represented in each position roughly equally - that we use roughly equal amounts of each colour.

Maybe I’ll come back to this idea in a future blog…

Unravelled

At this point, you’re expecting to see a completed scarf?

The truth is, I already used some of the yarn I mentioned to make a sock. Yes, just the one sock. And maybe someday I’ll get around to making it a matching pair.

And maybe someday I’ll even make a ternary scarf.

But first I need to figure out what message is worth wearing around one’s neck…

Chris.

[was this all just a waste of time? you must be new here]

Epilogue: Structure and interpretation of scarf

Suppose you’re presented with a scarf. It’s made up of stripes of three different colours, but the patterns seem… odd. Random? But why would anyone make a random patterned scarf?

Given that there are three colours, you think maybe it’s a base 3 encoding, and you intuit that 3 rows gives you 27 sequences - enough for 26 letters and a space.

You might assume a basic encoding - A=1, B=2, etc. But you don’t know how the colours map to bits 0, 1, 2

But then, there are only 6 possible arrangements, so you can just try them all, and see what yields a meaningful message. Perhaps you notice a particular colour regularly appears in blocks of 3, and you think “maybe that’s a space character”. You call that colour 0 and now you only need to figure out which colour is 1 and which is 2 - two possibilities.
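The "just try them all" step could look something like this sketch. It assumes the basic encoding (space=0, A=1, ..., Z=26), that all three colours appear in the scarf, and that the rows are given as a list of colour names; all of the names are illustrative.

```python
from itertools import permutations


def decode(colours, colour_to_digit):
    digits = [colour_to_digit[c] for c in colours]
    text = []
    for i in range(0, len(digits) - 2, 3):
        value = digits[i] * 9 + digits[i + 1] * 3 + digits[i + 2]
        text.append(" " if value == 0 else chr(ord("A") + value - 1))
    return "".join(text)


def candidates(colours):
    """Decode the rows under each of the 6 possible colour -> digit assignments."""
    names = sorted(set(colours))
    for perm in permutations(range(3)):
        yield decode(colours, dict(zip(names, perm)))


# Build a test scarf from the A=1 example: 0 = cream, 1 = blue, 2 = grey
digits = "010022200100201000120001202012201"
rows = [["cream", "blue", "grey"][int(d)] for d in digits]

for guess in candidates(rows):
    print(guess)
```

Only one of the six guesses reads as a name; the rest are gibberish.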

It’s not trivial, but it’s possible. This feels ideal to me.

By comparison, you couldn’t infer the frequency-based or optimal encoding in the same way - in the first one, some of the assignments are arbitrary, and in the second the assignment is heuristic so even following the same procedure you might not reproduce the same mapping.

If the message (scarf) is long enough, you can ignore the base 3 aspect, treat each 3-row sequence as an arbitrary symbol and perform frequency analysis to figure out the letter mapping.

But we’re talking a Tom Baker length of scarf, at least.

Appendix: the example stitches

Each of the examples uses a different stitch/technique. Some may call this an unfair comparison, but it made things more interesting for me ;)

The first example, for the A=1 encoding, was done in tunisian crochet, using repeated tunisian simple stitch (TSS)

The second example, for the frequency encoding, was knitted in a basic stockinette - 1 row knit stitch + 1 row purl stitch.

In this case there are actually two rows per bit (1 knit + 1 purl). This was mostly so the colour change would always happen along the same edge, for my convenience.

The last example, for the optimal encoding, was done in regular crochet, using repeated half-treble (Htr) crochet stitches in UK notation, or half-double (Hdc) in US notation.

I didn’t do an example for the squares because that would have required cutting the yarn (and also I’m lazy).

Footnotes

1. For the common-ness of letters I copied Morse code, which isn’t strictly accurate to the English language. Tho it should be said that any frequency mapping is not going to be universally correct. It depends on the text it’s calculated from. Tho they do tend to align at the extremes. ↩
2. Naturally we can extend this to groups of 3 letters, groups of N letters, or whole words. But let’s not get carried away ;) ↩
3. Fun fact, even tho there are 27 * 27 = 729 possible pairs of 3 digit codes, if we ignore permutations, there are only 7 unique partitions of 6 digits into 3 types, and therefore only 7 entropies to calculate
```
600 -> 0
510 -> 0.410
420 -> 0.579
411 -> 0.790
330 -> 0.631
321 -> 0.921
222 -> 1
```
↩
4. there are at least two optimal solutions; for any given solution, we can swap the 1s with the 2s and get another mapping with the exact same score. We can’t do the same with 0s since we pinned the white space character as 000, otherwise there would be 6 equivalents for each solution. ↩
5. as mentioned in footnote 1, the frequency encoding in this blog is based on Morse code. An encoding based on measured letter frequencies scores 0.62144; better than Morse, but still less than ‘optimal’ ↩
6. This mapping was a close second
```
a -> 120   b -> 211   c -> 011
d -> 212   e -> 012   f -> 112
g -> 022   h -> 021   i -> 202
j -> 200   k -> 002   l -> 010
m -> 110   n -> 101   o -> 210
p -> 220   q -> 111   r -> 102
s -> 121   t -> 201   u -> 020
v -> 001   w -> 122   x -> 100
y -> 221   z -> 222
```
Its score is 4 x 10^-16 less than the ‘optimal’. But notice, if you swap all the 1s for 2s and vice versa in the optimal mapping you get this mapping! This is actually expected, per footnote 4. The difference in scores is probably a floating point rounding quirk. ↩
7. I think you would call that.. little-endian? I could never remember which is which ↩

Temperature Blanket - Completed

2024-01-10T18:14:00+00:00

A temperature blanket is a crochet or knitting project where one makes a blanket over the course of a year, doing a piece a day with the colours based on each day’s temperature.

For the full background and design of my blanket, see the previous blog post

God laughs

When we last left our temperature blanket, it was the end of July, and I said

we haven’t made it into the oranges/reds […] it would be nice to see those colours incorporated

Turns out I got my wish, but when I least expected it.

Going back to the original layout

the plan was that when I got to the start of September, I would move to the top right, following a truncated z-ordered curve.

But then, come the start of September, we had an unexpected heatwave - including the hottest day of the year! (22.2C)

Following the original plan would have put these yellows and oranges next to the blues of March/April, which felt to me like it wouldn’t look so good.

Back to the old drawing board

The most obvious redesign - instead of shifting to top right, we simply continue downward, but otherwise adhering to the z-order layout.

In retrospect, this feels much more natural.

Doing this, the completed blanket becomes 16x24 squares. This is slightly longer and thinner than the original design, and comes out to 384 squares total.

The original plan had 365 days + 12 months = 377, with 1 spare to represent the year (378 total). In the new plan, I was left with 7 spare squares.

7 is a bit of an ‘odd’ number, so I decided to use one to ‘sign’ the piece, and then use the other 6 as a block for representing the year as a whole.

The signature square is straightforward - it’s my initials, CO. I chose a shade of grey so that it would be cohesive with the overall design, without standing out or being mistaken for any of the other colours used.

The year marker is more interesting. The original thought was to represent the year ‘2023’ in a similar way to the month markers, which are the month numbers in binary.

I hit on the fact that 2023 in hexadecimal is 7e7 which is conveniently 3 digits (and I have 6 squares to fill). It’s also a palindrome, which makes for a nice, symmetrical pattern.

Each square represents 2 bits - an outer ring and the centre, plus a separating ring which is always white. And each vertical pair together represent one hex digit (nibble?)

The bits are read outside-in - so for example, top left is white outer and solid centre = 01, bottom left is solid outer and centre = 11; so all together the pair is 0111 or 7 in hex.
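For the curious, the layout of the year block can be worked out in a few lines. This is a sketch of the scheme as described: each hex digit becomes a vertical pair of squares, the top square holding the high two bits and the bottom the low two, with each 2-bit string read as (outer ring, centre). `year_squares` is a made-up name.

```python
def year_squares(year: int):
    """Return one (top square, bottom square) pair of 2-bit strings per hex digit."""
    columns = []
    for hex_digit in format(year, "x"):
        bits = format(int(hex_digit, 16), "04b")
        columns.append((bits[:2], bits[2:]))
    return columns


print(format(2023, "x"))  # 7e7
for top, bottom in year_squares(2023):
    print(top, bottom)
```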

The complete pattern also looks kind of like a smiley face ° □ °

Naturally, the colour represents the average temperature across the whole year (9.9C).

I also broke the z-ordering a little - the December marker (binary of 12) looks too similar to the 10 year square, so I wanted to move it away from the year block so as not to confuse the design.

(If I’d thought of it sooner, I would have put the signature as the bottom-left square of the complete blanket, in the October region)

Wrapped up

Without further rambling, here’s the completed blanket

(rotated 90 degrees, January at top right)

The completed blanket is ~118x170cm

Overall, it wasn’t too difficult on a technical level, but boy did it take a lot of time. Including joining and weaving in ends it took ~ 25mins per square, which comes out at ~157 hours (!) total, or ~3 hours a week.

As for the scarf I mentioned in the previous post; well… the blanket alone was a lot of effort. And I couldn’t muster the enthusiasm to keep working on it. Two year-long projects at once was a little ambitious, oh well.

Chris.

[Now I’ve got to figure out what to do with the damn thing…]

Advent of Code 2023 | jq

2024-01-02T17:52:00+00:00

Advent of Code 2022 was a bit of a miss for me. After getting all the stars in 2021, I didn’t have quite the same drive, and gave up after day 10. Besides which, I had something more compelling to do - making crochet Christmas ornaments.

I wasn’t sure I’d bother at all this year. I was trying to think if there was a way to spice it up - doing it in python is dull, I write python almost every day.

Then one day at work I was doing some API stuff on the command line with curl and jq, and it got me thinking…

jq

I’d wager most developers have heard of jq, the “lightweight and flexible command-line JSON processor”.

Odds are, if you’ve used it you’ve probably not done anything more exotic than picking out fields

curl localhost:8000/auth/token/ -d '{"username": "foo", "password": "bar"}' | jq .access_token

That was most of what I did. Occasionally, I’d try to do something more complex, usually with liberal help from google.

Heck, I’ve seen coworkers use jq to pretty print json, then use grep and sed to grab fields.

Anyway, it seemed like it would be fun to try and do some AoC in jq

Before we get to specific puzzles, let’s look at…

General stuff

Not json

The first thing is, of course, that jq is for processing json.

On the other hand, AoC puzzles are rarely (never?) json formatted. Usually the input is lines of plain text.

A little googling tells us the way to deal with this

jq -Rn 'inputs | ...'

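The -R flag means “don’t try to parse this as json”. The -n flag stops jq from reading the first line itself as the initial input; without it, inputs only yields the lines after the first, so the first line goes missing.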

Then the inputs filter is how you actually get at the input - it’s a generator of lines. Alternatively, you can do [inputs] to get an array of lines

A slight variation is when the input isn’t one-per-line, but multi-line blocks, separated by a double newline.

In that case, we don’t want the input to be split line-wise. So instead we use the -s (slurp) flag to pull in the whole input. We then get the full input with input singular and do our own splitting

... | jq -Rns 'input | rtrimstr("\n") | split("\n\n") | ...'

rtrimstr gets rid of any trailing new line, otherwise we usually end up with an empty string somewhere down the line, which causes confusing errors

Scripts

If you’ve used jq, you’ve probably used it directly on the command line ... | jq '.[] | .count'

This is fine for simple stuff. But as things get more complex, especially when you start introducing functions, it makes more sense to put everything into a script file. The script file can then be passed to jq with the -f flag - ... | jq -f script.jq

But we can do one better - we can set a ‘shebang’

#!/usr/bin/env -S jq -f

You can use the path of jq directly rather than env if you prefer. -S on env allows us to pass flags to jq

Then it’s just a matter of setting the script to executable (chmod +x), then you can invoke the scripts directly

./script.jq < input.txt

This is what you’ll find in my solutions repo.

Functional programming

One thing I didn’t notice about jq until I started using it in earnest, is that it’s a functional programming language.

My experience with functional programming is all that passes for FP in python, and a failed attempt to learn Haskell. But I’ve picked up a few tricks along the way.

The thing that took some adjusting to is not having a for-loop, but rather having to think in terms of recursion.

Also mutation is not so straightforward. I only used it once.

Assignment

Variables can be assigned in the middle of a pipeline, for example

$ echo '[1,2,3,4,5]' | jq '(length | debug) as $l | debug | add / $l'
["DEBUG:",5]
["DEBUG:",[1,2,3,4,5]]
3

It’s not needed in that example, but you get the idea. It allows you to perform some calculation on the current value and capture that into a variable. It then passes along the original current value unchanged.

This is convenient if a value needs to be reused, or just to make the code more readable

Debugging

The errors from jq tend to be terse, and often not that helpful.

In this case, it’s useful to throw in some debug statements

$ printf "1\n2\n3\n" | jq -Rn 'inputs | debug | tonumber'
["DEBUG:","1"]
1
["DEBUG:","2"]
2
["DEBUG:","3"]
3

Yes, this is akin to putting print statements everywhere. You work with what you’ve got :)

Basically, it prints out the input value then passes the input along unchanged.

There’s a variation where you can pass it a message e.g. debug("hello") but that isn’t supported in the version installed on my laptop.

Documentation

The manual is a bit hit and miss.

For example, the search box is more like ‘jump to heading’

I wanted to find a way to sum an array of numbers, so I searched sum, no match. total, no match. I knew about reduce so I implemented sum with that. Then I was scrolling through the docs looking for something else and spotted add, which was exactly what I had wanted.

Long story short, Ctrl+F and google are your friends.

Formatting

As far as I can find, there’s no standard formatter in the vein of black, gofmt, prettier for jq

So for my scripts I had to go with what felt right to me.

The Puzzles

Day 1

For part 1 we need to pick out the first and last digit from a string, the wrinkle being there may be only one digit present.

Regular expressions are the obvious choice for this task - scan("\\d") - returns an array of digits (strings), from which we can grab the first .[0] and last .[-1] (or indeed ‘first’ and ‘last’)

For part 2, we also have to account for digits written out as letters, and the ‘obvious’ solution is to string replace words for digits, at which point the rest of the solution is the same as for part 1.

The tricky bit is that digit names may overlap, e.g. in eightwothree ‘eight’ is the first digit name, but if we substitute digit names in numerical order, we’d replace ‘two’ to get eigh2three, losing ‘eight’

To work around this, I had a sudden flash of inspiration while brushing my teeth (as one often does) - what if we substitute the digit numeral, wrapped in its name.

For example we replace two with two2two. When we do that in the example, we get eightwo2twothree. Now we haven’t lost ‘eight’.
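The trick isn't specific to jq. Here's the same idea sketched in Python rather than as the actual jq script; `sub_numbers` borrows the name used below, `calibration` is my own.

```python
import re

NAMES = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]


def sub_numbers(line: str) -> str:
    """Replace each digit name with name + numeral + name, e.g. two -> two2two."""
    for value, name in enumerate(NAMES, start=1):
        line = line.replace(name, f"{name}{value}{name}")
    return line


def calibration(line: str) -> int:
    """First and last digit in the line, read as a two digit number."""
    digits = re.findall(r"\d", line)
    return int(digits[0] + digits[-1])


print("eightwothree".replace("two", "two2two"))  # eightwo2twothree
print(calibration(sub_numbers("eightwothree")))  # 83
```

Because the name is kept on both sides of the numeral, any overlapping name to the left or right is still there for a later substitution to find.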

The final piece (arguably unnecessary) is solving both parts in one.

As noted, after transforming the input for part 2, it’s solved in the same way as part 1

So for each line, we create an array of [., sub_numbers] ([part1, part2]), find digits, and transpose from

[[line1-part1, line1-part2], [line2-part1, line2-part2], ...]

into

[[line1-part1, line2-part1, ...], [line1-part2, line2-part2, ...]]

then sum up each part.

This transpose trick comes up often.

Day 2

This one looks hard at first glance. The trick is parsing it, with regex and lots of splitting, into the right structure

The key bit is getting an array of [colour, count] pairs into a mapping (object) of {colour: count} using from_entries

Once you have that, the actual solution is straightforward, just applying a couple of functions.

Day 4

An observation which makes this one easier - any given number will only appear once on either side of the |, so we just parse all the numbers in a line into a single array, group the numbers together, then look for the ones there are two of.

$ echo '[1,2,3,4,2,5,1]' | jq -c 'group_by(.) | debug | map(length)'
["DEBUG:",[[1,1],[2,2],[3],[4],[5]]]
[2,2,1,1,1]

(-c means compact format; otherwise the result would be pretty-printed/split across multiple lines)

Part 2 was more interesting. We start with a list of (wins, count) for each ticket. For each ticket we add count to the wins number of subsequent tickets, then return the number of this ticket plus a recursive call, e.g.

  r([(2,1), (1,1), (0,1)])
= 1 + r([(1, 1+1), (0, 1+1)])
= 1 + 2 + r([(0, 2+2)])
= 1 + 2 + 4
= 7
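The same recursion, sketched in Python instead of jq (the function name is illustrative):

```python
def total_tickets(tickets):
    """tickets is a list of (wins, count) pairs, in order."""
    if not tickets:
        return 0
    (wins, count), rest = tickets[0], tickets[1:]
    # Each copy of this ticket adds one copy to each of the next `wins` tickets.
    bumped = [(w, c + count) if i < wins else (w, c) for i, (w, c) in enumerate(rest)]
    return count + total_tickets(bumped)


print(total_tickets([(2, 1), (1, 1), (0, 1)]))  # 7
```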

Day 5

This is another one I didn’t think I could do. But again, once you get past parsing the input it’s a lot clearer.

Then it’s just raw calculation.

I didn’t manage to solve part 2. I did try, but my solution didn’t scale.

But while we’re on the subject, splitting an array into chunks was surprisingly difficult; I’m surprised there isn’t a built-in for it.

The solution I came up with was a sliding window using foreach, which emits the current pair every other iteration.

echo '[1,2,3,4,5,6]' | jq -c 'foreach .[] as $i ([0, null, null]; [.[0] + 1, .[2], $i]; debug | if .[0] % 2 == 1 then ("skipped" | debug | empty) else .[1:] end)'
["DEBUG:",[1,null,1]]
["DEBUG:","skipped"]
["DEBUG:",[2,1,2]]
[1,2]
["DEBUG:",[3,2,3]]
["DEBUG:","skipped"]
["DEBUG:",[4,3,4]]
[3,4]
["DEBUG:",[5,4,5]]
["DEBUG:","skipped"]
["DEBUG:",[6,5,6]]
[5,6]

Day 6

If you write out the formula for distance vs time, what we want to find is (T - x) * x > D or x^2 - Tx + D < 0

In other words, it’s a quadratic equation (inequation?), and we want to find the integer values of x which give a value less than 0, which we can get with the quadratic formula

T - sqrt(T^2 - 4D)        T + sqrt(T^2 - 4D)
------------------ <  x < ------------------
        2                         2

To get the count, we take the difference of (the floor of the larger value) and (the ceil of the smaller value), plus one. There’s an edge case where this doesn’t work, when the bounds themselves are integers, as is the case with one of the examples.

But that wasn’t the case in any of my puzzle inputs, so I ignored it :D
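For completeness, here's a Python sketch (not the jq solution) that also covers the edge case. Instead of floor(larger) and ceil(smaller), it takes the smallest integer strictly above the lower bound and the largest strictly below the upper bound, so integer bounds are excluded automatically. The values of T and D are just small illustrations.

```python
from math import ceil, floor, sqrt


def ways_to_win(T: int, D: int) -> int:
    """Count integers x with (T - x) * x > D."""
    discriminant = T * T - 4 * D
    if discriminant < 0:
        return 0  # the parabola never dips below zero
    root = sqrt(discriminant)
    low, high = (T - root) / 2, (T + root) / 2
    first = floor(low) + 1  # smallest integer strictly above low
    last = ceil(high) - 1   # largest integer strictly below high
    return max(0, last - first + 1)


print(ways_to_win(7, 9))     # 4
print(ways_to_win(30, 200))  # 9, the bounds are exactly 10 and 20
```

Floating point square roots are fine for numbers this size; for very large inputs an integer square root would be the safer choice.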

Day 7

My first thought, taking inspiration from day 4, was to ‘quantify’ each hand using group_by and length. But how to order them? I wrote out the possibilities

5
4,1
3,2
3,1,1
2,2,1
2,1,1,1
1,1,1,1,1

These are all the ways to partition 5 (not counting permutations). Not that that helps here. But it occurred to me, if I pad them with 0 until they’re all length 5, then they sort in the right order, i.e. 50000 > 41000 > 32000 > 31100, etc

But what about the values of the cards themselves? As they are, they’re not sortable because e.g. king is higher valued than queen, but K is less than Q lexically.

The dumb solution I came up with was to translate the face cards into their equivalent hex value, i.e. T -> A, J -> B, etc.

We then concatenate the hand type with the hexified cards to get a ‘canonical’ form, e.g.

32T3K -> 2111032A3D
T55J5 -> 31100A55B5
KK677 -> 22100DD677
KTJJT -> 22100DABBA
QQQJA -> 31100CCCBE

Then finding out the ‘power’ ordering is a simple lexical sort

For part 2, we count the Js, quantify the hand without them, then add the J count to the largest of the remaining groups, e.g.

KTJJT -> 2 + KTT -> 2 + (2,1) -> (4,1)

and when converting to hex we replace J with 1 instead of B

Then the rest works the same as part 1
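To make the canonical form concrete, here's a Python sketch of both parts (again, not the jq script; the names are mine). The all-jokers hand is handled by treating "no remaining groups" as a single empty group.

```python
from collections import Counter

HEX = str.maketrans("TJQKA", "ABCDE")        # part 1: J -> B
HEX_JOKER = str.maketrans("TJQKA", "A1CDE")  # part 2: J -> 1


def hand_type(cards: str, jokers: bool = False) -> str:
    wild = cards.count("J") if jokers else 0
    rest = cards.replace("J", "") if jokers else cards
    # Group sizes, largest first; [0] covers a hand made entirely of jokers.
    groups = sorted(Counter(rest).values(), reverse=True) or [0]
    groups[0] += wild
    groups += [0] * (5 - len(groups))  # pad with zeros to length 5
    return "".join(map(str, groups))


def canonical(cards: str, jokers: bool = False) -> str:
    table = HEX_JOKER if jokers else HEX
    return hand_type(cards, jokers) + cards.translate(table)


hands = ["32T3K", "T55J5", "KK677", "KTJJT", "QQQJA"]
for hand in sorted(hands, key=canonical):
    print(hand, canonical(hand))
```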

pad is another function which is surprisingly absent from jq. Additionally, the repeat method is unbounded. So I used range + foreach to generate an array of 5 zeros, then zipped (transposed) that with the input, which pads the input with null

$ echo '[1,2,3]' | jq -c '[., [foreach range(5) as $i (0; .)]] | debug | transpose'
["DEBUG:",[[1,2,3],[0,0,0,0,0]]]
[[1,0],[2,0],[3,0],[null,0],[null,0]]

then use max to take advantage of the fact null is less than any other value.

Day 8

Part 1 is a fairly straightforward parsing of a tree structure into an object - from_entries is our friend - then a recursive walk for the length.

Part 2 is a classic AoC trap. You try to play it out, then realise it’s going to take forever for it to complete executing that way, and actually the different paths are looping, so you just need to find when the loops coincide.

For that we need to calculate the lowest common multiple of the loop lengths, and to my great shame I had to look up the formula on wikipedia (probably the last time I did an LCM was AoC 2021).
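For reference, the formula is lcm(a, b) = a * b / gcd(a, b), folded over the list. A Python sketch (recent Python versions also ship math.lcm, which does the same job):

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)


def lcm_all(lengths) -> int:
    """Lowest common multiple of every loop length."""
    return reduce(lcm, lengths, 1)


print(lcm_all([4, 6, 10]))  # 60
```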

Day 9

To get the pair-wise difference, we ‘zip’ the input with itself offset by one

$ echo '[1,2,3,4,5]' | jq -c '[.[:-1], .[1:]] | debug | transpose'
["DEBUG:",[[1,2,3,4],[2,3,4,5]]]
[[1,2],[2,3],[3,4],[4,5]]

Otherwise it’s just implementing the procedure as described in the puzzle.
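The same zip-with-offset idea in Python, as a sketch. The `next_value` function is my reading of the puzzle's procedure (keep taking differences until they're all zero, summing the last values on the way), not a translation of my jq.

```python
def differences(values):
    # Pair each value with its successor, like [.[:-1], .[1:]] | transpose
    return [b - a for a, b in zip(values[:-1], values[1:])]

def next_value(values):
    total = 0
    while any(values):
        total += values[-1]
        values = differences(values)
    return total
```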

Day 12

This one I did by brute force - that is, replace each ? with a # or . and see if it matches the pattern.

It was slow - took something like 10mins - but it got there in the end. And more to the point, it was easy to implement.

It did not, however, scale for part 2. I didn’t even bother trying, given how long part 1 took.

Day 13

For this, finding the horizontal reflections didn’t seem too bad - slice each line, does the first half match the reverse of the second half.

But what about vertical reflection?

Then I remembered the trusty transpose function, which turns the vertical problem into the horizontal problem again. Easy.
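A rough Python sketch of the idea, assuming the grid is a list of equal-length strings. `mirror_columns` and `transpose` are hypothetical names, and this is not a port of my jq.

```python
def mirror_columns(rows):
    """Columns c where every row mirrors about the line between c-1 and c."""
    width = len(rows[0])
    found = []
    for c in range(1, width):
        size = min(c, width - c)  # only compare the part that overlaps
        if all(row[c - size:c] == row[c:c + size][::-1] for row in rows):
            found.append(c)
    return found

def transpose(rows):
    # Rows become columns, so the other direction reuses mirror_columns
    return ["".join(col) for col in zip(*rows)]
```

So the second direction is just `mirror_columns(transpose(rows))`.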

Part 2, not so much.

Day 15

For this we have the handy explode function, which turns a string into an array of ‘code points’, which are conveniently the same as ASCII values (yay, unicode)

Day 24

Where (and when) the paths collide can be found algebraically.

Having said that, translating said algebraic solution into jq was horrendous. Seriously, that script should come with a content warning :p

jq-wise, we have the convenient combinations function to generate all the pairs of hailstones. You just need to remove self-pairs and reverse pairs

$ echo '[1,2]' | jq -c '[. ,.] | combinations'
[1,1]
[1,2]
[2,1]
[2,2]

i.e. [1,1] is a number paired with itself, and [1,2] and [2,1] are the same just in opposite orders.
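In Python terms, jq's combinations behaves like `itertools.product`, and what we actually want is `itertools.combinations`. A quick illustration:

```python
from itertools import combinations, product

stones = [1, 2]

# Every ordered pair, including self-pairs (what jq's combinations gives us)
all_pairs = list(product(stones, repeat=2))

# Each unordered pair exactly once: no self-pairs, no reversed duplicates
unique_pairs = list(combinations(stones, 2))
```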

Conclusion

Well, it was fun while it lasted. Given jq is Turing complete [citation needed], it is theoretically possible to solve all the days with jq, but I’m afraid that’s beyond my skills/determination.

In the end, I got 19 out of 50 stars (38%), which I’m pretty sure is a failing grade. Oh well :)

The sad thing is, having learned all this jq, I’ll probably never use it professionally. Anything which requires more complex jq processing than what will fit on a single line would just raise the question - why not write it in python instead? After all, python is more readable and more testable.

To that point, as I was writing this blog I was looking back at my solutions and thinking, “erm.. how does this work again?”

Still, a fine way to pass the time before Christmas.

Chris.

[And I did also find the time to crochet more Christmas decorations]

Temperature Blanket

2023-07-30T16:15:00+00:00

Background

A temperature blanket is a sort of physical infographic. It’s a year-long knitting or crochet project. You do a piece of the blanket every day, with the colour based on that day’s temperature (weather).

Within that basic concept, there’s a lot of room for variation — squares or stripes, different stitch patterns etc.

In my case, I chose crochet squares.

Research

To get a sense of the distribution of temperatures — and by extension colours — I wanted to look at temperatures over the course of the previous year (2022). This was surprisingly hard to find. But eventually I came across Visual Crossing.

I loaded the data into Google Sheets to explore

These are daily average temperatures.

As we can see, the temperatures are roughly normally distributed, average around 10C, skewing towards lower temperatures.

Colours

I decided to split the range into 2 degree steps between 0 and 30 C, plus <0 and >30 as singular groups.

Like any knitter/crocheter, I have way too much yarn (my ‘stash’ in the parlance). So I looked at this as an opportunity to use up some of that yarn. I started by laying out the balls I had the most of, arranging them in a rough spectrum: blue→green→yellow→red

I started out with a fairly even split of each colour, but decided to skew them towards blue — I wanted to use blues for the temperatures which ‘feel’ cold (<12C), green for ‘comfortable’, yellow for warm, and red for hot.

I ended up making a couple of adjustments as I was going. I added an extra blue when the temperature dipped below -2C. I also dropped one of the yellows because, when we had a heatwave, it didn’t reach the 3rd yellow despite feeling very warm; I wanted to get to the oranges/reds sooner.

Design

I decided to do squares, since that felt like it would be less tedious than long rows of repeated stitches. It also allows for doing squares on their own, which became important when I ran out of the white joining yarn.

The squares are done in lichen stitch, as described here.

I was originally planning on doing 14x26 (364 total), but that comes out relatively long and thin.

Then I saw someone else’s pattern doing 18x21. That comes out at 378 squares, or 365 (one per day) + 12 (one per month) + 1 (one for the year).

For the month markers I’m representing the month number in binary: excluding the centre and border there are 4 rounds in a square, so I can represent 0-15 (or 1-12 in this case). The colour is taken as the average temperature for the whole month.

I haven’t decided what to do for the final, year marker yet…

The layout of the squares follows a Z-order (Morton) Curve - a space filling curve which ‘preserves locality of data points’. In other words, consecutive days are always placed (relatively) close together; more so than the typical left-to-right top-to-bottom layout.
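As a sketch of how a Z-order index maps to a grid position: split the bits of the index, with even bits going to the column and odd bits to the row. This is the textbook version, and `morton_decode` is an illustrative name. It assumes a power-of-two grid, so it doesn't show how the curve is fitted to an 18x21 blanket.

```python
def morton_decode(index: int) -> tuple[int, int]:
    """Return (column, row) for a position along the Z-order curve."""
    col = row = 0
    bit = 0
    while index:
        col |= (index & 1) << bit         # even bits of the index
        row |= ((index >> 1) & 1) << bit  # odd bits of the index
        index >>= 2
        bit += 1
    return col, row
```

The first four positions trace a small Z (right, down-left, right), and the next four repeat that shape one block over.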

Below is a diagram of the layout, which I also use for tracking progress.

Bonus: Temperature Scarf

While I was exploring the data for the previous year, I also looked at the temperature change from day to day.

The interesting thing here is, while the temperature increases and decreases slowly over the course of the year creating a smooth(ish) gradient, the size of the temperature changes varies a lot more from day to day.

For this, I decided to do a scarf, since another blanket would be too big of a time commitment.

Specifically, I decided to use tunisian crochet, with one row per day. The pattern is 42 stitches, alternating tunisian simple (TSS) and tunisian purl (TPS).

For positive (increasing) changes I start the row with TSS, and for negative (decreasing) change I start with TPS. The effect is that when there are consecutive days of all increasing (or all decreasing) changes, there is a clear line along the fabric.

For the colours, I chose shades of brown, since I have a lot of them. Once again, I laid them out and arranged them in a way that looked good. But you’ll notice the colours don’t strictly gradient from light to dark

Progress

This is what the blanket looks like up to the start of July

The look of it calls to mind the GitHub activity ‘heat map’ (appropriately enough).

We’re at the end of July at the time of writing, and we haven’t made it into the oranges/reds. On the one hand, it would be nice to see those colours incorporated. But on the other hand, I really don’t want it to get that hot 😬

I’m a lot more behind on the scarf, but it looks like this up to early April

Anyway. Quite time consuming, but not a bad way to pass the time while watching TV.

If you have a ravelry account, you can follow my progress here.

UPDATE: the completed blanket

Chris.

[spot the heatwave]

When a Random Sample is as Good as the Whole Thing

2022-10-28T23:19:35+00:00

Background

At a high level, what we have is a Django-based server, with celery for running tasks.

A user submits a workflow, which is translated into a ‘canvas’ of tasks. The tasks are run in a celery worker, often on a different node to the server, and with no access to the server database.

As tasks are running, they may post ‘in progress’ updates to the celery result backend. To pull those updates into the server database (and make them visible to users), we have a periodic background task which polls results from celery and updates them into the server DB.

This worked well for a while, but as the scale of jobs increased, so did the time taken to perform the refresh operation.

As an example, we had a job of ~280K tasks, which took in excess of 3 hours to refresh. This is less than the time taken for the job to complete, but it does mean we don’t get timely updates - it appears to users as if the job is ‘stuck’, not progressing.

Optimisation

The logic goes like this: each time the refresh operation runs, some number of tasks will be newly completed, call it n. This number is on the order of 1-10 tasks per second. So if the refresh runs once a second, we’re refreshing 100K+ tasks, of which only ~10 (0.01%) have changed state.

This is clearly inefficient.

So what if instead we pick out a batch of, say, 1,000 tasks to refresh each time? That seems like it wouldn’t work - if we take 1,000 tasks out of a pool of 100K, what are the odds that any of them will change state?

Maths

Let’s define some variables. Tasks can be divided into 3 overlapping states

U: Unknown state, the task may be pending, it may have run and just not be refreshed yet
C: Completed, the task ran to completion and was refreshed
R: Ran, but hasn’t been refreshed yet (these tasks also belong to U)

So if we have U tasks of unknown state, and we take one at random, the probability it is in state R is R/U. If we take a sample k of tasks, then the number we expect to update to completed is k*R/U.

So in each refresh, we expect the number of complete tasks to increase as

C' = k * R(t) / U(t)

The (t) is because these are functions which vary over time

What are R and U?

Well U is the total number of tasks N minus the number of tasks which are completed C

U(t) = N - C(t)

R is the number of tasks which have run up to that point, minus the number of those which have been marked as completed. If we assume a constant task throughput as previously mentioned (n)

R(t) = n*t - C(t)

Combining these 3 equations, we get

C' = k * (n*t - C)/(N - C)

What now?

Well, the refresh operation is complete when all the tasks have run and been refreshed - that is, when C(t) = N

Substituting that into the above equation and doing a little algebraic sleight of hand, we get

t = N / n

which is the time at which we expect the refresh operation to complete.

Interpretation

The above result is interesting for two reasons.

First, notice that it doesn’t depend on k, the sample size. In other words, even if we take a bigger random sample to refresh at each step, the operation will still take the same amount of time to complete.

Second, if we consider the extreme case - where we refresh every task each time - n tasks complete at each refresh and all n of them will be marked as completed. So the time to complete would be total number of tasks divided by n -> t = N / n

The exact same as random sampling!

This feels wrong. And yet, a simulation shows it to be true.

The above graph shows the number of tasks in unknown state U over time. Initial population N=100,000, job run rate n=40, and random sample size k=1,000
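If you want to try this yourself, here is a minimal simulation sketch. It is not the code behind the graph; the function name, the parameter names and the scaled-down numbers are all made up for illustration. It assumes exactly `rate` tasks finish per step, and that tasks run in id order (which doesn't matter, since the sampling is uniform).

```python
import random

def simulate(total, rate, sample_size, seed=0):
    """Count refresh steps until every task has been marked completed."""
    rng = random.Random(seed)
    unknown = list(range(total))  # state U: not yet marked completed
    ran = 0                       # tasks with id < ran have finished running
    steps = 0
    while unknown:
        steps += 1
        ran = min(total, ran + rate)  # `rate` tasks finish per step
        picked = set(rng.sample(unknown, min(sample_size, len(unknown))))
        # A picked task is marked completed only if it has actually run
        unknown = [t for t in unknown if not (t in picked and t < ran)]
    return steps

refresh_all = simulate(total=2000, rate=10, sample_size=2000)
sampled = simulate(total=2000, rate=10, sample_size=50)
```

With these numbers, refreshing everything takes exactly 2000 / 10 = 200 steps. The sampled run can't finish earlier, and can finish at most 40 steps later, because once every task has run each sample of 50 clears 50 tasks.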

Sure, the random sampling is slower to update progress at the start. But as the number of tasks waiting to be updated (in state R) accumulates, the hit rate for random sampling increases, ultimately converging on the same completion time as refreshing all.

And going back to the original problem, refreshing 1,000 tasks is significantly faster than refreshing +100K, which gives us more frequent updates. So in a practical sense, the random sampling may actually report progress faster than refreshing all!

Conclusion

Honestly, the random sampling felt like a hacky workaround.

The fact that it works as well as refreshing all - proved, mathematically - is nuts and pretty damn cool.

We are working on a replacement for our ‘job engine’, which may make all this moot. But for a stopgap, it’ll do for me.

Also, apologies for the code-block equations. I can’t figure how to get mathjax working 😕

Chris.

[zero division? never heard of her]

A Data Structure for Pokemon Cards

2022-05-28T23:19:00+00:00

Spring cleaning

Around this time of year, I get an urge to clean out my room, get rid of junk, old books and DVDs. One thing I did this year was to throw out all the boxes from my wardrobe: the boxes for old technologies - laptops, phones - some things I don’t even own anymore.

Anyway. In clearing out my wardrobe, I rediscovered my Pokemon cards.

Well, more accurately, I always knew where they were. They were just hard to get at, buried under all those boxes.

A wild problem appeared

My old Pokemon cards were in a ring binder, filled with card display sheets — nine slots per page, double sided (18 total).

Now, every Pokemon has a number, and I want to arrange my Pokemon cards in numerical order.

The simplest arrangement is to assign one slot for each Pokemon. This is what I had done when I put my cards away roughly 20 years ago. Of course, back then, that was a much more reasonable proposition; there were only 151 Pokemon.

Now there are almost 1,000!

For that I would need around 55 display sheets, not impossible, but probably a tight fit (granted, I could buy a second binder). But more to the point, there are going to be A LOT of empty slots. Surprisingly, of the original 151, I was only missing one Pokemon: Nidoking.

And what I’ve ignored up to this point is variants — there can be multiple, different cards for the same Pokemon. I even have some different versions of the same card: shiny and non-shiny, English and Japanese (and a couple French for some reason).

For my 150, I had just crammed all the variants into a single slot. But I want to see my cards. So ideally I want each variant to get its own slot.

What I’ve described above - one slot per Pokemon - is like a fixed-length array, the sort of thing you have in C. This works fine if you have a fixed number of items, if all (or most) of the slots will be filled, and if you have the storage capacity to allocate all those slots.

(Putting variants into the same slot is like having an array of stacks, but let’s not complicate things).

Everything changed when the booster attacked

Okay, so forget one slot-per-Pokemon. The simplest thing is to just put the cards into the album, in order, without gaps. Easy-peasy.

But then, in a bout of nostalgia, I bought some booster packs.

I want to put these new cards in the album, but I want to maintain the numerical order. So I have to insert a new card in the correct slot, and move everything after it along one slot.

What if I get a new Bulbasaur variant (Pokemon #1)?

Then I’d have to move every one of my hundreds of cards along one slot. That would be a massive, tedious hassle.

In computer science, if we wanted efficient inserts while maintaining order, we might use something like a linked list, or a binary tree, or such like. But those structures don’t really map to our album of pages of slots.

We could do like a database - insert the cards in arbitrary order, and keep an index of where each one is. But that’s not really what we’re going for; we want to be able to browse the cards in order. An ordered scan of an index would involve a lot of jumping around.

With our desired arrangement, inserting a new card is always going to require moving things around (unless the new card happens to go in the very end of the album). All we can do is try to minimise the number of ‘move operations’.

Paging

How about this: when we want to insert a new card, first insert an empty page after the page where we want to insert the card. Insert the card, and move everything along one, overflowing onto the newly inserted, empty page. With this, we do, at most, 18 moves per insert.

Actually, let’s refine that slightly. If we do multiple inserts, we don’t want to be inserting pages willy-nilly.

So let’s say, if the page we want to insert into (or the one after) has a free slot, insert the new card and move everything along as appropriate. Otherwise, insert a new page.

Another tweak is, if the card is inserted on the first side of the page, insert the new page in front instead of behind.

This method reduces the number of moves significantly, but could lead to lots of empty slots and pages that contain only one card. Not ideal.

But perhaps we can have some ‘compaction’ process — that is, if I find myself bored on a Sunday afternoon, I can move all the cards up to eliminate the empty slots.

(Anyone who had a Windows PC in the 90s will likely be familiar with defragmenting).

Can we do better?

Divide and conquer

In the previous scheme, we insert a page and move the cards down one slot, leading to a page with a single card. What if, instead, we insert a new page, and move half of the cards to the new page. Then, insert the new one and move around as necessary.

As before, we have a maximum of 18 moves per insert, but now every page is at least half full. This, I think, is much more aesthetically pleasing.

And as before, we can do compaction as necessary.
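In code terms, insert-and-divide might look something like the sketch below. It models the album as a list of sorted pages, each a list of Pokemon numbers. It deliberately skips the "use a free slot on the next page" refinement, and `insert_card` is an illustrative name.

```python
from bisect import bisect_left, insort

PAGE_SIZE = 18  # nine slots per side, double sided

def insert_card(pages, card):
    """Insert card into the album in place, splitting a full page in half."""
    if not pages:
        pages.append([card])
        return
    # First page whose last card is >= the new card (else the final page)
    lasts = [page[-1] for page in pages]
    i = min(bisect_left(lasts, card), len(pages) - 1)
    if len(pages[i]) == PAGE_SIZE:
        half = PAGE_SIZE // 2
        # Insert a new page and move half of the cards onto it
        pages[i:i + 1] = [pages[i][:half], pages[i][half:]]
        if card > pages[i][-1]:
            i += 1
    insort(pages[i], card)
```

A split only ever produces two half-full pages, which is where the "at least half full" property comes from.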

Swiss cheese

What about this — instead of putting a card in every slot, we put a card in every other slot.

Half of the time (ish) we can just insert a card without moving anything. If the slot we want to put the card in is occupied, we move things up; and because of the diffuse arrangement of cards, we won’t have to do much moving up until we hit an empty slot.

And if a single page gets (too) full, we just insert a new page, and spread that full page out into it.

This, again, is more aesthetically pleasing than a mostly empty page.

What’s the average moves per insert? Answers on the back of a postcard.

Of course, the main drawback is that the album is, on average, 50% full. Which isn’t very efficient.

Placement

Up to this point, I’ve assumed that if there are multiple free slots that a card could go in, I would choose the first.

For example, if we have 1, E, E, E, 5 and want to insert 3, we would probably put it in the first available slot - 1, 3, E, E, 5

But if we later want to add 2, now we have to move the 3 we already placed.

So an improvement to this would be to instead place the 3 in the second empty slot - 1, E, 3, E, 5 - then when we come to insert 2, we don’t have to move anything.

In real life, the cards aren’t likely to be so evenly spaced out, but we can at least place the card at a relative position, i.e. 3 goes roughly half way between 1 and 5.

Possible downside to this approach is it will tend to scatter the cards, which may look messy. On the other hand, it may look better to have them more evenly spaced.

It’s also possible that it would require more moves when compacting.

Algorithm, I choose you!

The approach I ultimately went with was insert-and-divide.

I don’t have any evidence to support this being the superior technique. It just feels best. I’m tempted to do some simulations to put some actual numbers on it (watch this space…)

In any case, it seems to work well enough. Tho one thing I didn’t anticipate was how hard it is to open and close the binder rings.

Chris.

[the diagrams are only 4 slots because I’m lazy, okay]

Advent of Code 2021, Day 1 in 52 Chars of AWK

2021-12-02T22:40:00+00:00

Here is a solution to day 1 (both parts) in 52 characters of AWK

{a+=($0>x);b+=($0>z);z=y;y=x;x=$0}END{print a-1,b-3}

Okay, this is going to take some unpacking…

Background

This is my 4th year doing Advent of Code. For the last couple of years, I used rust, but this year… I dunno, maybe the novelty’s worn off.

Last year, I did a few of the early puzzles as Python one-liners. For this year, I wanted to see if I could do any as bash one-liners. But for day one, it ended up as just an awk one-liner, which I code-golfed down to 52 characters.

To explain this, I’m going to start by deriving a solution in python, then I’ll explain how this translates into awk. I think it’ll be clearer that way.

Problem

You can see the full problem description here, but it boils down to counting the number of times values increase in a sequence of integers. Then in part two, it’s mostly the same, except using a sliding, three-value window.

Python

Let’s start with a basic solution to part one in python

count = 0
for i in range(1, len(items)):
    if items[i] > items[i-1]:
        count += 1

Simple enough.

In fact, we can take advantage of the fact that booleans are basically just 1 and 0, like so

count = 0
for i in range(1, len(items)):
    count += (items[i] > items[i-1])

Or if we really wanted, we could compress this down to a one-liner

sum((items[i] > items[i-1]) for i in range(1, len(items)))

But that’s not what we’re trying to do this year. This year we’re doing awk!

Okay, now let’s look at part 2 - one solution would be

count = 0
for i in range(len(items)-3):
    if sum(items[i+1:i+4]) > sum(items[i:i+3]):
        count += 1

But actually, we can simplify this a little

Consider we have values A B C D

To compare the first window of 3 values to the next window, we compare: (A + B + C) < (B + C + D)

Notice we have B + C on both sides of the comparison. So we can cancel them out to give us just: A < D

Now, back to python

count = 0
for i in range(3, len(items)):
    if items[i] > items[i-3]:
        count += 1

This looks a lot like our solution to part 1!
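As a sanity check on the cancellation, here is a small sketch comparing the window-sum count with the simplified count. The function names and the sample list are just for illustration.

```python
def count_by_window_sums(items):
    # Compare each three-value window with the next one
    return sum(
        sum(items[i + 1:i + 4]) > sum(items[i:i + 3])
        for i in range(len(items) - 3)
    )

def count_by_ends(items):
    # After cancelling the shared values, only the two ends matter
    return sum(items[i] > items[i - 3] for i in range(3, len(items)))

sample = [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]
```

Both functions give the same count for any list, since the two comparisons differ only by the shared middle values.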

Indeed, we can combine them into a single loop

part1 = part2 = 0
for i in range(len(items)):
    if i >= 1:
        part1 += items[i] > items[i-1]
    if i >= 3:
        part2 += items[i] > items[i-3]

The first if could be dropped by starting the range at 1, but I’ve included it for symmetry.

Alright, I think we’re ready to switch to awk

Awk

It turns out awk is actually pretty good for Advent of Code (or the easy ones at least).

Awk is a stream processing language (like sed). You give it a file, and it loops over the lines of the file doing some processing. Considering a lot of the AoC inputs are text files with one ‘value’ per line, this works quite nicely.

The drawback is, it only gives us access to the ‘current’ line; we don’t know what the next line is until we receive it. And we only know the previous line if we store it.

Let’s start small, with just part 1

NR > 1 {count += ($0 > prev)}
{prev = $0}
END {print count}

Okay, there’s a few things to explain here

NR is a builtin variable, it’s effectively the current line number (starting from 1)
variables in awk are automatically initialised - so the first time we reference the variable count, it’s created and set equal to 0
awk splits lines into columns, based on whitespace. The first column is assigned to variable $1, the second to $2, etc. and the variable $0 represents the entire line
numerical columns (or indeed a whole numerical line) are automatically treated as numbers
as with python, bools are equivalent to 1 and 0
the END block runs only after all the input lines have been processed

In python, the above would be equivalent to

# implicit variables
count = 0
prev = 0

# main loop
for NR, var0 in enumerate(items, 1):
    if NR > 1:
        count += (var0 > prev)
    prev = var0

# END
print(count)

Now, part 2 is similar, but a little messier, since we need to keep track of three prior lines

NR > 3 {count += ($0 > prev3)}
NR > 2 {prev3 = prev2}
NR > 1 {prev2 = prev1}
{prev1 = $0}
END {print count}

At this point, you can perhaps see how we would combine the two solutions

NR > 3 {part2 += ($0 > prev3)}
NR > 2 {prev3 = prev2}
NR > 1 {
    part1 += ($0 > prev1);
    prev2 = prev1
}
{prev1 = $0}
END {print part1, part2}

Easy-peasy

Golfing

Now we can start getting rid of redundancies.

First, recall that variables are auto-initialised. So for this line

NR > 2 {prev3 = prev2}

it doesn’t matter that prev2 isn’t set for the first two iterations

NR > 3 {part2 += ($0 > prev3)}
NR > 1 {part1 += ($0 > prev1)}
{
    prev3 = prev2;
    prev2 = prev1;
    prev1 = $0
}
END {print part1, part2}

But what if we dropped the conditionals completely?

{
    part2 += ($0 > prev3);
    part1 += ($0 > prev1);
    prev3 = prev2;
    prev2 = prev1;
    prev1 = $0
}
END {print part1, part2}

Okay, suppose we have values 1 2 3 4. For the first 4 iterations, at the point when we update part1 and part2, the ‘prev’ variables are as follows

(NR=1) prev3 = 0; prev2 = 0; prev1 = 0
(NR=2) prev3 = 0; prev2 = 0; prev1 = 1
(NR=3) prev3 = 0; prev2 = 1; prev1 = 2
(NR=4) prev3 = 1; prev2 = 2; prev1 = 3

Now we can be a bit cheeky, and observe that none of our puzzle input values are 0 (or negative).

So for the first iteration, $0 > prev1 will always evaluate to 1 (true). Similarly, for the first 3 iterations, $0 > prev3 will also always evaluate to 1 (true).

All this means is, we have to apply a ‘correction’ to the final results: print part1 - 1, part2 - 3

After that, it’s just a matter of reducing all the variables to single characters, and dropping all the whitespace

{a+=($0>x);b+=($0>z);z=y;y=x;x=$0}END{print a-1,b-3}

Voila! 52 characters.
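If the golfed version is hard to follow, here is a line-for-line Python transliteration (a sketch; the function name is mine). It relies on the same assumption: every input value is greater than 0, and there are at least three of them.

```python
def golfed(items):
    a = b = 0
    x = y = z = 0  # awk auto-initialises unset variables to 0
    for value in items:
        a += value > x  # always true on the first line
        b += value > z  # always true on the first three lines
        z, y, x = y, x, value
    return a - 1, b - 3  # remove the spurious counts

sample = [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]
part1, part2 = golfed(sample)
```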

Bonus: Day 2

This one worked out quite nicely too - both parts in 57 characters

/f/{x+=$2;y+=a*$2}/n/{a+=$2}/u/{a-=$2}END{print x*a,x*y}

The trick here is using a regex match - /pattern/ - on a single character which is unique to each instruction - f for forward, u for up, and n for down (since d o w also appear in forward)

In terms of solving both parts, we note that the x distance is calculated the same in both parts, and the a(im) in part 2 is calculated the same as the y direction was in part one.
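For reference, a Python sketch of what the awk is doing (the function name `dive` is illustrative). The three `if`s mirror the three regex patterns, and `aim` plays the part of `a`.

```python
def dive(lines):
    x = y = aim = 0
    for line in lines:
        word, amount = line.split()
        amount = int(amount)
        if "f" in word:  # forward
            x += amount
            y += aim * amount
        if "n" in word:  # down
            aim += amount
        if "u" in word:  # up
            aim -= amount
    # Part 1 uses the aim as its depth; part 2 uses the real depth
    return x * aim, x * y
```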

Bonus: Day 3

I only managed the first part for this one, and it clocks in at a chunky 92 characters

BEGIN{FS=""}{for(i=1;i<=NF;i++)a[NF-i]+=$i}END{for(k in a)2*a[k]>NR?g+=2^k:e+=2^k;print g*e}

Setting ‘field separator’ (FS) to empty allows us to treat individual characters as columns. NF is another of the builtin variables - in this case the number of columns (fields). We use an array to sum the bits down each column; as with variables, the array is initialised on demand.

Then, in the END block, we compare each column sum to the number of lines - 1 is most common if it appears in more than half the lines. We use that to convert the sums array to a decimal number (gamma) and its complement (epsilon).

Part 2 requires multiple passes over the puzzle input, so even if I came up with a solution, it would probably exceed 100 chars. I certainly couldn’t combine it with the solution to part 1.

Bonus: Day 5

This one’s a bit of a monster - 183 characters

BEGIN{FS=" -> |,"}{x=$1<$3?1:$1>$3?-1:0;y=$2<$4?1:$2>$4?-1:0;while($1!=$3+x||$2!=$4+y){a[$1,$2]+=$1==$3||$2==$4;b[$1,$2]++;$1+=x;$2+=y}}END{for(k in b){t+=a[k]>1;u+=b[k]>1};print t,u}

I think this would take a post all of its own to explain…

Bonus: Day 6

170 characters

BEGIN{RS=","}{a[$1]++;t++}END{while(i<256){u=i++==80?t:u;for(k in a){n=a[k];if(k!=0)b[k-1]+=n;else{t+=n;b[6]+=n;b[8]+=n}}delete a;for(k in b)a[k]=b[k];delete b}print u,t}

This one felt like I was really fighting awk. Annoyingly, it doesn’t let you assign arrays to other variables.

Here, we’re using RS (record separator) to split the comma-separated inputs into lines. If we used FS, we’d have to do a for-loop.

Note, deleting an array with delete a is not supported in all awk implementations. Also the part 2 answer is so large, it needs to be run with -vOFMT="%.0f" to get it in non-scientific format.

Bonus: Day 7

152 characters

BEGIN{RS=","}{a[NR]=$1;m=$1>m?$1:m}END{t=NR*m;u=t*m;for(;x<m;x++){y=z=0;for(k in a){d=x-a[k];d=d<0?-d:d;y+=d;z+=d*(d+1)/2}t=y<t?y:t;u=z<u?z:u}print t,u}

At this point I’m putting more effort into coming up with contrived ways to drop characters than into actually solving the day’s problem.

Of note, awk lets you do chained assignment - e.g. y = z = 0 being equivalent to y = 0; z = 0. Notice also we can drop the initialisation from the for-loop - for(;x<m;x++) - since an undefined variable is initialised to 0



Part 2 is mostly the same as part 1, but using triangular numbers

Bonus: Day 8

This one is only part 1, but much simpler than the previous few days - 49 characters

{for(i=12;i<16;i++)t+=length($i)%5>1}END{print t}


Pretty straightforward.

What’s interesting is using modulo to squash a bunch of case checks. In essence, the task is to count the number of strings of length 2, 3, 4, or 7. From the puzzle description, we note that the only other possible lengths are 5 and 6. So, if we take the length modulo 5, then 5 and 6 become 0 and 1, and 2, 3, 4, 7 become 2, 3, 4, 2, respectively. Hence, length % 5 > 1
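A quick check of the modulo trick in Python, over the six possible lengths:

```python
lengths = [2, 3, 4, 5, 6, 7]

# Keep only the lengths that pass the same test as the awk one-liner
easy = [n for n in lengths if n % 5 > 1]
remainders = {n: n % 5 for n in lengths}
```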

Bonus: Day 10

By far the longest, but still fits in a tweet - 257 characters

BEGIN{FS=""}{o=0;for(i=1;i<=NF;i++){w=index("([{<)]}>",$i);if(w>4){if(s[o]==w-4)o--;else{a+=w==5?3:w==6?57:w==7?1197:25137;o=0;break}}else s[++o]=w}t=0;while(o)t=5*t+s[o--];if(t){for(j=0;j<n;j++)if(t<b[j]){x=b[j];b[j]=t;t=x};b[n++]=t}}END{print a,b[n/2-.5]}


The problem involves bracket matching. Initially, I thought this would not be possible in awk with its ‘so-called’ arrays (actually mappings), which lack ‘push’ and ‘pop’ operations. But I had a moment of inspiration, and realised one can store items by an incremental index and use a ‘pointer’ to track the head of the ‘stack’ and simulate popping.

Also buried in there somewhere is a little insertion sort, since part two requires finding the middle value in a list (never easy).
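To make the pointer-as-stack idea a bit more concrete, here is a Python sketch that uses a plain dict in place of the awk array. It only does the mismatch detection (no scoring, no sorting), and the names are illustrative.

```python
BRACKETS = "([{<)]}>"

def first_mismatch(line):
    """Return the first closer that doesn't match, or None if there isn't one."""
    stack = {}  # position -> bracket code, standing in for the awk array
    top = 0     # 'pointer' to the head of the stack; 0 means empty
    for ch in line:
        code = BRACKETS.index(ch) + 1  # 1-4 openers, 5-8 closers
        if code > 4:
            if stack.get(top) == code - 4:
                top -= 1  # 'pop' by moving the pointer back
            else:
                return ch
        else:
            top += 1      # 'push' by writing at the next index
            stack[top] = code
    return None
```

Nothing is ever removed from the mapping; stale entries above the pointer just get overwritten by the next push.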

Bonus: Day 13

I got a solution for part 1 using 89 chars of awk, by using a cheeky tac (110 chars total)

tac "$@" | awk -F, '/=/{split($1,a,"=");d=a[2];i=$0~/x/?1:2}NF>1{d<$i?$i=2*d-$i:0;t+=!s[$1,$2]++}END{print t}'


I have a pure awk solution to both parts, but it’s 247 chars (you can see it in my GitHub)

For the above, I’d especially like to highlight t+=!s[$1,$2]++, just for the fact it’s doing, like, 3 things at once.

Bonus: Day 17

Finally, my physics degree pays off! (part 1 only, 36 chars)

BEGIN{FS="=|\\."}{print $5*($5+1)/2}
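The physics, briefly: a probe fired upwards at speed v comes back through y=0 moving at v+1 downwards, so the fastest useful launch just lands on the bottom edge of the target. The peak height is then a triangular number of the target's lowest y. A Python sketch (assuming the target sits below y=0; the function name is mine):

```python
import re

def highest_point(target: str) -> int:
    # Same split as the awk field separator; awk's $5 is index 4 here
    y_min = int(re.split(r"=|\.", target)[4])
    return y_min * (y_min + 1) // 2

peak = highest_point("target area: x=20..30, y=-10..-5")
```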


Summary

I want to be clear that I’m no awk expert; I only dabble by way of using bash for work. Maybe someone else can golf these down even further.

If you want to actually run these awk commands, you can save them to a file, and run them like

awk -f day1.awk input.txt


Or directly from the cli like so

awk '{a+=($0>x);b+=($0>z);z=y;y=x;x=$0}END{print a-1,b-3}' input.txt


Anyway, if I come up with any good ones for later days, I’ll update this post with them.

You can also find all my solutions (including python ones) in my GitHub repo.

Chris.

[pretty AWK-some… am I right? guys? where are you going?]



How I Approach Software Design

2021-08-29T17:29:00+00:00

What I wanted to talk about today is my approach to designing a piece of software, by looking at three projects I’ve worked on ¹

But before we get to the examples, let’s lay out the approach


1. find the ‘core concept’

that is, when you strip it down to the most basic level, ignoring the implementation details, the edge cases, the error handling or whatever, what is the basic mechanism by which information flows from an input to a result. When you find this core concept, it’s usually so basic that it feels self-evident, as you will see.

2. figure out what changes or may change

for example if you needed to handle a different input, support a different (3rd party) service.

3. figure out how those two pieces fit together


That is where the magic happens.

So, let’s look at some examples

Event Sync

This is a project I’ve alluded to in previous blog posts

We have a cloud storage service. When you perform some action (create a file, delete a file) it generates an event, which we can poll using their API. So what we want to do is sync changes to a local file system by fetching events and performing the corresponding changes (download a file, delete a file).

In terms of core concept, this one is perhaps the most obvious - it’s event driven. We have an event source and an event consumer.

As for what changes; whenever I’m working on a project which interacts with a 3rd party service, the question in the back of my mind is always - what if we want (or need) to switch out that service for another one? Suppose we want to support a different (or additional) storage service?

So we have the official SDK, which generates Event objects. But those objects are specific to that SDK. With an eye on being able to replace the service, we don’t want to deal with those SDK events directly. Instead, we define our own, generic Event type.

@dataclass
class Event:
    type: EventType
    path: Path
    timestamp: datetime
    # etc


What this means is, the event consumer doesn’t care where the events come from, they look the same in any case ²

Similarly, we can switch out the event consumer so that instead of reflecting changes on a file system, it could reflect them on a different storage service (cloud-to-cloud sync). Again, the event source doesn’t care what happens to the events it emits.

Another nice thing is we can stick some filtering in the middle. As long as the filtering takes in generic Events and spits back out generic Events, neither the source nor the consumer even needs to know it exists.
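To make the filtering idea a bit more concrete, here’s a minimal sketch. The EventType members, the ignore_temp_files filter (and its ‘.tmp’ rule) and the stand-in consume function are all invented for illustration; the real project looks different.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum, auto
from pathlib import Path
from typing import Iterable, Iterator, List


class EventType(Enum):
    # hypothetical members; the real set depends on the service
    CREATED = auto()
    DELETED = auto()


@dataclass
class Event:
    type: EventType
    path: Path
    timestamp: datetime


def ignore_temp_files(events: Iterable[Event]) -> Iterator[Event]:
    """Generic Events in, generic Events out (hypothetical filter rule)."""
    for event in events:
        if event.path.suffix != ".tmp":
            yield event


def consume(events: Iterable[Event]) -> List[str]:
    """Stand-in consumer: just records what it would have done."""
    return [f"{event.type.name} {event.path.as_posix()}" for event in events]


# stand-in source: in the real thing these come from polling the service
source = [
    Event(EventType.CREATED, Path("docs/report.txt"), datetime(2021, 8, 1)),
    Event(EventType.CREATED, Path("docs/scratch.tmp"), datetime(2021, 8, 2)),
    Event(EventType.DELETED, Path("docs/old.txt"), datetime(2021, 8, 3)),
]

# the consumer neither knows nor cares that a filter sits in between
actions = consume(ignore_temp_files(source))
```

Because the filter is just an iterable-to-iterable function, several of them can be stacked without touching either end.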

ACL tool

Access Control List (ACL) is basically the permissions on a file. If you’ve dealt with Linux you’ll be familiar with user/group/other read/write/execute. In fact, you can assign permissions to specific users or groups using e.g. setfacl.

So we wanted a tool like setfacl but for a different file system type (one which setfacl doesn’t work on) - a tool for adding or removing permissions for a given user or group.

The basic concept here was


  get the current ACL
  perform some transformation on the ACL
  put the updated ACL back on the file


Step (2) is an obvious thing that changes - this is where we add or remove entries from the ACL.

But we can also make changes to (1) and (3)

For example, instead of reading the ACL from the file, in (1) we can read it from the command line (or stdin), allowing us to copy the ACL off of another file.

Similarly, instead of writing the updated ACL to the file, in (3) we can print the updated ACL (to stdout). This gives us ‘dry run’/test mode functionality.


if args.test:
    putacl = print_acl
else:
    putacl = save_acl

def update_acl(path):
    acl = getacl(path)
    acl = updateacl(acl)
    putacl(path, acl)

for path in treewalk(root):
    update_acl(path)
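The same trick works for step (1). Here’s a rough sketch of copying an ACL off another file, with an ACL faked as a dict of name to permission string and the ‘file system’ faked as a dict. Every name in it (read_acl, fixed_acl, make_updater, add_carol) is made up for the example, not taken from the real tool.

```python
from typing import Callable, Dict

ACL = Dict[str, str]  # hypothetical representation: name -> permissions

# pretend file system: path -> ACL
FILES: Dict[str, ACL] = {
    "a.txt": {"alice": "rw-"},
    "b.txt": {"bob": "r--"},
}


def read_acl(path: str) -> ACL:
    """Step (1), normal case: read the ACL from the file itself."""
    return dict(FILES[path])


def fixed_acl(acl: ACL) -> Callable[[str], ACL]:
    """Step (1), alternative: ignore the file and start from a given ACL."""
    return lambda path: dict(acl)


def save_acl(path: str, acl: ACL) -> None:
    """Step (3): put the ACL back on the file."""
    FILES[path] = acl


def make_updater(getacl, transform, putacl):
    """Wire the three steps together; any of them can be swapped out."""
    def update_acl(path: str) -> None:
        putacl(path, transform(getacl(path)))
    return update_acl


def add_carol(acl: ACL) -> ACL:
    """Step (2): an example transformation."""
    return {**acl, "carol": "r--"}


# copy a.txt's ACL onto b.txt, adding an entry along the way
update = make_updater(fixed_acl(read_acl("a.txt")), add_carol, save_acl)
update("b.txt")
```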


Workflow engine

In this case, I was refactoring someone else’s code. What it does is send files from one machine to another by uploading to cloud storage from one machine and downloading on the other ³. This is orchestrated using Celery, with the two machines sharing a message queue.

What changes is the cloud storage service used as the intermediate - it may be Google cloud (gcs), Amazon (aws), etc.

The original looked something like this

def send(provider_type, files):
    if provider_type == 'gcs':
        wf = chain(
            upload_to_gcs.s().set(queue='local'),
            download_from_gcs.s().set(queue='remote')
        )

    elif provider_type == 'aws':
        ...  # basically the same thing but with aws tasks
    # etc

    res = wf.apply_async((files,))
    res.get()


A big switch-alike statement like this is definitely a code smell

One solution to this might be to define separate send_via_gcs, send_via_aws, etc. tasks. But another thing that changes is the workflow itself - that is, we want to support additional workflows, such as download-only for a file which is already in the cloud, or to delete files from a remote machine. Separate workflow functions for each provider don’t scale well.

So the core concept here was building up a workflow (chain) of these tasks. The thing that changes is what cloud provider the tasks interact with. Traditionally, we might do something like this with classes and polymorphism, but that doesn’t really work for celery tasks… or does it?

Consider

class GCSProvider:

    def upload(self):
        return upload_to_gcs.s()

    def download(self):
        return download_from_gcs.s()


def send(provider, files):
    wf = chain(
        provider.upload().set(queue='local'),
        provider.download().set(queue='remote')
    )

    res = wf.apply_async((files,))
    res.get()


For extensibility, a new provider just needs to implement the Provider interface, or we might add new tasks (methods) to the interface. And as long as the interface is consistent across all providers we can (and did) define new workflows which will work with any of the providers.
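As a sketch of what ‘a new workflow for any provider’ might look like, here’s a Celery-free stand-in. It assumes a ‘task’ is just a callable and a ‘chain’ is a loop, so Task, FakeProvider, run_chain and fetch are all hypothetical names; the real code builds Celery signatures as shown above.

```python
from typing import Callable, List, Protocol

# Stand-in for a Celery signature: a callable taking and returning file names
Task = Callable[[List[str]], List[str]]


class Provider(Protocol):
    def upload(self) -> Task: ...
    def download(self) -> Task: ...


class FakeProvider:
    """Hypothetical provider which records what its tasks were asked to do."""

    def __init__(self, name: str):
        self.name = name
        self.log: List[str] = []

    def _task(self, action: str) -> Task:
        def task(files: List[str]) -> List[str]:
            self.log.append(f"{action} {','.join(files)} via {self.name}")
            return files
        return task

    def upload(self) -> Task:
        return self._task("upload")

    def download(self) -> Task:
        return self._task("download")


def run_chain(tasks: List[Task], files: List[str]) -> List[str]:
    """Poor man's chain: feed each task's result into the next."""
    for task in tasks:
        files = task(files)
    return files


def send(provider: Provider, files: List[str]) -> List[str]:
    return run_chain([provider.upload(), provider.download()], files)


def fetch(provider: Provider, files: List[str]) -> List[str]:
    """Download-only workflow, for files which are already in the cloud."""
    return run_chain([provider.download()], files)


gcs, aws = FakeProvider("gcs"), FakeProvider("aws")
send(gcs, ["a.txt"])
fetch(aws, ["b.txt", "c.txt"])
```

Neither send nor fetch mentions a specific provider, which is the whole point.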

Conclusion

So that’s all there is to it - what’s the core concept, what’s changeable?

Once you have that down, everything else is just building out - adding flesh to the bones.

This approach perhaps doesn’t apply in all cases. In some cases, it may already be laid out for you (e.g. Django). But if you’re starting on a brand new project with no existing framework, these are good questions to keep in mind.

Chris.

Footnotes


  
    
      these are work projects, so I have to be vague on some of the details. ↩
    
    
      this is also good for testing, since we can test the source and consumer independently ↩
    
    
      it’s a little more nuanced than that, but again, NDA ↩
    
  



Persistent dict with SQLite
2021-07-11T16:48:00+00:00
SQLite

This is a pattern I’ve used a couple of times now.

Basically I needed to store some key-value type data — like a dict in python — but it needed to be persistent, so that if the program is stopped and restarted, the data is retrievable.

In particular, I wanted something light-weight and with no extra dependencies. I’m not dealing with a lot of data, so redis would be overkill.

SQLite is ideal for this sort of thing. Even better, Python has a builtin sqlite3 module.

(I could use SQLAlchemy, and have on other projects, but again, it would be overkill here).

To take a specific use-case, we have a piece of software which generates reports for multiple ‘projects’. Individual project reports can be generated weekly or monthly. We want to keep track of when a report was last generated for each project so that we don’t generate more than one in a given report period.

The basic implementation looks something like this

import sqlite3 as sql
from datetime import datetime


class LastRunSQL:

    def __init__(self, connection: sql.Connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS 'lastrun' "
            "(name TEXT PRIMARY KEY, timestamp TIMESTAMP)"
        )

    def __getitem__(self, name: str) -> datetime:
        item = self.connection.execute(
            "SELECT timestamp FROM 'lastrun' WHERE name=?", (name,)
        ).fetchone()

        if item is None:
            raise KeyError(name)

        return item[0]

    def __setitem__(self, name: str, timestamp: datetime):
        self.connection.execute(
            "REPLACE INTO 'lastrun' VALUES(?,?)", (name, timestamp)
        )
        self.connection.commit()


Note - when no matches are found, fetchone() will return None. So for consistency with python dict behaviour, we have to check the return value and raise a KeyError if nothing is found.

Having the table name as a literal, scattered throughout the code, feels a little messy. If it weren’t SQL, I’d probably use a variable and string formatting. Perhaps I’m being overly cautious.
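For what it’s worth, here is roughly what the variable-and-formatting version could look like. SQL parameters (the ? placeholders) can only bind values, not table names, so the name has to be formatted in. I’d only consider this safe because TABLE is a constant in our own code and never comes from user input; the constant name and the plain TEXT column are just choices for this sketch.

```python
import sqlite3

TABLE = "lastrun"  # a constant we control, never user input

conn = sqlite3.connect(":memory:")

# the table name is formatted in; the values still go through ? placeholders
conn.execute(
    f'CREATE TABLE IF NOT EXISTS "{TABLE}" '
    "(name TEXT PRIMARY KEY, timestamp TEXT)"
)
conn.execute(
    f'REPLACE INTO "{TABLE}" VALUES(?,?)', ("foo", "2021-07-11 16:48:00")
)
row = conn.execute(
    f'SELECT timestamp FROM "{TABLE}" WHERE name=?', ("foo",)
).fetchone()
conn.close()
```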

Types and Testing

Once we have an instance of this class, the access pattern looks like

db['foo'] = datetime.now()
print(db['foo'])


That is, it works like a regular dict.

This is intentional. The idea is that, for testing, we can just switch out for an actual dict ¹.

Now, we could make this a subclass of MutableMapping. But if we look at the definition of MutableMapping, we see that we would need to also implement __delitem__, __iter__, and __len__ methods. And while that would be easy enough to do ², we simply don’t need it here.

So what can we do instead, to satisfy type-checking? ³

class LastRun(Protocol):

    def __getitem__(self, name: str) -> datetime: ...
    def __setitem__(self, name: str, timestamp: datetime): ...


Using a Protocol ⁴ means anything with the __getitem__ and __setitem__ methods will satisfy the LastRun ‘type’ - including our LastRunSQL, as well as Dict[str,datetime]
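To show why that’s handy, here’s a small sketch of code written against the protocol and then exercised with a plain dict. The report_due function and its period argument are hypothetical; they just stand in for whatever decides if a report should be generated.

```python
from datetime import datetime, timedelta
from typing import Dict, Protocol


class LastRun(Protocol):

    def __getitem__(self, name: str) -> datetime: ...
    def __setitem__(self, name: str, timestamp: datetime) -> None: ...


def report_due(
    last_run: LastRun, project: str, now: datetime, period: timedelta
) -> bool:
    """Hypothetical check: has a full period passed since the last report?"""
    try:
        previous = last_run[project]
    except KeyError:
        # never generated before (this is why raising KeyError matters)
        return True
    return now - previous >= period


# in a test, a regular dict stands in for the SQLite-backed class
fake: Dict[str, datetime] = {"alpha": datetime(2021, 7, 1)}
week = timedelta(days=7)

too_soon = report_due(fake, "alpha", datetime(2021, 7, 5), week)
due = report_due(fake, "alpha", datetime(2021, 7, 8), week)
never_run = report_due(fake, "beta", datetime(2021, 7, 5), week)
```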

Extending the Interface

In another case, I needed to be able to insert a lot of entries at once.

With the implementation above, we can happily do

for key, value in items:
    db[key] = value


However, this would be quite inefficient - especially since that implementation does a commit after every insert.

To work around it, I defined another protocol ⁵

@runtime_checkable
class BulkInsertable(Protocol):

    def bulk_insert(self, items: Iterable[Tuple[str, datetime]]): ...


And on the SQL class, added an implementation like

def bulk_insert(self, items: Iterable[Tuple[str, datetime]]):
    """Insert many items into the mapping.

    Unlike ``__setitem__`` it uses INSERT instead of REPLACE.
    Therefore, the mapping should be empty or the items should not
    already exist, or else a 'unique constraint' error may be raised
    """
    self.connection.executemany(
        "INSERT INTO 'lastrun' VALUES(?,?)", items
    )
    self.connection.commit()


Then in the code which did the mass insert, I have

if isinstance(db, BulkInsertable):
    db.bulk_insert(items)
else:
    for key, value in items:
        db[key] = value


Alternatively, this could be done as a baseclass or mixin, where the iterative method is the default implementation. The reason I did it as a protocol was, again, so I could drop in a regular dict without having to change anything.

Connections

Firstly, I should note that, for the datetime to work properly, we need to open the connection with detect_types=sql.PARSE_DECLTYPES
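A quick sketch of the difference it makes, using an in-memory database. One assumption of mine: rather than rely on sqlite3’s default datetime adapter and converter (deprecated as of Python 3.12), this registers an explicit pair. The roundtrip helper is also just for the demo.

```python
import sqlite3
from datetime import datetime

# Explicit adapter/converter pair (my assumption, see above). The converter
# name is matched against the declared column type, case-insensitively.
sqlite3.register_adapter(datetime, lambda value: value.isoformat(" "))
sqlite3.register_converter(
    "timestamp", lambda raw: datetime.fromisoformat(raw.decode())
)


def roundtrip(detect_types: int):
    """Store a datetime, read it back, return whatever sqlite3 gives us."""
    conn = sqlite3.connect(":memory:", detect_types=detect_types)
    conn.execute(
        "CREATE TABLE lastrun (name TEXT PRIMARY KEY, timestamp TIMESTAMP)"
    )
    conn.execute(
        "INSERT INTO lastrun VALUES(?,?)", ("foo", datetime(2021, 7, 11, 16, 48))
    )
    value = conn.execute(
        "SELECT timestamp FROM lastrun WHERE name=?", ("foo",)
    ).fetchone()[0]
    conn.close()
    return value


plain = roundtrip(0)                          # comes back as a str
parsed = roundtrip(sqlite3.PARSE_DECLTYPES)   # comes back as a datetime
```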

For one project, I just kept things as above - passing the database connection into the class init - because I was using the same database for multiple things (multiple tables).

But in this case, we were only using the database in one place, so I added a little helper ⁶

class LastRunSQL:

    # ...

    @classmethod
    @contextmanager
    def open(cls, path) -> Iterator["LastRunSQL"]:
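        # note: the connection's context manager commits (or rolls back)
        # on exit; it doesn't close the connection
        with sql.connect(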
            path, detect_types=sql.PARSE_DECLTYPES
        ) as conn:
            yield cls(conn)


with LastRunSQL.open('example.db') as db:
    ...  # do a thing


Alternatives

Finally, I should mention that, along similar lines there is the builtin shelve module. Similar to what we’ve been discussing, it’s a persistent dict-like object, but in this case backed by pickling the data. Because of this, shelves support more datatypes than sqlite. It also has the benefit (or limitation, depending on your perspective) that it doesn’t have a fixed schema.

On the other hand, the resulting database can’t be used outside of python (if that’s something you care about). And SQLite3 supports concurrent access.

In any case, the nice thing about the protocol we defined is, we can switch SQLite out for anything else that satisfies the interface - including shelve. Or if we needed to scale up, we could use redis after all.
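As a rough illustration of the shelve option, a shelf already behaves the way the protocol expects, including raising KeyError for a missing key. The temporary directory and the ‘lastrun’ file name are only there to keep the sketch self-contained.

```python
import os
import shelve
import tempfile
from datetime import datetime

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lastrun")  # hypothetical file name

    # first 'run' of the program: store a value
    with shelve.open(path) as db:
        db["foo"] = datetime(2021, 7, 11, 16, 48)

    # second 'run': the value is still there
    with shelve.open(path) as db:
        restored = db["foo"]
        try:
            db["bar"]
            missing_raises = False
        except KeyError:
            missing_raises = True
```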

Chris.




  
    
      granted, we could also use a :memory: sqlite database ↩
    
    
      full implementation of a SQLite-backed MutableMapping ↩
    
    
      the ellipsis is valid python. We could also use pass ↩
    
    
      for python < 3.8, you’ll need to install typing_extensions ↩
    
    
      the @runtime_checkable decorator is required to make isinstance check work ↩
    
    
      the order of the decorators is important ↩