loving yourself with EBOOKS

Monday, October 22, 2012 :: Tagged under: pablolife works engineering. ⏰ 6 minutes.


Hey! Thanks for reading! Just a reminder that I wrote this some years ago, and may have much more complicated feelings about this topic than I did when I wrote it. Happy to elaborate, feel free to reach out to me! 😄


While I'm still ramping up at the new gig, I've found a love of code again and recently launched a new side project. I may still work on it yet, but chances are I'll leave it be for a bit while I pursue other interests (hey, it works well as-is, and I'm not shipping it :-p).

It's called Ebooker, and if you'd like to have a look at the source, fire me an email; I have it hosted in a private repo on Bitbucket.

[Update! I've put it on GitHub.]

"Ebooker"?

Ebooker takes its name from @horse_ebooks, one of the most famous Twitter accounts, which (legend has it) was a spam bot that forgot to sell its wares, only remembering to try to sound human to avoid spam detection. As a result, you get a sort of abstract poetry:

As a result, it's a bit customary to include the phrase "ebooks" in Twitter bots, even if they aren't trying to sell anything as @horse_ebooks was. Some examples are:

Given this, Ebooker takes as input any number of Twitter streams, and will create a bot that will imitate their union. The most obvious (and narcissistic) example is the new @SrPablo_ebooks, based on my Twitter stream:

But it gets more fun when you combine accounts. Ebooker powers @SrLaurelita, a combination of my girlfriend's (@laurelita) tweets and mine:

The nice thing about Ebooker is that it does the tweeting for you: I tell it who I want to mix and match, how often I want it to tweet, and it does the rest.

How does it work?

Ebooker bots, like most Twitter bots, work via a mechanism called Markov chains. Despite the length and jargon in the Wikipedia entry, it's really pretty simple and I'll do my best to explain it to the curious, whatever your background.

The high-level view could be expressed as: Ebooker wants to make text that sounds like you, so it uses only the words you use, and tries to use them in the order that you use them. To do this, we first have to learn about you and your words.

This indicates the program should go through two phases: a consumption phase, where it reads your writing and does some analysis on it, and a generation phase, where it creates new text by using what it learned.

How does the analysis happen? First it puts the words you use into a table, then it uses that table to see which words you tend to use after which. Here's an example of how it works: the previously-mentioned Karl Pichotta recently tweeted:

Ebooker would look at every word, and make a note of what follows it. For example, in the tweet above, it notes that every time Karl uses the word "school," the next word is "kids," and every instance of "whole" is followed by "sandwich" ("Don't do well in school, kids, or ... *whole sandwich falls out ...").

Now for the fun part... what happens for the word "of"? It comes up twice in this tweet, and gets followed by "life" and "beard." In our little table, we make a note of both "life" and "beard", and keep track of how many times we've seen them. The table might look like:

Word     Suffix, frequency
school   kids, 1
whole    sandwich, 1
...      ...
of       life, 1
         beard, 1
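
If you're curious what that looks like in code, here's a minimal sketch of the consumption phase in Go (the language Ebooker is written in). It isn't Ebooker's actual source, and the names are mine, but the idea is the same: a map from each word to a smaller map of suffix counts.

    package main

    import (
        "fmt"
        "strings"
    )

    // suffixTable maps each word to the words that have followed it,
    // with a count of how often each one appeared.
    type suffixTable map[string]map[string]int

    // consume is the consumption phase: walk every adjacent pair of
    // words in the text, and record "this word followed that one."
    func consume(table suffixTable, text string) {
        words := strings.Fields(text)
        for i := 0; i+1 < len(words); i++ {
            word, next := words[i], words[i+1]
            if table[word] == nil {
                table[word] = make(map[string]int)
            }
            table[word][next]++
        }
    }

    func main() {
        table := make(suffixTable)
        consume(table, "a life of beard and a bag of sandwiches")
        fmt.Println(table["of"]) // map[beard:1 sandwiches:1]
    }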

Now suppose we've done this for all of Karl's writing, ever (Ebooker uses all the tweets Twitter will give us, which is the last ~3,200). The table would be huge, and many common words would have many possibilities. The table might look like:

Word     Suffix, frequency
Oxford   University, 4
         College, 1
         Dictionary, 3
whole    sandwich, 1
         shebang, 4
         matter, 2
         void, 55
         metric, 8
...      (repeat for thousands of entries)
of       life, 1
         beard, 3
         mine, 7
         his, 4
         that, 9
         it, 22

A few things jump out of this table, but the one that matters for us is the frequencies: they record which suffixes Karl reaches for most.

From here, generating text is easy. We do so probabilistically: we let the words that show up most often have more weight than those that don't.

One way to imagine this is with darts: picture a dartboard with a huge "outside zone" and a tiny bullseye. If you're terrible at darts, such that your throws land on the board at random, chances are you'll hit the outside zone frequently and rarely hit the bullseye.

In this case, we'll start with any word in the table (let's say, "whole"), and the words that come after it form a dartboard: "void" will be the huge outside zone, because it's shown up 55 times, and "sandwich" will be the bullseye, which we almost never hit. We ask the computer for a random number, like throwing a dart randomly; wherever the number lands, we pick that as our next word. With the values above, the odds look like:

Suffix of "of"   Odds
sandwich 1/70
matter 2/70
shebang 4/70
metric 8/70
void 55/70
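
And here's how that dart throw might look in Go, using the "whole" row from the table above (again, my sketch rather than Ebooker's real code):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // throwDart picks a suffix at random, weighted by frequency: out
    // of 70 total sightings, "void" (seen 55 times) covers most of
    // the board, and "sandwich" (seen once) is the bullseye.
    func throwDart(suffixes map[string]int) string {
        total := 0
        for _, count := range suffixes {
            total += count
        }
        dart := rand.Intn(total) // where the dart lands: 0 to total-1
        for word, count := range suffixes {
            dart -= count
            if dart < 0 {
                return word
            }
        }
        return "" // unreachable as long as the map is non-empty
    }

    func main() {
        whole := map[string]int{
            "sandwich": 1, "matter": 2, "shebang": 4, "metric": 8, "void": 55,
        }
        fmt.Println(throwDart(whole)) // "void", about 55 times out of 70
    }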

Once we've selected our next word, we look that word up in the table, just as we did for "whole". We do this until we reach a word that isn't in the table, or (because this was made for Twitter) we run out of characters. When you're done, voilà! You've created a nonsensical (but not totally random) phrase in someone else's voice!
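
The whole generation phase, then, is just a short loop chaining those dart throws together. One more sketch, reusing throwDart and the suffixTable type from above:

    // generate starts from a word and repeatedly appends a weighted-
    // random suffix, stopping when it reaches a word with no known
    // suffixes or runs out of characters (140, for Twitter).
    func generate(table suffixTable, start string, maxChars int) string {
        result, word := start, start
        for {
            suffixes := table[word]
            if len(suffixes) == 0 {
                break // a word nothing has ever followed
            }
            next := throwDart(suffixes)
            if len(result)+1+len(next) > maxChars {
                break // no room left in the tweet
            }
            result += " " + next
            word = next
        }
        return result
    }

Seed it with a random word from the table, and out comes a brand-new "tweet."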

(someone did this with Garfield comics, by the way... check out Garkov)

Bots managed by Ebooker

Ebooker is currently whirring away, sending up tweets for the following bots:

@SrPablo_ebooks, built from my tweets alone.
@SrLaurelita, mixing my tweets with @laurelita's.

If there's a bot you'd like to see, let me know! I'm happy to get a few more up there!

Onwards!

I might write about this again, since getting the Markov chains working was only a day of work (and that's in a language I'd never seen before). The more technically juicy stuff, from my end anyways, was implementing OAuth 1.0 in Go, as well as a first exposure to goroutines. These might get treated in a future post, but I thought this would be a cute way to describe, to non-technical folk, the kinds of problems programmers solve.

Thanks for the read! Disagreed? Violent agreement!? Feel free to join my mailing list, drop me a line at , or leave a comment below! I'd love to hear from you 😄