loving yourself with EBOOKS
Monday, October 22, 2012 :: Tagged under: pablolife, works, engineering. ⏰ 6 minutes.
Hey! Thanks for reading! Just a reminder that I wrote this some years ago, and may have much more complicated feelings about this topic than I did when I wrote it. Happy to elaborate, feel free to reach out to me! 😄
While I'm still ramping up at the new gig, I've found a love of code again and recently launched a new side project. I may yet work on it more, but chances are I'll leave it be for a bit while I pursue other interests (hey, it works well as-is, and I'm not shipping it :-p).
It's called Ebooker, and if you'd like to have a look at the source, fire me an email; I have it hosted in a private repo on Bitbucket.
[Update! I've put it on GitHub.]
"Ebooker"?
Ebooker takes its name from @horse_ebooks, one of the most famous Twitter accounts, which (legend has it) was a spam bot that forgot to sell its wares, only remembering to try to sound human to avoid spam detection. As a result, you get a sort of abstract poetry:
Thigh exercises so strongly targeted, they ll leave you crawling on floor, waving
— Horse ebooks (@Horse\_ebooks) April 20, 2012
Worms – oh my god WORMS
— Horse ebooks (@Horse\_ebooks) January 25, 2012
Crying is great exercise
— Horse ebooks (@Horse\_ebooks) September 14, 2012
As a result, it's become a bit customary to include "ebooks" in the handles of Twitter bots, even when they aren't trying to sell anything the way @horse_ebooks was. Some examples are:
- @bogost_ebooks, a bot imitating the game designer Ian Bogost.
- @kpich_ebooks, a bot imitating the esteemed Karl Pichotta.
- @RandomTedTalks, imitating the titles of TED Talks.
Given this, Ebooker takes as input any number of Twitter streams and creates a bot that imitates their union. The most obvious (and narcissistic) example is the new @SrPablo_ebooks, based on my Twitter stream:
I thought this was hell, but I'm afraid folk looking over my shoulder will think it's cool technology).
— olbaP reieM (@SrPablo\_ebooks) October 12, 2012
Natalie Imbruglia at the gate.
— olbaP reieM (@SrPablo\_ebooks) October 18, 2012
Graduated. Now goin' back to Providence. My brother Robert is a surprising beast
— olbaP reieM (@SrPablo\_ebooks) October 22, 2012
But it gets more fun when you combine accounts. Ebooker powers @SrLaurelita, a combination of my tweets and my girlfriend's (@laurelita):
Note to self: don't forget the Flash Player security fix we just released! CONTAIN YOUR EXCITEMENT!!!
— Laurblo Jeierson (@SrLaurelita) October 20, 2012
Also, best addition to my gut and learned the names of the guys here are gay, or just damn stylish. Either way, let me tell you MY opinion
— Laurblo Jeierson (@SrLaurelita) October 17, 2012
Nobody got it? That's so hipster.
— Laurblo Jeierson (@SrLaurelita) October 17, 2012
The nice thing about Ebooker is that it does the tweeting for you: I tell it who I want to mix and match, how often I want it to tweet, and it does the rest.
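I won't pretend this is Ebooker's actual configuration format (the source has the real thing), but conceptually every bot boils down to three inputs. A hypothetical sketch in Go:

```go
package ebooker // illustrative package name

import "time"

// BotConfig is a hypothetical shape for a bot definition; Ebooker's
// real configuration lives in its source and may look nothing like this.
type BotConfig struct {
	Sources  []string      // accounts to mix and match, e.g. SrPablo and laurelita
	Target   string        // the account the bot tweets as, e.g. SrLaurelita
	Interval time.Duration // how often the bot should tweet
}
```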
How does it work?
Ebooker bots, like most Twitter bots, work via a mechanism called Markov chains. Despite the length and jargon in the Wikipedia entry, it's really pretty simple and I'll do my best to explain it to the curious, whatever your background.
The high-level view could be expressed as: Ebooker wants to make text that sounds like you, so it uses only the words you use, and tries to use them in the order that you use them. To do this, we first have to learn about you and your words.
This indicates the program should go through two phases: a consumption phase, where it reads your writing and does some analysis on it, and a generation phase, where it creates new text by using what it learned.
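In code, you could picture those two phases as a pair of methods on one type. A minimal sketch in Go (Ebooker's language), with names of my own choosing rather than whatever the actual source uses:

```go
package ebooker // illustrative package name

// A Generator learns from someone's writing, then produces new text.
// (These names are mine; the real Ebooker source may be organized
// differently.)
type Generator struct {
	// For each word, the words that have followed it, and how often.
	suffixes map[string]map[string]int
}

// Consume is the consumption phase: read the writing, analyze it.
func (g *Generator) Consume(text string) {
	// ...fill g.suffixes; see the table-building sketch below...
}

// Generate is the generation phase: create new text from what we learned.
func (g *Generator) Generate(charLimit int) string {
	// ...walk the table probabilistically; see the later sketches...
	return ""
}
```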
How does the analysis happen? First the program puts every word you use into a table, recording which words follow it and how often; then it uses that table to see which sequences you use most. Here's an example of how it works: the previously-mentioned Karl Pichotta recently tweeted:
Don't do well in school, kids, or you could end up like me. Is that what you want out of life? \*whole sandwich falls out of beard\*
— Karl Pichotta (@kpich) October 23, 2012
Ebooker would look at every word, and make a note of what follows it. For example, in the tweet above, it notes that every time Karl uses the word "school," the next word is "kids," and every instance of "whole" is followed by "sandwich" ("Don't do well in school, kids, or ... *whole sandwich falls out ...").
Now for the fun part... what happens for the word "of"? It comes up twice in this tweet, and gets followed by "life" and "beard." In our little table, we make a note of both "life" and "beard", and keep track of how many times we've seen them. The table might look like:
| Word | | Suffixes, with frequencies |
| --- | --- | --- |
| school | → | kids, 1 |
| whole | → | sandwich, 1 |
| ... | ... | ... |
| of | → | life, 1; beard, 1 |
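Here's a minimal Go sketch of that consumption step. `buildTable` is my name for the idea, not necessarily Ebooker's, and a real version would also normalize case and punctuation:

```go
package main

import (
	"fmt"
	"strings"
)

// buildTable is the consumption phase: for every word in the input,
// record which words follow it and how often.
func buildTable(text string) map[string]map[string]int {
	table := make(map[string]map[string]int)
	words := strings.Fields(text)
	for i := 0; i < len(words)-1; i++ {
		word, next := words[i], words[i+1]
		if table[word] == nil {
			table[word] = make(map[string]int)
		}
		table[word][next]++
	}
	return table
}

func main() {
	tweet := "Don't do well in school, kids, or you could end up like " +
		"me. Is that what you want out of life? *whole sandwich falls " +
		"out of beard*"
	fmt.Println(buildTable(tweet)["of"]) // map[beard*:1 life?:1]
}
```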
Now suppose we've done this for all of Karl's writing, ever (Ebooker uses all the tweets Twitter will give us, which is the last ~3200). The table would be huge, and many common words would have many possibilities. It might look like:
| Word | | Suffixes, with frequencies |
| --- | --- | --- |
| Oxford | → | University, 4; College, 1; Dictionary, 3 |
| whole | → | sandwich, 1; shebang, 4; matter, 2; void, 55; metric, 8 |
| ... | ... | ...repeat for thousands of entries... |
| of | → | life, 1; beard, 3; mine, 7; his, 4; that, 9; it, 22 |
Some things about this table:
- It only has words in it that Karl has used. If we made up sentences from these words, they would always be composed of words that have come straight from Karl's proverbial mouth.
- By keeping track of the number of times each word follows the previous one, we have an idea of what he likes to talk about by seeing how often he talks about it. Using the table above, we see he did say "whole sandwich" once, but he's said "whole void" 55 times. Karl loves talking about the void!
From here, generating text is easy. We do so probabilistically: we let the words that show up most often have more weight than those that don't.
One way to imagine this is with darts: picture a dartboard with a huge "outside zone" and a tiny bullseye. If you're so terrible at throwing darts that they land on the board at random, chances are they'll hit the outside zone frequently and rarely hit the bullseye.
In this case, we'll start with any word in the table (let's say, "of") and the words that come after it form a dartboard. "void" will be the huge outside zone because it's shown up 55 times, and "sandwich" will be the bullseye, which we almost never hit. We ask the computer for a random number, like throwing a dart randomly. Wherever the number lands, we pick that as our next word. With the values above, the odds look like:
Suffix of "of" | Odds | |
sandwich | → | 1/70 |
matter | → | 2/70 |
shebang | → | 4/70 |
metric | → | 8/70 |
void | → | 55/70 |
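Here's a minimal Go sketch of that dart throw, using the counts from the table above. Again, `pickNext` is my name for it; Ebooker's version may differ:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickNext throws the dart: choose one suffix at random, where each
// suffix's share of the board is proportional to its count.
func pickNext(suffixes map[string]int, rng *rand.Rand) string {
	total := 0
	for _, count := range suffixes {
		total += count // 70 for the "of" row above
	}
	throw := rng.Intn(total) // a random spot on the board: 0..total-1
	for word, count := range suffixes {
		throw -= count
		if throw < 0 {
			return word // the dart landed in this word's zone
		}
	}
	return "" // unreachable when suffixes is non-empty
}

func main() {
	ofSuffixes := map[string]int{
		"sandwich": 1, "matter": 2, "shebang": 4, "metric": 8, "void": 55,
	}
	rng := rand.New(rand.NewSource(1))
	fmt.Println(pickNext(ofSuffixes, rng)) // usually "void" (55 of 70 throws)
}
```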
Once we've selected our next word, we look it up in the table, just as we did for "of". We do this until we reach a word that isn't in the table, or (because this was made for Twitter) we run out of characters. When you're done, voilà! You've created a nonsensical (but not totally random) phrase in someone else's voice!
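Put together with the earlier sketches, the whole generation phase fits in one short loop. Same caveat as before: these are my names, and the real Ebooker surely handles more edge cases:

```go
// generate walks the table: starting from a seed word, repeatedly throw
// the dart (pickNext, from the sketch above) to choose the next word.
// Stop when a word has no known suffixes, or when the next word wouldn't
// fit in the character limit. (This continues the file from the pickNext
// sketch, so it can reuse math/rand.)
func generate(table map[string]map[string]int, seed string, charLimit int, rng *rand.Rand) string {
	result, word := seed, seed
	for {
		suffixes := table[word]
		if len(suffixes) == 0 {
			break // a word nothing has ever followed: end of the line
		}
		next := pickNext(suffixes, rng)
		if len(result)+1+len(next) > charLimit {
			break // this was made for Twitter: we're out of characters
		}
		result += " " + next
		word = next
	}
	return result
}
```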
(someone did this with Garfield comics, by the way... check out Garkov)
Bots managed by Ebooker
Ebooker is currently whirring away, sending up tweets for the following bots:
- @SrPablo_ebooks, seeded from my stream, @SrPablo.
- @laurelita_ebook, from my girlfriend's Twitter, @laurelita.
- @SrLaurelita, from @SrPablo and @laurelita.
- @love_that_lita, from @laurelita and @love_that_goku.
If there's a bot you'd like to see, let me know! I'm happy to get a few more up there!
Onwards!
I might write about this again, since getting the Markov chains working took only a day (and that in a language I'd never seen before). The more technically juicy stuff, from my end anyway, was implementing OAuth 1.0 in Go, as well as a first exposure to goroutines. Those might get treated in a future post, but I thought this would be a cute way to describe, to non-technical folk, the kinds of problems programmers solve.
Thanks for the read! Disagreed? Violent agreement!? Feel free to join my mailing list, drop me a line at , or leave a comment below! I'd love to hear from you 😄