Golden Nuggets at Berlin Buzzwords 2014

If you haven’t heard of Berlin Buzzwords, it’s the conference covering all the latest buzzwords surrounding the biggest buzzword of them all: Big Data. And boy is it buzzwordy! I’m doing my research in a field that somewhat overlaps with big data, but as a first-time attendee I was blown away by the sheer number of words buzzing around: ElasticSearch, Cassandra, YARN, Riak, Kibana, Storm, Mesos, Ansible, Docker, Puppet, Hadoop, Cascading … I could probably go on and fill a couple more paragraphs, but let’s not waste our precious screen real estate!

Just to make it clear before it sounds like it’s all just buzz: the conference was actually a very good experience, and the location (KulturBrauerei) was indeed awesome! My personal highlight was meeting so many people with very diverse backgrounds, working on all angles of big data. As I’m not so familiar with how big data is really seen outside of the academic environment, this was eye-opening for me! All the abstract knowledge of it being useful in the ad industry, banking, retail, etc. was finally filled with some meaningful examples. For instance, I didn’t know that every time a page is accessed there is real-time bidding for its ads on a huge ad exchange before the page actually loads, flooding ad services with thousands of requests per second. My knowledge here was pretty much limited to Google AdSense and nothing else.

The second highlight was the keynote, where Ralf Herbrich from Amazon Berlin looked back at how he turned research ideas into real products. The special treat here was the story about Microsoft’s TrueSkill online matchmaking system for Xbox Live. Herbrich and his colleagues used machine learning methods to pair up gamers playing at a similar skill level, so that matches are actually interesting and don’t end with one team dominating. Combining his personal hobby of gaming with his passion for research sure sounded like the thing for me!

There were also some minor gripes I had with BBuzz. I know that presentations re-inventing the wheel will never go away. At BBuzz, though, there were some speakers who really stretched my patience with talks that had glamorous titles but nothing to show for them. You get this at scientific conferences as well, but I’ve never experienced it at this kind of mind-boggling level. Luckily it was just one or two talks.

Another part of the conference I expected more from was the barcamp. It was my first one ever, so I don’t have a point of reference. To me, it felt like regular conference talks, just with less preparation on the speakers’ side. By contrast, five years ago I was at WikiSym, where we had a different style of unconference, which I liked much better. To briefly elaborate: it was held in a big hall, where everyone would go to any free spot on the wall and put up a post-it with a topic dear to her or his heart. Afterwards, people would roam around, and spontaneous discussion clusters would form. As confusing as it sounds, I had the feeling that many more interesting things emerged – maybe voting with your feet during the discussions helps maintain focus more than a barcamp-style schedule that is decided up front.

I guess I rambled on for long enough already, but I still want to share some real gold nuggets I found at Buzzwords:

  • T-Digest: if you need an estimate of the median (or any other quantile) over your data (stream), look no further. It’s fast, memory-efficient, and highly accurate even for skewed distributions. Very helpful if you need that kind of statistic for downstream tasks like anomaly detection in time series, which brings me to the second point (a short T-Digest code sketch follows after this list).
  • Deep Learning for Anomaly Detection: an anomaly is defined as a deviation from the “normal” data distribution, e.g. exceeding the 99.99th percentile (which can be neatly measured using T-Digest). However, defining “normal” is not so straightforward if your input signal or distribution is complex. One way to cope with this problem is deep learning, where the actual underlying (latent) structures can be learned. An anomaly is then a deviation from the learned model. If this was too abstract, I recommend watching Ted Dunning’s talk on YouTube.
  • Hadoop break-even point: a very interesting tidbit that is pretty obvious in hindsight, but easily overlooked. Hadoop was designed with Google scale in mind; however, most users do not have clusters or even problems of that scale. Jobs that run for less than 50,000 CPU hours are actually hurt by the fault-tolerance mechanism of checkpointing – hopefully I’m quoting the number correctly here. The reason is that hardware might be unreliable, but not THAT unreliable, and provisioning for failures in short jobs just kills performance without a real benefit. In case something really goes wrong, the job can still be restarted from scratch, with not much harm done.
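
For the T-Digest point, here is a minimal sketch of how Ted Dunning’s open-source t-digest library can be used from Java. The compression value of 100 and the log-normal test data are just illustrative choices, and the exact factory method name may differ between library versions:

import com.tdunning.math.stats.TDigest;
import java.util.Random;

public class TDigestSketch {
    public static void main(String[] args) {
        // Compression trades accuracy for memory; 100 is a commonly used default.
        TDigest digest = TDigest.createMergingDigest(100);

        // Feed a skewed stream of values, e.g. simulated response times.
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(Math.exp(rnd.nextGaussian()));
        }

        // Query quantiles without ever holding the full stream in memory.
        System.out.println("median ~ " + digest.quantile(0.5));
        System.out.println("p99.99 ~ " + digest.quantile(0.9999));
    }
}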

By the way, all three of these nuggets are thanks to Ted Dunning. Besides giving a great presentation and having a deep understanding of the field, explaining things at a level of detail that is truly insightful, he is also a very nice guy. Seeing him alone made it worth going to Berlin Buzzwords 2014!

Building Excitement for the Heidelberg Laureate Forum

I just read the latest blog post on the Heidelberg Laureate Forum, and I want to say that I very much share the feelings of Adrian Dudek, who wrote it. I’m also going to attend the Forum, where the greatest mathematicians and computer scientists of our time meet young researchers to share their knowledge and wisdom, and of course to inspire great future research.

This summer I have been to two other conferences, SIGMOD in New York and IJCAI in Beijing, both great conferences in the fields of databases and artificial intelligence. While I was of course happy to be there, my excitement about the Forum is far greater than it was about those two. I still cannot quite grasp that I’m going to meet so many Turing, Abel, and Fields laureates within just a few days!

The big question for me at the moment is what to do after the PhD, and which factors actually matter in deciding this question. The ever-looming question that I guess I share with a lot of my fellow students is whether one can do great research and still live a “normal” life, something that Adrian Dudek also asked in his post on the HLF blog. I guess it must be possible. I know of great researchers who did things that were not part of their field: Donald Knuth randomly sampling the Bible in his book 3:16 is just one example that comes to mind. Still, in times of deadlines (which seem to be coming up and being missed constantly), it is easy to forget this, and you start assuming that the way to go is to publish even more papers at even better conferences. On the other hand, I do not do this kind of work just because it “needs to be done”; I really enjoy doing it! This is exactly what makes striking a balance really tricky. Let’s see what my peers have to say about this, and of course how the Laureates managed to answer it.

Ah, and of course I’m also really looking forward to the more “programmatic” aspect of the Forum, where the greatest scientists of my field present and discuss their work, and share their new ideas. This still sounds quite unreal to me!

Striking a Balance between Dark and Readable

When coding for a long time in a black-on-white editor, I always get problems with unfocused and sleepy eyes. Going white-on-black helps keep my eyes focused; however, reading the code is not as convenient as black-on-white (see blog.tatham.oddie.com.au for possible explanations). An alternative is to go for black-on-light-gray, which seems to be a good balance for me. My eyes are less sleepy and I can still read code pretty well. To further reduce the glare in my Eclipse, I use a white-on-black theme for the non-editor parts. This is how it looks:

Eclipse with DarkJuno theme and a black-on-gray editor.

Create this look in your Eclipse by following these steps:

  1. Install the DarkJuno theme for a full Eclipse makeover (works with Eclipse 4).
  2. Install the Eclipse Color Theme plug-in.
  3. Download and unzip eclipse-white-on-gray (based on Coding Horror v2 with the background modified to be a darker gray and other minor color adjustments).
  4. Import the unzipped color theme into Eclipse.

KyotoCabinet on Mac

Looking around for a good and simple key-value store to use in my current project, friends recommended KyotoCabinet. A bit of further research showed that for our application (AIDA, source available), in which we do lots of random reads, KyotoCabinet outperforms all the other interesting solutions in that respect.

Installing KyotoCabinet

Getting KyotoCabinet on my Mac was straightforward. I’m using MacPorts, and a simple

port install kyotocabinet

did the trick.

However, AIDA is written in Java, which complicated things a bit. I got the latest Java driver, but configure was not finding the MacPorts KyotoCabinet installation (which makes sense and can easily be remedied), and it was also missing the JNI header. My solution was as follows:

tar xzvf kyotocabinet-java-1.24.tar.gz
cd kyotocabinet-java-1.24
CPPFLAGS="-I/System/Library/Frameworks/JavaVM.framework/Versions/A/Headers" ./configure --with-kc=/opt/local/

If you are using Homebrew or installed KyotoCabinet directly from source, don’t add the --with-kc=… parameter to configure.

Still, make did not find jni.h, as it was looking in the wrong places. There is probably a less hacky solution than this, but I simply appended the flag

-I/System/Library/Frameworks/JavaVM.framework/Versions/A/Headers

to the CPPFLAGS variable in the Makefile that configure created. Now make will work, and you can build and install as usual:

make
(sudo) make install

Integrating KyotoCabinet in Eclipse

I’m working in Eclipse, and the Java driver of KyotoCabinet needs the LD_LIBRARY_PATH set. Eclipse allows for a convenient solution here. In the project build path, add the kyotocabinet.jar (now located in /usr/local/lib). Next, click the triangle to the left of the newly added jar and set the “Native library location” to the same path, /usr/local/lib. This should do the trick: you can now create a kyotocabinet.DB object and work with it.

You can still run the Java application from the command line using the -D parameter:

java -cp .:kyotocabinet.jar -Djava.library.path=.:/usr/local/lib

Setting the Database Type

KyotoCabinet offers two different storage implementations that behave differently. One is a hash-backed storage, the other a B+ tree storage. For lots of random access the hash-based one is preferable; however, which one to choose really depends on the needs of your application. I wanted a hash, but the big question was how to tell the DB object to actually create a hash-based DB and not a tree-based one. After searching the Web unsuccessfully for a while, I finally found a blog post detailing this. It all depends on the file extension you use for the filename that is passed when opening the DB. Using .kch as the extension will create a hash-based DB, using .kct a tree-based one. You can even make KyotoCabinet run completely in-memory by using “+” or “-” as the filename, with minus creating a hash-based and plus creating a tree-based one.
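
To make this concrete, here is a minimal sketch along the lines of the official Java examples (method names and flags quoted from memory, so double-check against the KyotoCabinet documentation); the casket.kch filename is just an illustration:

import kyotocabinet.DB;

public class KyotoCabinetSketch {
    public static void main(String[] args) {
        DB db = new DB();
        // The filename decides the storage engine:
        //   casket.kch -> file hash DB, casket.kct -> file B+ tree DB,
        //   "-" -> in-memory hash DB,   "+" -> in-memory tree DB.
        if (!db.open("casket.kch", DB.OWRITER | DB.OCREATE)) {
            System.err.println("open error: " + db.error());
            return;
        }
        db.set("hello", "world");
        System.out.println(db.get("hello"));
        db.close();
    }
}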

Using KyotoCabinet

I won’t give examples from our concrete application, but my main intention for using KyotoCabinet is together with Google’s protobuf. Protocol buffers allow the specification of nested data structures which nicely serialize into a byte array. The serialized bytes can then be fed to KyotoCabinet. This should make for a really nice way to store more complex, nested data that does not play well with relational databases.
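
Just to sketch the general pattern (not our actual code): assuming a hypothetical protobuf-generated message class Person, storing and reading it could look roughly like this, using the byte-array variants of set and get:

import kyotocabinet.DB;

public class ProtobufKyotoSketch {
    public static void main(String[] args) throws Exception {
        DB db = new DB();
        if (!db.open("people.kch", DB.OWRITER | DB.OCREATE)) {
            throw new RuntimeException("open error: " + db.error());
        }

        // Person is a hypothetical protobuf-generated class, e.g. from:
        //   message Person { required string name = 1; repeated string email = 2; }
        Person alice = Person.newBuilder()
                .setName("Alice")
                .addEmail("alice@example.org")
                .build();

        // Serialize the nested message into a byte array and store it under a byte key.
        db.set("person:alice".getBytes("UTF-8"), alice.toByteArray());

        // Read the raw bytes back and parse them into a message again.
        byte[] raw = db.get("person:alice".getBytes("UTF-8"));
        Person restored = Person.parseFrom(raw);
        System.out.println(restored.getName());

        db.close();
    }
}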