I’ve been interested in NoSQL solutions for quite some time. A few years ago I remember a mentor telling me all about a database system that Google had implemented that would be great for healthcare systems and that it also shoots laser beams with godlike accuracy and vengeance. Needless to say…that sparked my interest.
Aside from self-improvement tutorials, I’ve never had a chance to implement a NoSQL solution, either because it was never the right tool for any of my jobs or because there was so much FUD around the NoSQL idea that I couldn’t get any buy-in from management. Finally, I’ve been lucky enough to be involved with a NoSQL proof-of-concept system, and our first wave of research is being done with Apache Cassandra.
Why Cassandra? Isn’t she just some totally hot rock babe that is awesome? Well, no…not entirely. IMHO, Cassandra is a rather exotic system. I don’t think I need to give you the fluff on what she is and isn’t, but I can list a few of the reasons why we chose to research the system:
The HA/DR capabilities when running Cassandra across data centers are hot. When I fantasize about Cassandra clustering…I think about losing a node here and there, and not caring (or at least not going into a code red panic)…schawing!
When I think of being able to use the Apache license…free, so we can drop all those licensing costs and hopefully trade them for hardware and maybe even more training so that our internal teams can gain knowledge and intelligence…schawing!!
For a company with unknown growth expectations, the ability to scale painlessly and quickly if needed…schawing!!!
And when I drift off, allowing myself to imagine being able to run everything on Linux, which in my experience has always been a much more stable, manageable, and sexier operating system…SCHAAAAWING!!!!
So first things first: I needed to whip up some Linux machines to test this. I set up 3x Ubuntu 14.04 VMs and 3x Debian 7.5 VMs. You need at least 3 of something, so which distro you use is personal preference; I’m an apt-get kind of guy, so I go with a Debian base. Installing most flavors of Linux is pretty much faceroll, and I’m not sure how it can be fucked up (although I’m sure someone out there will amaze me). I went with defaults on everything but the package selection, where I chose OpenSSH Server and Laptop Utils (because my VMs are always on a laptop). Once the systems were installed, I apt-getted the updates and upgrades. So far so good.
After the install I did a little bit o googling and found the articles I wanted to work with, but before approaching Cassandra, I had to install Java first.
For Ubuntu, the homies at webupd8 have got our backs. You will need to first install the add-apt-repository tool with:
sudo apt-get install software-properties-common python-software-properties
For Debian…oh LOOK webupd8 still gots our backs
So if you have been following along, after installing the OS and Oracle Java you are probably getting pretty comfortable with Linux (or at least I hope you’ve learned how to cut n paste into a terminal). Next, DigitalOcean has the goods for getting Cassandra installed, and this, my friends, is where we start getting Cassandra talking.
I started at the section titled “Installing Cassandra” and skipped the JVM configuration options (because I want to investigate them to see what they’re actually doing and why they’re needed).
At this point, I was feeling pretty good about the whole thing: no weird errors, and the entire process was rather quick. I did this on all 3 nodes and saw that it was good. So I continued on to the next step…configuring the cluster, for which, thankfully, DigitalOcean has another very easy-to-follow article full of win. There is no magic here; it’s really just as easy as next next next, but with a little more finesse. A couple of things that caused me a few minutes of WTF were not paying attention to the cluster_name and the seed/droplet IP addresses.
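Since cluster_name and the seed addresses are the settings that bit me, here is a rough sketch of a sanity check you could run on each node before starting Cassandra. The helper name show_cluster_settings is made up for illustration, and the config path in the usage comment assumes a tarball install under ~/cassandra (which matches where nodetool lives on my boxes); yours may be somewhere else.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the cassandra.yaml settings that have to line up
# across nodes before they will form a single cluster.
# Usage (assumed tarball layout): show_cluster_settings ~/cassandra/conf/cassandra.yaml
show_cluster_settings() {
  local yaml="$1"
  # cluster_name, listen_address and rpc_address are top-level keys.
  # seeds is nested under seed_provider, so allow leading spaces and a "- ".
  # Commented-out lines (starting with #) are skipped by both patterns.
  grep -E '^(cluster_name|listen_address|rpc_address):|^[[:space:]-]*seeds:' "$yaml"
}
```

Run it on all 3 nodes and eyeball the output: cluster_name and seeds should be identical everywhere, while listen_address should be each node’s own IP.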
After running through these articles I then executed ~/cassandra/bin/nodetool status and saw all 3 nodes, all UP, and I also saw this was good.
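If you end up doing this as many times as I have, eyeballing the status table gets old. Here is a minimal sketch that counts nodes from nodetool status output; count_nodes is a hypothetical helper name, and it assumes the usual layout where each node row starts with a two-letter code such as UN (Up/Normal) or DN (Down/Normal).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nodetool status` output on stdin and print
# "<nodes up> <nodes total>".
count_nodes() {
  # Node rows start with a status letter (U or D) followed by a state letter
  # (N, L, J or M). Header and separator lines never match that pattern.
  awk '$1 ~ /^[UD][NLJM]$/ { total++; if ($1 ~ /^U/) up++ }
       END { printf "%d %d\n", up, total }'
}

# Usage:
#   ~/cassandra/bin/nodetool status | count_nodes
```

On a healthy 3-node cluster the two numbers should match.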
Naturally, the next place I wanted to go was the cloud. I know lots of you are still cloud curious, even if you don’t admit it, but I’ve taken that trip, and I love both AWS and Azure.
Because the company I work for is gracious enough to provide me with an MSDN account, I gots fitty a month to waste on Azure, so I started there. First I created 3 Ubuntu VMs and followed the same routine as described above. Everything was good until I got to the clustering portion, where I discovered I was unable to get the VMs to talk to one another. This was totally Rob Lowe…and we all know, Rob Lowe is an asshole…except for “No Retreat, No Surrender” where Bruce Lee totally trained him to kick ass…oh wait never mind, that was Van Damme. (Actually I don’t know if Rob Lowe is an asshole IRL, but in Wayne’s World…he is)
So after doing some house cleaning (Azure tends to collect a lot of devjunk), I created a new cloud service, then created the VMs from the Gallery (not quick create). This allows you to stick em all up in the same cloud service, resulting in network communication, which is not Rob Lowe, but in fact awesome. After I got through this step, I was able to get the nodes in the cluster talking to each other, and I was also getting pretty good at turning Cassandra on.
Next, I kept my cloud curiousism going by moving to AWS (total slut), and this, to my delight, was a much easier process thanks to the DataStax Auto Clustering AMI. I followed this article by Jenny Kim and had a 3-node cluster running in minutes (pretty much as long as it takes to provision the machines).
At this point I wanted to seal the deal and connect with Cassandra at a more physical level, so I downloaded DataStax DevCenter and created connection strings to each of the clusters. The local VM cluster wasn’t much of an issue once I added a rule to the firewall using:
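When VMs refuse to talk to one another, a quick way to tell a network problem from a Cassandra problem is to probe the ports directly. The sketch below is not from any of the articles I followed; port_open is a hypothetical helper, the IPs in the usage comment are placeholders, and it assumes Cassandra’s default ports (7000 for inter-node traffic, 9042 for CQL clients) plus a bash build with /dev/tcp support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 2 seconds, fail otherwise.
port_open() {
  local host="$1" port="$2"
  # bash treats /dev/tcp/<host>/<port> as "open a TCP socket"; `timeout`
  # (coreutils) stops the attempt from hanging on silently dropped packets.
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage, run from one node against the others (placeholder IPs):
#   for node in 10.0.0.5 10.0.0.6; do
#     if port_open "$node" 7000; then echo "$node: 7000 reachable"
#     else echo "$node: 7000 blocked"; fi
#   done
```

If port 7000 is blocked between nodes, no amount of fiddling with cassandra.yaml will make them gossip.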
The Azure and AWS clusters…well that was a different story.
After some research I was able to use cqlsh to connect to the AWS DataStax cluster. However, when I changed the broadcast_address to the public IP address, I was able to connect, but nodetool status kept showing a weird situation where each node would show itself as UP in its own local run of nodetool but would show the others as down. So I hit up some guys at DataStax through Twitter and received a helpful response
— Joaquin Casares (@joaquincasares) May 12, 2014
So that is where I sit so far with Cassandra, and I have since gone through this process a number of times (I’m looking into automating this because of how many times I’ve done it).