====== Storm Installation Guide ======

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Here's a summary of the steps for setting up a Storm cluster:

  - Set up a Zookeeper cluster
  - Install dependencies on Nimbus and worker machines
  - Download and extract a Storm release to Nimbus and worker machines
  - Fill in mandatory configurations into storm.yaml
  - Launch daemons under supervision using the "storm" script and a supervisor of your choice

===== Set up a Zookeeper cluster =====

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load Storm places on Zookeeper is quite low. A single-node Zookeeper cluster should be sufficient for most cases, but if you want fail-over or are deploying a large Storm cluster, you may want a larger Zookeeper cluster.

Setting up a ZooKeeper server in standalone mode is straightforward. The server is contained in a single JAR file, so installation consists of creating a configuration. Once you've downloaded a stable ZooKeeper release, unpack it and cd to the root.

To start ZooKeeper you need a configuration file. Here is a sample; create it in conf/zoo.cfg:

<code>
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
</code>

This file can be called anything, but for the sake of this discussion call it conf/zoo.cfg. Change the value of dataDir to specify an existing (empty to start with) directory. Here are the meanings of each of the fields:

  * **tickTime** the basic time unit in milliseconds used by ZooKeeper. It is used for heartbeats, and the minimum session timeout will be twice the tickTime.
  * **dataDir** the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.
  * **clientPort** the port to listen on for client connections.

Now that you have created the configuration file, you can start ZooKeeper:

<code>bin/zkServer.sh start</code>

The steps outlined here run ZooKeeper in standalone mode. There is no replication, so if the ZooKeeper process fails, the service will go down. This is fine for most development situations, but to run ZooKeeper in replicated mode, please see [[http://zookeeper.apache.org/doc/r3.3.3/zookeeperStarted.html#sc_RunningReplicatedZooKeeper | Running Replicated ZooKeeper]].

===== Install dependencies on Nimbus and worker machines =====

Next you need to install Storm's dependencies on Nimbus and the worker machines. These are:

  - **ZeroMQ 2.1.7** Note that you should not install version 2.1.10, as that version has some serious bugs that can cause strange issues for a Storm cluster. In some rare cases, users have reported an "IllegalArgumentException" bubbling up from the ZeroMQ code when using 2.1.7; in these cases, downgrading to 2.1.4 fixed the problem.
  - **JZMQ**
  - **Java 6**
  - **Python 2.6.6**
  - **unzip**

These are the versions of the dependencies that have been tested with Storm. Storm may or may not work with different versions of Java and/or Python.

These dependencies have dependencies of their own. On the RedHat family of Linux, those can be installed with the following command:

<code>
sudo yum install libuuid-devel gcc gcc-c++ pkgconfig libtool autoconf automake
</code>
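If your machines run the Debian family of Linux instead, the equivalent build dependencies can typically be installed with apt-get. The package names below are the usual Debian counterparts of the RedHat ones above; verify them against your distribution before relying on this sketch:

<code>
sudo apt-get install uuid-dev gcc g++ pkg-config libtool autoconf automake
</code>

==== ZeroMQ ====

Storm has been tested with ZeroMQ 2.1.7, and this is the recommended ZeroMQ release to install. You can download a ZeroMQ release [[http://download.zeromq.org/ | here]].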
Installing ZeroMQ should look something like this:

<code>
wget http://download.zeromq.org/zeromq-2.1.7.tar.gz
tar -xzf zeromq-2.1.7.tar.gz
cd zeromq-2.1.7
./configure
make
sudo make install
</code>

==== JZMQ ====

JZMQ is the Java binding for ZeroMQ. JZMQ does not publish releases, so installing from the master branch always carries a risk of regressions. To avoid that, you should instead install from [[http://github.com/nathanmarz/jzmq | this fork]], which has been tested to work with Storm. Installing JZMQ should look something like this:

<code>
# install jzmq
git clone https://github.com/nathanmarz/jzmq.git
cd jzmq
./autogen.sh
./configure
make
sudo make install
</code>

===== Download and extract a Storm release to Nimbus and worker machines =====

Next, download a Storm release and extract the zip file somewhere on Nimbus and each of the worker machines. The Storm releases can be downloaded from [[http://github.com/nathanmarz/storm/downloads | here]].

===== Fill in mandatory configurations into storm.yaml =====

The Storm release contains a file at conf/storm.yaml that configures the Storm daemons. You can see the default configuration values [[https://github.com/nathanmarz/storm/blob/master/conf/defaults.yaml | here]]. storm.yaml overrides anything in defaults.yaml. There are a few configurations that are mandatory to get a working cluster:

1) **storm.zookeeper.servers:** This is a list of the hosts in the Zookeeper cluster for your Storm cluster. It should look something like:

<code>
storm.zookeeper.servers:
  - "111.222.333.444"
  - "555.666.777.888"
</code>

If the port that your Zookeeper cluster uses is different from the default, you should set **storm.zookeeper.port** as well.

2) **storm.local.dir:** The Nimbus and Supervisor daemons require a directory on the local disk to store small amounts of state (like jars, confs, and things like that). You should create that directory on each machine, give it proper permissions, and then fill in the directory location using this config. For example:

<code>
storm.local.dir: "/mnt/storm"
</code>

3) **java.library.path:** This is the load path for the native libraries that Storm uses (ZeroMQ and JZMQ). The default of "/usr/local/lib:/opt/local/lib:/usr/lib" should be fine for most installations, so you probably don't need to set this config.

4) **nimbus.host:** The worker nodes need to know which machine is the master in order to download topology jars and confs. For example:

<code>
nimbus.host: "111.222.333.44"
</code>

5) **supervisor.slots.ports:** For each worker machine, this config determines how many workers run on that machine. Each worker uses a single port for receiving messages, and this setting defines which ports are open for use. If you define five ports here, Storm will allocate up to five workers to run on this machine; if you define three ports, Storm will run up to three. By default, this setting is configured to run four workers on ports 6700, 6701, 6702, and 6703. For example:

<code>
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
</code>
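Putting the mandatory settings together, a minimal conf/storm.yaml might look like the following sketch. The IP addresses and the local directory are the placeholder values from the examples above; replace them with your own hosts and paths:

<code>
storm.zookeeper.servers:
  - "111.222.333.444"
  - "555.666.777.888"
nimbus.host: "111.222.333.44"
storm.local.dir: "/mnt/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
</code>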
===== Launch daemons under supervision using the "storm" script and a supervisor of your choice =====

The last step is to launch all the Storm daemons. It is critical that you run each of these daemons under supervision. Storm is a fail-fast system, which means the processes will halt whenever an unexpected error is encountered. Storm is designed so that it can safely halt at any point and recover correctly when the process is restarted. This is why Storm keeps no state in-process -- if Nimbus or the Supervisors restart, the running topologies are unaffected. Here's how to run the Storm daemons:

  - **Nimbus:** Run the command "bin/storm nimbus" under supervision on the master machine.
  - **Supervisor:** Run the command "bin/storm supervisor" under supervision on each worker machine. The supervisor daemon is responsible for starting and stopping worker processes on that machine.
  - **UI:** Run the Storm UI (a site you can access from the browser that gives diagnostics on the cluster and topologies) by running the command "bin/storm ui" under supervision. The UI can be accessed by navigating your web browser to http://{nimbus host}:8080.

The daemons will log to the logs/ directory under wherever you extracted the Storm release.
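As a concrete illustration of running a daemon "under supervision", here is a minimal [[http://supervisord.org/ | supervisord]] program stanza for Nimbus. This is only a sketch: the /opt/storm install path, the storm user, and the log locations are assumptions, and the supervisor and UI daemons would get analogous stanzas on their respective machines:

<code>
; Minimal supervisord stanza for the Nimbus daemon.
; /opt/storm, the "storm" user, and the log paths are assumptions;
; adjust them to match where you extracted the Storm release.
[program:storm-nimbus]
command=/opt/storm/bin/storm nimbus
directory=/opt/storm
user=storm
autostart=true
; Storm daemons are fail-fast, so have supervisord restart them whenever they halt.
autorestart=true
stdout_logfile=/var/log/storm/nimbus.out.log
stderr_logfile=/var/log/storm/nimbus.err.log
</code>

Any process supervisor (daemontools, monit, and so on) works just as well; the key point is that each daemon is restarted automatically whenever it halts.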