The Storm framework is an open-source, distributed, fault-tolerant real-time processing system used by many companies, including Groupon, Twitter, Alibaba and Klout. It’s the Hadoop of real-time processing and it can be used for real-time analytics, online machine learning and parsing social media streams, just to name a few use cases. This post will help you get started with Storm via a sample Java project that you can run in any desktop environment.

A few months ago I had the pleasure of attending the Data + Visualization Toronto Meetup‘s hands-on session on Storm, where we built a sample Storm topology that parsed Twitter’s sample feed and printed a list of the top tweets, ranked by retweet count, every 5 minutes.

It’s pretty amazing that, starting from scratch, you can get a basic Storm topology up and running on a local box in a few minutes’ time. Check out this slightly modified version of the above-mentioned sample project: https://github.com/davidkiss/storm-twitter-word-count. It parses the same Twitter sample feed using twitter4j, collects all the words in the tweets and every 10 seconds prints out the words with the highest counts.

While in Hadoop you have Map-Reduce jobs, in Storm you have two types of nodes: Spouts and Bolts. These nodes can receive tuples of data, process them and emit them to other nodes, so they can be chained together.

A Storm topology

(source: https://github.com/nathanmarz/storm)
 

Spout

Spouts are nodes that produce data to be processed by other nodes. They can read data from HTTP streams, databases, files, message queues, etc. Here’s the code for the Spout that parses a Twitter feed using Twitter4J: https://github.com/davidkiss/storm-twitter-word-count/blob/master/src/main/java/com/kaviddiss/storm/TwitterSampleSpout.java
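
To give an idea of what a spout looks like in code, here’s a minimal, illustrative sketch rather than a copy of the linked TwitterSampleSpout: the class name, the queue-based wiring and the "tweet" field name are assumptions, and on older Storm releases the imports live under backtype.storm instead of org.apache.storm.

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

public class SampleTweetSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // In the real spout this queue would be filled by a twitter4j StatusListener
    private LinkedBlockingQueue<String> queue;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedBlockingQueue<>(1000);
        // ... start the twitter4j stream here and push incoming tweet texts onto the queue
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tweet per call if one is available
        String tweet = queue.poll();
        if (tweet != null) {
            collector.emit(new Values(tweet));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Each emitted tuple has a single field named "tweet"
        declarer.declare(new Fields("tweet"));
    }
}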

Bolt

Bolts can both receive and produce data in the Storm cluster. Here’s the code for a Bolt that receives tweets and emits each of their words: https://github.com/davidkiss/storm-twitter-word-count/blob/master/src/main/java/com/kaviddiss/storm/WordSplitterBolt.java
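
Here’s a matching, illustrative bolt sketch that splits incoming tweets into words. Again, this is a simplified stand-in rather than the linked WordSplitterBolt, and it assumes the "tweet" field name from the spout sketch above.

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;

public class SplitWordsBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Take the tweet text emitted by the spout and emit one tuple per word
        String tweet = input.getStringByField("tweet");
        for (String word : tweet.split("\\s+")) {
            if (!word.isEmpty()) {
                collector.emit(new Values(word.toLowerCase()));
            }
        }
        // Acknowledge the input tuple so Storm knows it was fully processed
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}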

Topology

A topology is an object that configures what the processing pipeline will look like: which Spouts and Bolts it has and how they are chained together.

Here’s the code for the topology used in the sample: https://github.com/davidkiss/storm-twitter-word-count/blob/master/src/main/java/com/kaviddiss/storm/Topology.java.
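
As a rough sketch of how the pieces get wired together, here’s a minimal topology class that connects the spout and bolt sketches above and runs them in a LocalCluster. The component ids, parallelism hints and run time are illustrative assumptions, not the exact contents of the linked Topology class.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Wire the spout and bolt together; a counting/ranking bolt would also
        // subscribe to "word-splitter", e.g. with fieldsGrouping on the "word" field
        builder.setSpout("tweet-spout", new SampleTweetSpout());
        builder.setBolt("word-splitter", new SplitWordsBolt(), 4)
               .shuffleGrouping("tweet-spout");

        Config conf = new Config();
        conf.setDebug(false);

        // Run in-process for local development (no Storm cluster required)
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());

        Thread.sleep(60_000); // let the topology run for a minute, then shut down
        cluster.shutdown();
    }
}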

Leave a comment below if you’re considering giving Storm a try or already using it.

—-

Update

Since I got many comments about running this from the storm command line, I’ve updated the code to create a fat jar that can be run using this command:

storm jar storm-twitter-word-count-jar-with-dependencies.jar com.kaviddiss.storm.Topology
  • Joey Taleno

    Hi David,

    Great blog article about Storm!

    I’m trying to run your storm-twitter-word-count example but I’m getting this error.

    HTTP ERROR: 401
    Problem accessing ‘/1/statuses/sample.json’. Reason:
    Unauthorized

    Hope you can help me solve this problem. Thanks!

    Regards,
    Joey+

    • David

      Hi Joey,

      Thanks for pointing this out. To get the example working again, I had to upgrade the version of twitter4j library the project was using.
      Please note that instead of providing the twitter username and password as command line args, you’ll need to get your Twitter OAuth credentials from here and configure them as System properties for twitter4j in the example.
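
      For example, the credentials can be set as twitter4j system properties along these lines (an illustrative snippet with placeholder values, to run before the stream is opened; the property names come from twitter4j's system-property configuration):

      // Illustrative placeholders: substitute the OAuth values from your own Twitter app
      System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY");
      System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET");
      System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN");
      System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET");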

      Let me know how it goes for you,

      Best,
      David

  • Joey

    Hi David,

    Thanks for your help! It worked! More power to you. Looking forward for more Storm adventures!

    Regards,
    Joey+

    • David

      Hi Joey,

      Great to hear it worked for you and best of luck with your Storm adventures!

      David

  • Neat! 😉

  • Manish

    How can I get access to the https://github.com/ links? I am getting an error whenever I try to access any link under https://github.com
    https://github.com/davidkiss/storm-twitter-word-count

  • Pingback: Processing Twitter feed using Spring Boot | Building scalable enterprise applications

  • prasad

    Unable to read and emit tuples quickly?

    I built a sample example that reads a single file and emits tuples to a bolt, and I’m measuring the tuple processing time like this:

    While reading from the file I add the current time in milliseconds to each tuple; when the tuple reaches the bolt I take the difference between the current time and the tuple’s timestamp, and I do the same when the spout’s ack method is called.

    The topology isn’t fast even with 10 executors each and 4 workers.

    It takes 127 seconds for a tuple to get from the spout to the bolt, and another 40+ seconds to reach the ack.

    Please help me figure out which configuration I forgot.

  • Gurmukh Singh

    How do I run this in Storm? Can someone give me the complete command to run this in Storm?

    • David Kiss

      Run the Topology class as a Java application, including the project’s classpath, and make sure to pass in your Twitter credentials (see http://twitter4j.org/en/configuration.html#systempropertyconfiguration).

      • Rob

        Hi there, thanks for posting, wondering if you can help.
        I cloned your project, ran ‘mvn package’ and it created the one jar file. I attempted to run it with storm, passing the -Dtwitter args as below:
        “storm jar storm-twitter-word-count-0.1.jar com.kaviddiss.storm.Topology -Dtwitter4j.oauth key = xxxxx and so on”, but unfortunately I ran into an error: Exception in thread “main” java.lang.NoClassDefFoundError: twitter4j/StatusListener. I am new to this and 100% certain I have it wrong, but not sure where? Any advice much appreciated. Thx, Rob.

        • David Kiss

          Hi Rob, you’ll also need to add the dependency jars to the classpath by creating a fat jar with Maven that includes the dependencies (for details see here: http://www.mkyong.com/maven/create-a-fat-jar-file-maven-assembly-plugin/). In that case, in pom.xml you can set the scope of the storm dependency to provided, so that it will not be included in the fat jar (as it’s provided by the storm command line).

          • Rob

            Hi David – thx for the reply, much appreciated. Yeah, I thought I had created a fat jar the first time around with mvn package pointing at the pom.xml. So I should rebuild the package with mvn assembly after updating the pom, then redeploy to my VM instance and invoke it with storm and the args like before? Or try mvn exec and run it locally? Sorry for the questions, I’m just trying to get this running within my VM to get storm going. BTW the mkyong url link appears to be blank for me? Cheers, Rob.

          • David Kiss

            Hi Rob – sorry I didn’t get back to you earlier, I’m in the middle of moving to my new place. Thanks for the note, the mkyong link should work now.

            As for running the sample application, you have both options available (#1 invoke storm with the fat jar, or #2 run it using mvn exec); either should be fine.

  • kartik sathyanarayanan

    Hi, could you help with how to run this on Heron?

  • vdutt

    Hi David,

    Thanks for the great explanation and code.
    Can we persist the word count in HDFS?

    Thanks
    Vishal