Playing with Stack Overflow data

I used Stack Overflow data to gauge the interest in some of the products I follow (yes, HBase, Spark and others). Interest is calculated for each month of the last 5 years and is based on the number of posts and replies associated with a tag (e.g. hdfs, elasticsearch and so on). Remember that Stack Overflow is a (huge) developer community with questions about programming, so the results are inherently biased. Because most posts are about products aimed at developers, specialized or closed-source solutions such as analytics platforms or MPP databases are not well represented, since their vendors run their own forums and websites for their communities. Despite these caveats, we can observe some trends. Have a look:

Just a few impressions about this graph:
- 2016 has been a very good year for most of the products related to Hadoop (HBase, Hive, Impala, Phoenix, HDFS) and, more generally, for products from the big data world (Cassandra, Druid, SparkSQL).
- Tags for time-series products become more common from mid-2015 onward. InfluxDB and Prometheus are the most used tags in this category.
- Posts about Elasticsearch started slowly in mid-2014 but really exploded in 2016. Solr, by contrast, starts losing posts in mid-2014 but seems to regain them from the beginning of 2015. Why? It's simple: Solr is the de facto search engine shipped by the two main Hadoop vendors, Hortonworks and Cloudera. Instead of deploying another tool and paying for another support contract or license (let's name it: Elasticsearch), many clients prefer to stay within the Hadoop stack provided by these vendors. The same behavior can be seen for OpenTSDB and Pivotal HAWQ, which are shipped by Hortonworks and have been deployable with Ambari for a few months now.
- Since 2015, the MongoDB tag has been used more and more on Stack Overflow. CouchDB keeps a steady number of posts over the years. Other document stores have been tagged several times in recent months: RethinkDB, Couchbase and OrientDB.
- Since mid-2013, Memcached has slowly declined while Redis has exploded. Not really a big surprise: a recent survey published by Stack Overflow presents Redis as one of the most loved databases, along with MongoDB.
- For the third year in a row, Neo4j is the most tagged graph solution on Stack Overflow. TitanDB may be more complex to deploy than Neo4j (it needs Cassandra or HBase as a backend). Despite the huge hype around Spark, the GraphX tag is not used much on SO.
- The Hazelcast tag overtakes EhCache for the first time in five years. Apache Ignite has shot up to become the most commonly tagged technology in the data grid category.
- HDFS is the most tagged DFS technology, far ahead of alternatives such as Lustre or Disco DDFS.
- The Teradata tag overtakes the other NewSQL products for the first time in five years. Firebird takes second place.
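
For anyone who wants to reproduce this kind of measure, here is a minimal Python sketch of the monthly count described at the top of the post: the number of posts and replies associated with a tag, per month. It is not the exact script behind the graph. The record layout (`id`, `created`, `tags`, `parent_id`) and the function name `monthly_tag_activity` are assumptions made for illustration, and the sketch assumes that a reply counts toward the tags of the question it answers.

```python
from collections import Counter
from datetime import date


def monthly_tag_activity(questions, answers):
    """Count questions and answers per (tag, 'YYYY-MM') pair.

    Hypothetical record layout, chosen for this sketch only:
      question: {"id": int, "created": date, "tags": [str, ...]}
      answer:   {"parent_id": int, "created": date}
    """
    # Answers carry no tags here, so look them up through the parent question.
    tags_by_question = {q["id"]: q["tags"] for q in questions}
    counts = Counter()

    for q in questions:
        month = q["created"].strftime("%Y-%m")
        for tag in q["tags"]:
            counts[(tag, month)] += 1

    for a in answers:
        month = a["created"].strftime("%Y-%m")
        # Assumption: a reply is credited to every tag of its question,
        # in the month the reply itself was written.
        for tag in tags_by_question.get(a["parent_id"], ()):
            counts[(tag, month)] += 1

    return counts


# Tiny made-up sample, only to show the shape of the input and output.
questions = [
    {"id": 1, "created": date(2016, 3, 2), "tags": ["hbase", "hdfs"]},
    {"id": 2, "created": date(2016, 3, 9), "tags": ["elasticsearch"]},
]
answers = [
    {"parent_id": 1, "created": date(2016, 3, 3)},
    {"parent_id": 1, "created": date(2016, 4, 1)},
]

activity = monthly_tag_activity(questions, answers)
for (tag, month), n in sorted(activity.items()):
    print(tag, month, n)
```

Keying the counter on `(tag, month)` keeps the result easy to pivot into one time series per tag, which is the shape a graph like the one above needs.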

Credits:
"Some cheerful data" by dirkcuys is licensed under CC BY-SA 2.0 / Resized
