#

09 Jul: Back in time: unreliable clocks and distributed computing

Many scalable NoSQL databases, such as Cassandra, HBase and MongoDB, provide tunable consistency so that you can choose the level of guarantees for each operation. But what makes them scalable also makes them vulnerable: in all cases, the whole cluster must run on synchronized clocks. Given how important this is, it is quite surprising how little detail the product documentation gives: one chapter in the HBase documentation, a paragraph on MongoDB production readiness, a few lines in…

#

02 Jul: Why (and how) you should stop writing shell scripts

If you have worked on a Big Data project, you have probably seen, and maybe used, some shell scripts. Honestly, I love hearing “The future is now” while talking about a bunch of scripts scheduled by Oozie, but it seems we can’t create a data project in 2018 without some lets-run-it.sh file. For the last 7 years I have seen many people write x-SH scripts for various reasons, but the main reason today (at least on Big…

#

24 Jun: Protobuf and lib conflicts: how to use gRPC with HBase

I’m a huge fan of gRPC. Really. I talked about it some months (years now…) ago, and so far it has met all my needs: high-performance, lightweight, well-structured, simple… with an active community behind it. Even Netflix, one of the major advocates of the REST approach in the open source community, began switching to gRPC last year and placed Ribbon, their huge client-side IPC library, in maintenance mode. And the ecosystem is still growing: Nginx recently announced a native…

#

06 Dec: Deploy Big Data Application at scale with Ansible

If you have recently worked on a Big Data project, you must have had some headaches when deploying your applications to your various environments. As in other, non-Big Data projects, you have to deal with things like configuration, scripts, provisioning, libraries, shared components… These can be jar files to copy to HDFS, workflows and coordinators to deploy in Oozie, tables and namespaces to create in HBase, and many other things. In short, all…

#

29 Mar: Playing with Stack Overflow data

I used Stack Overflow data to gauge the interest in some of the products I follow (yes, HBase, Spark and others). The interest is calculated for each month over the last 5 years and is based on the number of posts and replies associated with a tag (e.g. hdfs, elasticsearch and so on). Remember that Stack Overflow is a (huge) developer community focused on programming questions, so the results are inherently biased. Indeed,…

#

23 Feb: Testing BigData projects

Writing tests that use a traditional database is hard. But writing tests in a project using Hadoop is even harder. Hadoop stacks are complex pieces of software, and testing your Hadoop projects can be a real nightmare: many components are involved (you are not just using HBase, but HBase, ZooKeeper and a DFS); a lot of configuration is needed; cleaning the data of the previous tests relies on many…

#

15 Feb: Fitting Java and Python with JPY

There are many libraries in Java (more than 176,649 unique artifacts indexed on Maven Central alone), but sometimes you cannot find what you are looking for, except as a Python equivalent. In a previous project, I had to deal with custom MaxMind databases. MaxMind provides a Java library with a database reader, but does not provide a database writer. After some research, I found one official lib in Perl, and another (unofficial) one in Python. Since,…

#

18 Jan: Solr: manage time-based collections

If you use Solr as your full-text search engine, you may be frustrated by the lack of an equivalent to Curator, the excellent tool from Elastic that lets you manage your indices. Cloudera offers an admin tool for Solr named solrctl, a lightweight utility to supervise a SolrCloud deployment. Although solrctl has some useful commands, it cannot delete old time-based collections. Time-based collections, and more generally sharding/partitioning per time frame, are a common pattern for aggregation but also…

#

23 Dec: HBase: having fun with the shell

The HBase shell is a full interactive JRuby shell (IRB) that provides tools to query your data or execute admin commands on an HBase cluster. Since it uses JRuby, this shell is a powerful interactive scripting environment. This post is not about presenting the commands available in the shell (you can easily find documentation or articles on the Internet), but rather about the possibilities of the shell. Add custom command: actually, there is no easy way…

#

06 Dec: Knox in production: avoid pitfalls and common mistakes

I posted articles about Knox some weeks ago on two subjects: how to use the HBase REST API through Knox and how to submit Spark jobs via the Knox API. In my current assignment, many projects now use Knox as the main gateway for many services like HBase and HDFS, but also for Oozie, YARN… After some weeks of development and production deployment, I’ve decided to write a post about some of the problems that you may…