DevOps

02 Apr: LLAP & CGroups: a marriage made in heaven

Hive LLAP (Live Long and Process), also called Interactive Query on HDInsight, is a service that promises sub-second response times for queries on very large tables. To reach interactive performance levels, LLAP builds on Hadoop: it uses the Tez execution engine and adds LLAP daemons that cache data, manage JIT optimization, and eliminate most of the startup costs. Caching, pre-fetching, some query processing and access control are moved into the daemons….

09 Jul: Back in time: unreliable clocks and distributed computing

Many scalable NoSQL databases, such as Cassandra, HBase and MongoDB, provide tunable consistency so that a specific level of guarantees can be chosen for each operation. What makes them scalable also makes them vulnerable: in all cases the whole cluster must run on synchronized clocks. Given how important this is, it is quite surprising that the product documentation says so little about it: one chapter in the HBase documentation, a paragraph in the MongoDB production readiness, a few lines in…
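To make the clock dependency concrete, here is a minimal sketch of timestamp-based "last write wins" conflict resolution. It is an illustration only, not the implementation of any of the databases named above; the `LastWriteWins` class, the key name and the clock offsets are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class LastWriteWins {

    /** A stored value together with the timestamp supplied by the writer. */
    private static final class Cell {
        final String value;
        final long timestampMillis;

        Cell(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    /** Keeps the write only if its timestamp is not older than the stored one. */
    public void write(String key, String value, long writerClockMillis) {
        Cell current = cells.get(key);
        if (current == null || writerClockMillis >= current.timestampMillis) {
            cells.put(key, new Cell(value, writerClockMillis));
        }
    }

    /** Returns the current value for the key, or null if nothing was written. */
    public String read(String key) {
        Cell cell = cells.get(key);
        return cell == null ? null : cell.value;
    }

    public static void main(String[] args) {
        LastWriteWins store = new LastWriteWins();
        long realTime = 1_000_000L; // hypothetical wall-clock time in milliseconds

        // Node A has a correct clock and writes first.
        store.write("user:1", "written by A", realTime);

        // Node B writes 50 ms later in real time, but its clock runs 200 ms behind,
        // so its timestamp is older than A's and the write is silently discarded.
        store.write("user:1", "written by B", realTime + 50 - 200);

        System.out.println(store.read("user:1")); // prints "written by A"
    }
}
```

The most recent write in real time is lost without any error, which is why a drifting clock on a single node can compromise the data of the whole cluster.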

06 Dec: Deploy Big Data Application at scale with Ansible

If you have recently worked on a Big Data project, you must have had some headaches deploying your applications to your various environments. As in other, non-Big Data projects, you are confronted with things like: – configuration – scripts – provisioning – libraries – shared components – … These can be jar files to copy to HDFS, workflows and coordinators to deploy in Oozie, tables and namespaces to create in HBase, and many other things. In short, all…

25 Oct: Efficient logging with Spring Boot, Logback and Logstash

Logging is an important part of any enterprise application and Logback makes an excellent choice: it’s simple, fast, light and very powerful. Spring Boot has great support for Logback and provides a lot of features to configure it. In this article I will present an integration for an enterprise logging stack using Logback, Spring Boot and Logstash. WARNING: Spring Boot recommends using the -spring variants for your logging configuration (for example logback-spring.xml rather than…

10 Oct: OS monitoring with… Java

Sometimes it may be useful to get system information such as the usage of a disk or the available network interfaces. For instance, Elasticsearch uses this kind of tool to display, at startup time, some information about open file descriptors or the size of the direct memory available to the JVM. The aim is not to replace a real system monitoring agent, but to guide the user to take advantage of the product by configuring it…
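As a rough illustration of the idea, the sketch below reads disk usage and network interface names using only the standard JDK. It is not the code from the article or from Elasticsearch; the `SystemInfo` class and its method names are hypothetical, and open file descriptors and direct memory are left out because they need platform-specific APIs.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class SystemInfo {

    /** Percentage of used space on the partition holding the path (0 if the size is unknown). */
    public static double diskUsagePercent(File path) {
        long total = path.getTotalSpace();
        if (total == 0) {
            return 0.0;
        }
        long used = total - path.getFreeSpace();
        return 100.0 * used / total;
    }

    /** Names of the network interfaces visible to the JVM. */
    public static List<String> networkInterfaceNames() throws SocketException {
        List<String> names = new ArrayList<>();
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        if (interfaces != null) {
            for (NetworkInterface nic : Collections.list(interfaces)) {
                names.add(nic.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws SocketException {
        File root = new File("/");
        System.out.printf("Disk usage of %s: %.1f%%%n", root, diskUsagePercent(root));
        System.out.println("Network interfaces: " + networkInterfaceNames());
        System.out.println("Available processors: " + Runtime.getRuntime().availableProcessors());
        // Returns a negative value on platforms where the load average is not available.
        System.out.println("System load average: "
                + ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
    }
}
```

Values like these are best treated as startup hints for configuration, in line with the aim stated above, rather than as a substitute for a monitoring agent.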

06 Oct: Find and kill slow running queries in MongoDB

In MongoDB, or more generally in any data storage engine, queries or updates that take longer than expected to run can have many causes: – Slow network – Wrong schema design (we have all seen the famous all-in-one table…) – Wrong database design (“let’s store 100 TB of data in a standalone mongod!”) – Bad partitioning (HBase table with 200 regions with 2MB of data) – Lack of useful indexes – No statistics – Incorrect hardware…

25 Nov: How to kill Hadoop jobs matching a pattern?

Today, I had to kill a list of jobs (45) running on my Hadoop cluster. OK, let’s have a look at the docs: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#job But wait a minute… Hadoop knows the “kill” command, but not “pkill”… One solution is: import java.io.IOException; import org.apache.commons.cli.CommandLine; import org.apache.commons.cli.CommandLineParser; import org.apache.commons.cli.HelpFormatter; import org.apache.commons.cli.Options; import org.apache.commons.cli.ParseException; import org.apache.commons.cli.PosixParser; import org.apache.commons.lang.ArrayUtils; import org.apache.commons.lang.StringUtils; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobStatus; import org.apache.hadoop.mapred.RunningJob; import org.slf4j.Logger; import org.slf4j.LoggerFactory; public class PKill { private final…
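The selection step of a pkill-style tool can be sketched independently of Hadoop. The example below is not the truncated PKill class above: the `JobMatcher` class, the job ids and the job names are hypothetical, and the actual kill call through the Hadoop client is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class JobMatcher {

    /** Returns the ids of the jobs whose name contains a match for the regular expression. */
    public static List<String> matchingJobIds(Map<String, String> jobNamesById, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> job : jobNamesById.entrySet()) {
            if (pattern.matcher(job.getValue()).find()) {
                ids.add(job.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical running jobs: id -> name.
        Map<String, String> running = new LinkedHashMap<>();
        running.put("job_0001", "daily-import-orders");
        running.put("job_0002", "weekly-report");
        running.put("job_0003", "daily-import-customers");

        // Each id returned here would then be passed to the cluster's kill operation.
        for (String id : matchingJobIds(running, "^daily-import-")) {
            System.out.println("Would kill " + id);
        }
    }
}
```

Printing the matches before killing anything gives a dry run, which is a sensible default when the pattern may catch more jobs than intended.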

17 Oct: How to generate a changelog from Jira for your deb/rpm/…

A changelog is a log or record of changes made to a project, such as a website or software project, usually including such records as bug fixes, new features, etc. Some open source projects include a changelog as one of the top level files in their distribution. If you are running an RPM-based distribution (CentOS, Fedora, Red Hat…), you can read it via the rpm command: rpm -q --changelog vim-enhanced.x86_64 | less For Debian-based distributions, you…

03 Sep: Installing Tomcat 7 on Debian/Ubuntu

First, a simple apt-get: apt-get install tomcat7 libtcnative-1 tomcat7-user tomcat7-docs tomcat7-admin. Wait, “libtcnative-1”? Tomcat can use the Apache Portable Runtime to provide superior scalability, performance, and better integration with native server technologies. The Apache Portable Runtime is a highly portable library that is at the heart of Apache HTTP Server 2.x. APR has many uses, including access to advanced IO functionality (such as sendfile, epoll and OpenSSL), OS level functionality (random number generation, system status, etc.), and…

15 Apr: How to deploy an Elasticsearch cluster easily

Here is a simple shell script that lets you deploy Elasticsearch on multiple servers with dedicated roles: master, slave or monitor. -Master: can be an Elasticsearch master, acts as load balancer on the cluster, doesn’t store data and can use the http transport. -Slave: a data node, cannot be an Elasticsearch master and cannot use the http transport. -Monitor: doesn’t store data, cannot be an Elasticsearch master, holds plugins and can use the http transport….