Deploy Big Data Applications at Scale with Ansible

If you have recently worked on a Big Data project, you have probably had some headaches deploying your applications to your various environments. As with non-BigData projects, you have to deal with things like:
– configuration
– scripts
– provisioning
– libraries
– shared components
– …

These can be jar files to copy into HDFS, workflows and coordinators to deploy in Oozie, tables and namespaces to create in HBase, and many other things. In short, all the components that make up a BigData application. Of course, you could just script your own provisioner. A friend of mine did it some weeks ago and described it in a very rich article (in French). Although it works nicely, these bash/groovy/python/whatever-scripting-techno-you-love-to-use scripts are not the perfect choice:
– the scripts usually cannot be run safely again and again without a lot of extra coding
– sharing these scripts between applications or projects is complicated (shipped as a jar and unpacked, linked via git submodules…)
– since initiatives like these start from a specific project and may not match the needs of other projects, keeping them maintainable and generic is difficult
– configuration is often bundled with each application, so sharing common configuration is also complicated

I believe that rather than writing custom code to automate our systems, we should focus on the testability and quality of our packages. Now let’s find a provisioning tool which is:

  • repeatable/idempotent: operations can be executed multiple times without changing the result of the initial application
  • reliable
  • simple to use: well, as simple as a provisioning and deployment tool can be…
  • maintainable
  • secure: we often need to manipulate passwords or credentials, so we need a way to use and share sensitive data securely
  • flexible: if we want to extend the tool, we have to be able to create new modules
  • reusable: common definitions have to be shared between applications
  • backed by a rich ecosystem and an active community

I’ve been working a lot with Ansible lately, and it matches all the requirements listed above. Ansible is a configuration management and provisioning tool, similar to Chef, Puppet or Salt. The concept of idempotence is present in most Ansible modules: you specify the desired final state and Ansible decides whether the task needs to run. Ansible uses “Facts”, which are system and environment information it gathers (“context”) before running Tasks. Ansible uses these facts to check the current state and see whether it needs to change anything to reach the desired outcome. This makes it safe to run Ansible Tasks against a server over and over again.
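
To make the "desired state" idea concrete, here is a minimal sketch in Python. It is not Ansible code: FakeHdfs and ensure_directory are hypothetical names invented for this illustration, and the "cluster" is just an in-memory set. The point is only the shape of an idempotent task: check the current state, act only if it differs, and report whether anything changed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHdfs:
    """In-memory stand-in for a cluster (hypothetical, illustration only)."""
    dirs: set = field(default_factory=set)


def ensure_directory(fs: FakeHdfs, path: str) -> dict:
    """Declare the desired state ("path exists") and act only if needed."""
    if path in fs.dirs:
        # Already in the desired state: do nothing and report no change.
        return {"changed": False, "path": path}
    fs.dirs.add(path)
    return {"changed": True, "path": path}


fs = FakeHdfs()
first = ensure_directory(fs, "/apps/app1/lib")
second = ensure_directory(fs, "/apps/app1/lib")
print(first["changed"], second["changed"])  # prints: True False
```

Running the same call twice leaves the system in the same state; only the first run reports a change, which is the behaviour we want from every task in a playbook.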

Although Ansible has many plugins (and this is crazy…), there is no plugin for managing BigData applications. The goal here is to manage and deploy:

  • libraries such as .jar, .so…
  • scheduling resources: workflows, coordinators…
  • groovy/bash/python scripts
  • Avro schemas, HiveQL files…

In this article, I will present a way to deploy your BigData applications. You can find all the code in my personal Git repo.


The code explained here is a proof of concept of what we can do with Ansible and, as-is, is not intended for real production usage.

How to?

In this article, we will focus on a simple example: deploying an Oozie coordinator and its workflow, with all the mandatory files, such as libraries and configuration, in HDFS. The hierarchy for each application in HDFS will look like this:

├── apps
│   ├── app1
│   │   ├── conf
│   │   │   ├── hive-site-optimazed-for-tez.xml
│   │   │   └── spark-default.conf
│   │   ├── lib
│   │   │   ├── app1-1.1.jar
│   │   │   ├── common-1.0.jar
│   │   │   └── custom-hive-serde-1.3.jar
│   │   ├── oozie
│   │   │   ├── coordinators
│   │   │   │   └── all.xml
│   │   │   └── workflows
│   │   │       ├── ingestion.xml
│   │   │       └── ml.xml
│   │   ├── schemas
│   │   │   ├── foo-entity-v1.avro
│   │   │   └── foo-entity-v2.avro
│   │   ├── scripts
│   │   │   ├──
│   │   │   └──

The common HDFS layout and resources should be managed by a ‘common’ reusable role. All the applications use this role in addition to their own custom roles, which define the tasks that deploy the files into the dedicated directories.

All the configuration and the Oozie files are generated from template files. Template files can contain template variables, based on Python’s Jinja2 template engine. By convention, these template files should end in .j2. Our variables don’t have much in common between environments, so we used the Alternative Directory Layout described in the Ansible best practices page. This layout gives you more flexibility for larger environments, as well as a total separation of inventory variables between environments. The Ansible layout looks like this:

├── inventories
│   ├── development
│   │   ├── group_vars
│   │   │   └── all
│   │   │       ├── app1
│   │   │       ├── common
│   │   │       └── vault
│   │   └── hosts
│   └── production
│       ├── group_vars
│       │   └── all
│       │       ├── app1
│       │       ├── common
│       │       └── vault
│       └── hosts
├── library
│   ├──
│   ├──
│   └──
├── roles
│   ├── app1
│   │   ├── files
│   │   │   └── app1-notajar-1.1.jar
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── tasks
│   │   │   └── main.yml
│   │   └── templates
│   │       ├──
│   │       ├── coordinator1.xml.j2
│   │       └── workflow1.xml.j2
│   └── common
│       └── tasks
│           └── main.yml
├── ansible.cfg
├── app1.yml
└── site.yml


The ansible.cfg file controls Ansible’s behaviour. Adding an ansible.cfg file in the playbook root ensures that the settings the playbook needs in order to run are really there. In this config file, we set at least one option:

hostfile = ./inventories/development

The hostfile setting in ansible.cfg (recent Ansible releases call this setting inventory) should point to the development inventory in order to avoid accidentally messing with production or any other environment. Running the playbook against a non-development inventory then forces you to pass the -i option, which is a good safety measure.

site.yml, app1.yml

We create a root playbook called site.yml in the playbook root directory, which includes all other playbooks.
The playbook defined in app1.yml contains only our 2 roles app1 and common.


The inventories directory holds all the inventories you want to apply your playbook to. We use one inventory per environment: one for development, one for integration, another for production, etc…


The library directory contains all our custom modules. Since there are no hdfs or oozie modules, I’ve created them. The code is very simple: we use Knox for all interactions with the cluster. The Python module ‘requests’ is used for all the HTTP stuff.
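
As a rough idea of what talking to the cluster through Knox looks like, the sketch below builds a WebHDFS URL routed through a Knox gateway. It is not taken from the repo: knox_webhdfs_url is a hypothetical helper, the host and the ‘default’ topology name are placeholders, and the /gateway prefix assumes Knox’s default gateway path. Only the standard library is used, so nothing is sent over the network.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway: str, topology: str, hdfs_path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL routed through a Knox gateway (hypothetical helper)."""
    # Assumption: Knox is reachable under its default "/gateway" path.
    base = f"{gateway.rstrip('/')}/gateway/{topology}/webhdfs/v1"
    query = urlencode({"op": op, **params})
    return f"{base}{quote(hdfs_path)}?{query}"


url = knox_webhdfs_url(
    "https://knox.example.com:8443",  # placeholder gateway address
    "default",                        # placeholder topology name
    "/apps/app1/lib",
    "MKDIRS",
    permission="755",
)
print(url)
# prints: https://knox.example.com:8443/gateway/default/webhdfs/v1/apps/app1/lib?op=MKDIRS&permission=755
```

A module would then send this URL with ‘requests’, for example requests.put(url, auth=(user, password)) for MKDIRS, and map the HTTP response to Ansible’s changed/failed result.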


Each subdirectory of the roles folder represents a “role”. A role is a collection of tasks and templates. We have two roles defined here: common and app1. I suggest you have a look at the official documentation if you want more information.
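
The way a role’s templates combine with per-inventory variables can be sketched as follows. Jinja2 is a third-party package, so this illustration deliberately swaps in the standard library’s string.Template as a stand-in: a real .j2 file would write {{ name_node }} where this sketch writes $name_node. The template text, variable names and values are all hypothetical, and the XML is a fragment, not a valid Oozie workflow.

```python
from string import Template

# Hypothetical workflow template (stand-in for roles/app1/templates/workflow1.xml.j2).
WORKFLOW_TEMPLATE = Template(
    '<workflow-app name="$app_name-ingestion">\n'
    '  <!-- jars are read from $name_node/apps/$app_name/lib -->\n'
    '</workflow-app>\n'
)

# One set of variables per inventory, mirroring inventories/<env>/group_vars/all.
INVENTORY_VARS = {
    "development": {"app_name": "app1", "name_node": "hdfs://dev-nn:8020"},
    "production": {"app_name": "app1", "name_node": "hdfs://prod-nn:8020"},
}


def render(environment: str) -> str:
    """Render the template with the variables of one inventory."""
    # substitute() raises KeyError if the inventory lacks a variable,
    # which surfaces a missing setting before anything is deployed.
    return WORKFLOW_TEMPLATE.substitute(INVENTORY_VARS[environment])


print(render("development"))
```

The same template produces a different file for each environment, and the only thing that changes between runs is which inventory is selected.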


We often need to store sensitive data in our Ansible templates, files or variable files; unfortunately, it cannot always be avoided. Ansible has a solution for this called Ansible Vault, which allows you to encrypt any YAML file.

ansible-vault edit inventories/production/group_vars/all/vault

In my GitHub repo, I use ‘test’ for the development inventory and ‘testprod’ for production.


Run all the tasks for all applications on the development cluster:

ansible-playbook --ask-vault-pass -i inventories/development site.yml
# or
ansible-playbook --ask-vault-pass site.yml

Run all the tasks for all applications on the production cluster:

ansible-playbook --ask-vault-pass -i inventories/production site.yml

Run all the tasks for application 1 on the production cluster:

ansible-playbook --ask-vault-pass -i inventories/production app1.yml


And that’s it. Congrats! You can successfully deploy your BigData applications with Ansible.

What next?

The hdfs_copy module compares checksums to decide whether a file needs to be copied to HDFS. In our implementation, however, this means downloading the remote file and then computing its checksum locally. HDFS provides a way to get the checksum of a file via the GETFILECHECKSUM operation, without downloading the file. But that checksum is not easy to compare against: it is an MD5 of the per-block MD5s of the CRC32s computed over each 512-byte chunk, so reproducing the native HDFS checksum takes a bit of work. A Jira is open for the Java part (“Implement ChecksumFileSystem#getFileChecksum equivalent to HDFS for easy check”); it should be a good starting point if you want to do it in Python for Ansible.
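
To give an idea of the work involved, here is a rough Python sketch of the MD5-of-MD5-of-CRC32 structure described above. It is only a starting point and has not been checked against a real cluster: the block size and chunk size are assumed defaults, zlib.crc32 is plain CRC32 whereas a cluster may be configured for a different CRC variant, and the value returned by GETFILECHECKSUM also carries extra header information that is ignored here. The function name composite_checksum is hypothetical.

```python
import hashlib
import struct
import zlib

BYTES_PER_CRC = 512             # assumed size of each CRC'd chunk, in bytes
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size


def composite_checksum(data: bytes, block_size: int = BLOCK_SIZE,
                       bytes_per_crc: int = BYTES_PER_CRC) -> str:
    """MD5 over the per-block MD5s of the per-chunk CRC32s (simplified sketch)."""
    block_digests = b""
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # One big-endian 4-byte CRC32 per chunk of the block.
        crcs = b"".join(
            struct.pack(">I", zlib.crc32(block[i:i + bytes_per_crc]))
            for i in range(0, len(block), bytes_per_crc)
        )
        block_digests += hashlib.md5(crcs).digest()
    return hashlib.md5(block_digests).hexdigest()


local = composite_checksum(b"a" * 2048, block_size=1024)
print(local)
```

Note that the result depends on the block size and the chunk size, so a local computation can only match the remote one if it uses the same settings as the file stored in HDFS. A real module would also stream the file instead of loading it fully in memory.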

Apache Knox User Guide
Ansible Documentation
Ansible Best Practices

“White Smoke of a Space Craft” by pixabay is licensed under CC0 1.0 / Resized
