Testing BigData projects

Writing tests that use a traditional database is hard, but writing tests for a project using Hadoop is even harder. Hadoop stacks are complex pieces of software, and testing your Hadoop projects can be a real nightmare:
– many components are involved: you are not just using HBase, but HBase, Zookeeper and a DFS
– a lot of configuration is needed
– cleaning the data of the previous tests requires many steps: truncate the HBase table, stop and delete the running Oozie workflows, clean HDFS… (a sketch of such a cleanup routine follows below)
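
To make this concrete, here is a minimal sketch of such a cleanup routine, using the standard HBase Admin, HDFS FileSystem and Oozie client APIs; the table name, HDFS path and Oozie URL are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public static void cleanBetweenTests() throws Exception {
    // 1. Truncate the HBase table (it has to be disabled before truncation).
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {
        TableName table = TableName.valueOf("my_table");
        admin.disableTable(table);
        admin.truncateTable(table, false);
    }

    // 2. Kill the Oozie workflows that are still running.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
    for (WorkflowJob job : oozie.getJobsInfo("status=RUNNING")) {
        oozie.kill(job.getId());
    }

    // 3. Clean up the HDFS working directory.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.delete(new Path("/tmp/example"), true);
}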

We are not talking about unit testing here, but about integration and functional testing. Unit tests should never use external systems such as a web service or a database. Moreover, the goal is not to verify that the Hive query has been correctly generated with respect to SQL:2011, or that the Oozie XML workflow is valid against an XSD, but to ensure that the Hive query returns the correct results and that the Oozie workflow has been accepted and is now running.

Take this example: a webapp uses Knox as the entry point to the Hadoop cluster, puts a file in HDFS and adds a row in HBase. Four components are involved: Zookeeper, HDFS, HBase and Knox. This post presents and compares four solutions: a VM sandbox, Docker, an embedded (in-JVM) stack and a lightweight external standalone.

VM Sandbox

Most of the major Hadoop vendors provide an all-in-one virtual machine with their products installed:
– Hortonworks Sandbox
– Cloudera Quickstart
– MapR Sandbox
– …

Pros:
– This is actually the most complete mode: all the components are installed in the same way as (or very close to) your real Hadoop cluster. You have access to the administration panel (Cloudera Manager, Ambari for Hortonworks…), so if your application calls the REST API of the administration panel (in order to retrieve the configuration or whatever), this is the only mode offering you this possibility (see the sketch after this list)
Cons:
– Unless you have a laptop with a Xeon/Core i7 HQ and 32GB of RAM, this mode can be veeeeery slow
– The cleaning phase between your tests may be complicated: you cannot just press a red button to reset your VM
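
As an illustration, here is a hedged sketch of the kind of administration REST call mentioned above, against the Ambari instance of a Hortonworks sandbox; the host, port and credentials are assumptions, adapt them to your own VM:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariCheck {
    public static void main(String[] args) throws Exception {
        // List the clusters managed by Ambari (basic auth, default admin/admin credentials).
        URL url = new URL("http://sandbox.hortonworks.com:8080/api/v1/clusters");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        String credentials = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        connection.setRequestProperty("Authorization", "Basic " + credentials);

        // Ambari answers with a JSON document describing the managed clusters.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}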

Docker

Many components of the Hadoop stack are available on Docker Hub. You will find Ambari from SequenceIQ, Zookeeper, Spark on YARN and much more. Cloudera even provides its Cloudera Quickstart as a simple Docker image.

Pros:
– You can compose complex stacks, even beyond the default Hadoop stack (Elasticsearch, MongoDB, Cassandra…)
Cons:
– Managing the configuration and the dependencies can be difficult
– Contributions and quality may vary: some images are very popular and ready to use, while others are just experimental (or simply don’t work)
– The cleaning phase between your tests may be complicated

In the JVM

I recently contributed to the hadoop-mini-clusters project. hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE, without the need for a full-blown development cluster or container orchestration (did someone say Docker?). If you have already worked with JPA/Hibernate, hadoop-mini-clusters is the HSQLDB (or H2) of your Hadoop project.

NOTE
Most of the projects in hadoop-mini-clusters work on Windows 7, except MapReduce/YARN. See: Access Denied Windows 7

First import “some” Maven dependencies:

<!-- Versions (to declare in the <properties> section of your pom) -->
<minicluster.version>0.1.11</minicluster.version>
<knox.version>0.9.0.2.5.3.0-37</knox.version>

<dependency>
    <groupId>com.github.sakserv</groupId>
    <artifactId>hadoop-mini-clusters-hdfs</artifactId>
    <version>${minicluster.version}</version>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.github.sakserv</groupId>
    <artifactId>hadoop-mini-clusters-hbase</artifactId>
    <version>${minicluster.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>servlet-api-2.5</artifactId>
        </exclusion>
        <exclusion>
            <groupId>asm</groupId>
            <artifactId>asm</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.github.sakserv</groupId>
    <artifactId>hadoop-mini-clusters-zookeeper</artifactId>
    <version>${minicluster.version}</version>
</dependency>
<dependency>
    <groupId>com.github.sakserv</groupId>
    <artifactId>hadoop-mini-clusters-knox</artifactId>
    <version>${minicluster.version}</version>
    <exclusions>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<!-- Knox services -->
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-service-webhdfs</artifactId>
    <version>${knox.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-service-hbase</artifactId>
    <version>${knox.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-util-common</artifactId>
    <version>${knox.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-provider-security-authc-anon</artifactId>
    <version>${knox.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-provider-identity-assertion-pseudo</artifactId>
    <version>${knox.version}</version>
</dependency>

<!-- Knox client -->
<dependency>
    <groupId>org.apache.knox</groupId>
    <artifactId>gateway-shell</artifactId>
    <version>0.11.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.knox</groupId>
            <artifactId>gateway-util-common</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<!-- Others -->
<dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>javax.servlet-api</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>

Next, let’s code and prepare the test environment:

private static HbaseLocalCluster hbaseLocalCluster;
private static ZookeeperLocalCluster zookeeperLocalCluster;
private static HdfsLocalCluster hdfsLocalCluster;
private static KnoxLocalCluster knoxCluster;

@BeforeClass
public static void setUp() throws Exception {
    zookeeperLocalCluster = new ZookeeperLocalCluster.Builder()
        .setPort(12345)
        .setTempDir("embedded_zookeeper")
        .setZookeeperConnectionString("localhost:12345")
        .setMaxClientCnxns(60)
        .setElectionPort(20001)
        .setQuorumPort(20002)
        .setDeleteDataDirectoryOnClose(false)
        .setServerId(1)
        .setTickTime(2000)
        .build();
    zookeeperLocalCluster.start();

    hdfsLocalCluster = new HdfsLocalCluster.Builder()
        .setHdfsNamenodePort(8020)
        .setHdfsNamenodeHttpPort(50070)
        .setHdfsTempDir("embedded_hdfs")
        .setHdfsNumDatanodes(1)
        .setHdfsEnablePermissions(false)
        .setHdfsFormat(true)
        .setHdfsEnableRunningUserAsProxyUser(true)
        .setHdfsConfig(new Configuration())
        .build();
    hdfsLocalCluster.start();

    hbaseLocalCluster = new HbaseLocalCluster.Builder()
        .setHbaseMasterPort(25111)
        .setHbaseMasterInfoPort(-1)
        .setNumRegionServers(1)
        .setHbaseRootDir("embedded_hbase")
        .setZookeeperPort(12345)
        .setZookeeperConnectionString("localhost:12345")
        .setZookeeperZnodeParent("/hbase-unsecure")
        .setHbaseWalReplicationEnabled(false)
        .setHbaseConfiguration(new Configuration())
        .activeRestGateway()
            .setHbaseRestHost("localhost")
            .setHbaseRestPort(28000)
            .setHbaseRestReadOnly(false)
            .setHbaseRestThreadMax(100)
            .setHbaseRestThreadMin(2)
            .build()
        .build();

    hbaseLocalCluster.start();

    knoxCluster = new KnoxLocalCluster.Builder()
            .setPort(8888)
            .setPath("gateway")
            .setHomeDir("embedded_knox")
            .setCluster("mycluster")
            .setTopology(XMLDoc.newDocument(true)
                    .addRoot("topology")
                        .addTag("gateway")
                            .addTag("provider")
                                .addTag("role").addText("authentication")
                                .addTag("enabled").addText("false")
                                .gotoParent()
                            .addTag("provider")
                                .addTag("role").addText("identity-assertion")
                                .addTag("enabled").addText("false")
                                .gotoParent().gotoParent()
                        .addTag("service")
                            .addTag("role").addText("NAMENODE")
                            .addTag("url").addText("hdfs://localhost:8020")
                            .gotoParent()
                        .addTag("service")
                            .addTag("role").addText("WEBHDFS")
                            .addTag("url").addText("http://localhost:50070/webhdfs")
                            .gotoParent()
                        .addTag("service")
                            .addTag("role").addText("WEBHBASE")
                            .addTag("url").addText("http://localhost:28000")
                    .gotoRoot().toString())
            .build();
    knoxCluster.start();
}

We have started many components: Knox, HDFS (a namenode with a single datanode), Zookeeper and HBase (a master with a single regionserver). Since Knox acts as a reverse proxy for HDFS and HBase, we activate the HBase REST gateway and the HDFS HTTP endpoints, and configure a small topology for Knox. Now you can use your client (here the Knox Shell):

Hadoop hadoop = Hadoop.loginInsecure("https://localhost:8888/gateway/mycluster", "", "");

BasicResponse response = null;

// HDFS
try {
    response = Hdfs.mkdir(hadoop).dir("/tmp/example").now();
} finally {
    close("hdfs mkdir", response);
}
try {
    response = Hdfs.put(hadoop).file(".../my-file").to("/tmp/example/my-file").now();
} finally {
    close("hdfs put", response);
}

// HBase
try {
    response = HBase.session(hadoop).table("my_table").create()
            .family("family1").endFamilyDef()
            .family("family2").endFamilyDef()
            .now();
} finally {
    close("hbase create", response);
}
try {
    response = HBase.session(hadoop).table("my_table").row("row_id_1").store()
            .column("family1", "col1", "col_value1")
            .column("family1", "col2", "col_value2", 1234567890l)
            .column("family2", null, "fam_value1")
            .now();
} finally {
    close("hbase put", response);
}

hadoop.shutdown(10, TimeUnit.SECONDS);
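
The close(...) calls above refer to a small helper that is not part of the Knox Shell API; here is a minimal sketch, assuming the BasicResponse getStatusCode() and close() accessors:

private static void close(String operation, BasicResponse response) {
    if (response != null) {
        // Log the HTTP status returned by Knox, then release the underlying connection.
        System.out.println(operation + " -> HTTP " + response.getStatusCode());
        try {
            response.close();
        } catch (Exception e) {
            // Ignore: the response was only needed for its status code.
        }
    }
}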

And that’s it!

Don’t forget to stop the services:

@AfterClass
public static void tearDown() throws Exception {
    hbaseLocalCluster.stop();
    zookeeperLocalCluster.stop();
    hdfsLocalCluster.stop();
    knoxCluster.stop();
}

Pros:
– No more external dependencies
– The environment can be fully reset
– You can debug inside your IDE
– You can find extra projects: MongoDB, ActiveMQ…
– All the services are in the same JVM!
Cons:
– All the services are in the same JVM: you ~~may~~ will have to deal with classpath issues and jar hell
– Some components are missing (Ranger, Ambari, Accumulo, Phoenix…)
– You cannot launch tests in parallel unless you change the ports for each test… (see the sketch below)
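
For instance, a minimal sketch of a helper that asks the OS for a free port before building the clusters (the helper itself is hypothetical, not part of hadoop-mini-clusters):

import java.io.IOException;
import java.net.ServerSocket;

// Ask the OS for a free TCP port so that parallel test JVMs do not collide.
private static int freePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
        return socket.getLocalPort();
    }
}

// In setUp(), use it instead of the hard-coded ports:
// int zkPort = freePort();
// zookeeperLocalCluster = new ZookeeperLocalCluster.Builder()
//         .setPort(zkPort)
//         .setZookeeperConnectionString("localhost:" + zkPort)
//         ...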

Outside the JVM

A friend of mine has put a huge amount of work into a project called hadoop-unit. hadoop-unit is a standalone component which can be run locally and which simulates a Hadoop cluster. In remote mode, a dedicated Maven plugin drives it as an external process. hadoop-unit is based on hadoop-mini-clusters.

Pros:
– Light and simple
– You can use your shell!! (Hive, Kafka, HBase…)
Cons:
– Some components are missing (Ranger, Ambari, Accumulo, Phoenix…)
– The cleaning phase requires restarting the standalone instance

Summary

On your continuous integration system, I strongly recommend using the VM Sandbox approach for your nightly builds, running both your functional and integration tests there, and hadoop-mini-clusters for the on-commit builds. The VM sandbox is the most complete and closest environment to your real platform. Most CI systems provide connectors to deploy and drive VMs on vSphere, AWS, VirtualBox… But sandboxes can be difficult to configure, and starting a new VM between the various test suites can be very slow, so nightly builds are the right place for this kind of tests. On the other hand, hadoop-mini-clusters will be much faster and may be used for the regular/on-commit builds.

On your laptop, hadoop-unit and hadoop-mini-clusters are the best choices: hadoop-mini-clusters can be used in your integration tests, while hadoop-unit fits perfectly for your functional testing: you will be able to shut down a component, see how your application responds and verify that errors and timeouts are properly managed (a sketch of such a test follows below).
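
Whether you drive hadoop-unit or the in-JVM mini clusters, such a functional test could look like the following sketch; it reuses the hbaseLocalCluster and Knox topology started earlier, and the way your application should react to the failure is an assumption:

@Test
public void knoxCallFailsGracefullyWhenHBaseIsDown() throws Exception {
    // Simulate an outage of a single component while Knox, HDFS and Zookeeper keep running.
    hbaseLocalCluster.stop();

    Hadoop hadoop = Hadoop.loginInsecure("https://localhost:8888/gateway/mycluster", "", "");
    try {
        HBase.session(hadoop).table("my_table").row("row_id_1").store()
                .column("family1", "col1", "col_value1")
                .now();
        fail("the call should not succeed while HBase is down");
    } catch (Exception expected) {
        // This is where you check that your application turns the failure into a proper
        // error message or timeout instead of hanging.
    } finally {
        hadoop.shutdown(10, TimeUnit.SECONDS);
    }
}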

