2016 November

19 Nov: Working with Parquet files

Apache Parquet is a columnar storage format available to most data processing frameworks in the Hadoop ecosystem: Hive, Pig, Spark, Drill, Arrow, Apache Impala, Cascading, Crunch, Tajo … and many more! In Parquet, the data is compressed column by column. This means that commands like hdfs dfs -cat hdfs://nn1.example.com/file1 or hdfs dfs -text /…/file2 no longer produce readable output on Parquet files: all you see are binary chunks on your terminal. Thankfully, Parquet provides an…

16 Nov: Using HBase REST API with the Knox Java client

I’ve already introduced Knox in a previous post about submitting a Spark job through Knox with the Java client. This post is still about the Knox Java client, but here we’ll see another use for it: HBase. HBase provides a rich, well-documented REST API with many endpoints exposing the data in various formats (JSON, XML and Protobuf!). First, we need to import the dependency for the Knox Java client: <dependency> <groupId>org.apache.knox</groupId> <artifactId>gateway-shell</artifactId> <version>0.10.0</version> </dependency>…

09 Nov: Submitting Spark Job via Knox on Yarn

Apache Knox is a REST API gateway for interacting with Apache Hadoop clusters. It offers an extensible reverse proxy that securely exposes REST APIs and HTTP-based services in any Hadoop platform. Although Knox is not designed to be a channel for high-volume data ingest or export, it is perfectly suited to exposing a single entry point to your cluster and can be seen as a bastion for all your applications. One of the possible use cases of Knox…

02 Nov: Microservices and gRPC: Use Atomix as service discovery

gRPC is a modern, open source, high-performance RPC framework initiated by Google and supported on many languages and platforms (C++, Java, Go, Node, Ruby, Python and C# across Linux, Windows, and Mac). It is used by many projects (etcd/CoreOS, containerd/Docker, cockroachdb/Cockroach Labs…) and has reached a significant milestone with its 1.0 release. Used in distributed environments where a large number of microservices are running, gRPC supports rich cloud-oriented features like: – load balancing/discovery –…