Working with Parquet files

Apache Parquet is a columnar storage format available to most of the data processing frameworks in the Hadoop ecosystem.

In Parquet, the data are compressed column by column. This means that commands like these:

hdfs dfs -cat hdfs://nn1.example.com/file1
hdfs dfs -text /.../file2

no longer work on Parquet files; all you will see is binary chunks in your terminal. Thankfully, Parquet provides a useful project for inspecting Parquet files: Parquet Tools

For more convenient use, Parquet Tools should be installed on all of your servers (Master, Data, Processing, Archiving and Edge nodes). All you have to do is download the parquet-tools-<VERSION>.jar

NOTE
Currently, these tools are only available for UN*X systems.

Using it is pretty simple: just call the “hadoop jar” CLI (for local use, you can use “java -jar” instead):

hadoop jar /.../parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet

Here is the list of available commands (found in the source code):

  • cat: displays all the content of the file on standard output. Use -j or --json to show records in JSON format
  • head: displays only the first records. The number of rows can be given with the -n or --records option, default is 5
  • schema: shows the schema. Use -d or --detailed to get more information
  • meta: shows the metadata stored in the footer of the file
  • dump: dumps the file to standard output. Multiple options are available:
    • -c,--column: dump only the given column, can be specified more than once
    • -d,--disable-data: do not dump column data
    • -m,--disable-meta: do not dump row group and page metadata
    • -n,--disable-crop: do not crop the output based on console width
  • merge: merges Parquet files/directories into a single file

Let’s create a simple Parquet file and see what can be done:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.IntStream;

public class Main {

    public static void main(String[] args) throws IOException {

        // Remove output from a previous run
        Files.deleteIfExists(Paths.get("/tmp/parquet/data.parquet"));

        String schemaLocation = "/tmp/accesslog.json";
        Schema avroSchema = new Schema.Parser().parse(new File(schemaLocation));

        String path = "file:///tmp/parquet/data.parquet";

        // Write ten sample records, compressed with Snappy, using the
        // default block and page sizes
        try (ParquetWriter<GenericRecord> parquetWriter
                     = AvroParquetWriter.<GenericRecord>builder(new Path(path))
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withSchema(avroSchema).build()) {

            IntStream.range(0, 10).boxed()
                    .map(i -> record(avroSchema, i))
                    .forEach(r -> {
                        try {
                            parquetWriter.write(r);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    });
        }

        // Read the file back and print every field of each record
        try (ParquetReader<GenericRecord> parquetReader
                     = AvroParquetReader.<GenericRecord>builder(new Path(path)).build()) {
            GenericRecord read;
            while ((read = parquetReader.read()) != null) {
                List<Schema.Field> fields = read.getSchema().getFields();
                System.err.println("--------");
                for (Schema.Field f : fields) {
                    System.err.println(f.name() + ": " + read.get(f.pos()));
                }
            }
        }
    }

    private static GenericRecord record(Schema avroSchema, int id) {
        GenericRecord record = new GenericData.Record(avroSchema);
        record.put("id", id);
        record.put("useragent", "LeUserAgent");
        record.put("ip", "10.0.0." + id);
        record.put("path", "/path/" + id);
        return record;
    }
}

No need to deal with Spark or Hive to create a Parquet file: just a few lines of Java. A simple AvroParquetWriter is instantiated with the default options, such as a block size of 128MB and a page size of 1MB. Snappy is used as the compression codec, and the following Avro schema has been defined:

{
   "type":"record",
   "name":"AccessLog",
   "namespace":"fr.layer4.parquet",
   "fields":[
      {
         "name":"id",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"useragent",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"ip",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"path",
         "type":[
            "string",
            "null"
         ]
      }
   ]
}
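
As a side note, if you prefer to avoid the external JSON file, the same schema can be built programmatically with Avro's SchemaBuilder. A minimal sketch, equivalent to the JSON above:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Builds the same AccessLog schema as the JSON file above,
// keeping the ["type", "null"] union order of each field
Schema avroSchema = SchemaBuilder.record("AccessLog")
        .namespace("fr.layer4.parquet")
        .fields()
        .name("id").type().unionOf().intType().and().nullType().endUnion().noDefault()
        .name("useragent").type().unionOf().stringType().and().nullType().endUnion().noDefault()
        .name("ip").type().unionOf().stringType().and().nullType().endUnion().noDefault()
        .name("path").type().unionOf().stringType().and().nullType().endUnion().noDefault()
        .endRecord();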

We will need these dependencies too:

        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
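
If the defaults mentioned above (128MB blocks, 1MB pages) do not suit your workload, the builder lets you override them. A minimal sketch, reusing the path and avroSchema variables from the program above; the sizes are arbitrary examples, not recommendations:

// Same writer as before, but with explicit row group (block) and page sizes
ParquetWriter<GenericRecord> parquetWriter
        = AvroParquetWriter.<GenericRecord>builder(new Path(path))
        .withSchema(avroSchema)
        .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
        .withRowGroupSize(64 * 1024 * 1024) // 64MB row groups
        .withPageSize(512 * 1024)           // 512KB pages
        .build();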

Let’s try to see the content with the HDFS CLI:

./bin/hdfs dfs -cat /tmp/parquet/data.parquet
PAR1\Z,  .$l  "L8
                                                    LeUserAgent,
                                                                             LeUserAgent
                                                                                         LeUserAgent�z,10.0.0.910.0.0.0~010.
                            1
                              2
                                3
                                  4
                                    5
                                      6
                                        7
                                          010.0.0.9�|,/path/9/path/0t@/path/0
                                                                                              1
                                                                                                2
                                                                                                  3
                                                                                                    4
                                                                                                      5
                                                                                                        6
                                                                                                          7
                                                                                                            ,8/path/9\Hfr.layer4.parquet.AccessLog%id
                                           %  useragent%
                                                           %ip%
                                                                   %path%L5id��<   &�
          5 useragent��&�<
                                        LeUserAgent
                                                    LeUserAgent,&�
                                                                                  5ip��&�<10.0.0.910.0.0.0&�
               5path��&�</path/9/path/0�,parquet.avro.schema�{"type":"record","name":"AccessLog","namespace":"fr.layer4.parquet","fields":[{"name":"id","type":["int","null"]},{"name":"useragent","type":["string","null"]},{"name":"ip","type":["string","null"]},{"name":"path","type":["string","null"]}]}writer.model.nameavroIparquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)�PAR1%

Not very readable… Even without compressing the data (CompressionCodecName.UNCOMPRESSED), the content is not displayed in a readable way. OK, now let’s use the Parquet Tools. The cat command is pretty simple: it just displays the content record by record:

java -jar /home/devil/git/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar cat --debug file:///tmp/parquet/data.parquet
id = 0
useragent = LeUserAgent
ip = 10.0.0.0
path = /path/0

id = 1
useragent = LeUserAgent
ip = 10.0.0.1
path = /path/1

id = 2
useragent = LeUserAgent
ip = 10.0.0.2
path = /path/2

...

The head command does the same, but displays only the first N records:

java -jar /home/devil/git/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar head -n 1 --debug /tmp/parquet/data.parquet
id = 0
useragent = LeUserAgent
ip = 10.0.0.0
path = /path/0

The meta command displays a summary view of the metadata of the file, such as the schema, the row groups, and the version of Parquet used to build the file:

java -jar /home/devil/git/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar meta --debug /tmp/parquet/data.parquet
file:        file:/tmp/parquet/data.parquet
creator:     parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
extra:       parquet.avro.schema = {"type":"record","name":"AccessLog","namespace":"fr.layer4.parquet","fields":[{"name":"id","type":["int","null"]},{"name":"useragent","type":["string","null"]},{"name":"ip","type":["string","null"]},{"name":"path","type":["string","null"]}]}
extra:       writer.model.name = avro

file schema: fr.layer4.parquet.AccessLog
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 R:0 D:1
useragent:   OPTIONAL BINARY O:UTF8 R:0 D:1
ip:          OPTIONAL BINARY O:UTF8 R:0 D:1
path:        OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:10 TS:486 OFFSET:4
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:78/79/1,01 VC:10 ENC:PLAIN,RLE,BIT_PACKED
useragent:    BINARY SNAPPY DO:0 FPO:82 SZ:87/83/0,95 VC:10 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ip:           BINARY SNAPPY DO:0 FPO:169 SZ:103/168/1,63 VC:10 ENC:PLAIN,RLE,BIT_PACKED
path:         BINARY SNAPPY DO:0 FPO:272 SZ:102/156/1,53 VC:10 ENC:PLAIN,RLE,BIT_PACKED
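
The same footer metadata can also be read from Java. A minimal sketch using ParquetFileReader from parquet-hadoop (pulled in transitively by parquet-avro):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Reads only the footer of the file, not the data pages
ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path("file:///tmp/parquet/data.parquet"));

System.out.println("creator: " + footer.getFileMetaData().getCreatedBy());
System.out.println("schema: " + footer.getFileMetaData().getSchema());

// One entry per row group, like the "row group" sections above
for (BlockMetaData block : footer.getBlocks()) {
    System.out.println("rows: " + block.getRowCount()
            + ", total size: " + block.getTotalByteSize());
}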

The dump command is even more verbose and displays the data row group by row group, then column by column (R and D are the repetition and definition levels):

java -jar /home/devil/git/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar dump file:///tmp/parquet/data.parquet
row group 0
--------------------------------------------------------------------------------
id:         INT32 SNAPPY DO:0 FPO:4 SZ:78/79/1,01 VC:10 ENC:RLE,PLAIN,BIT_PACKED
useragent:  BINARY SNAPPY DO:0 FPO:82 SZ:87/83/0,95 VC:10 ENC:RLE,PLAI [more]...
ip:         BINARY SNAPPY DO:0 FPO:169 SZ:103/168/1,63 VC:10 ENC:RLE,P [more]...
path:       BINARY SNAPPY DO:0 FPO:272 SZ:102/156/1,53 VC:10 ENC:RLE,P [more]...

    id TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST [more]... VC:10

    useragent TV=10 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:10

    ip TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST [more]... VC:10

    path TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST [more]... VC:10

INT32 id
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 ***
value 1:  R:0 D:1 V:0
value 2:  R:0 D:1 V:1
value 3:  R:0 D:1 V:2
value 4:  R:0 D:1 V:3
value 5:  R:0 D:1 V:4
value 6:  R:0 D:1 V:5
value 7:  R:0 D:1 V:6
value 8:  R:0 D:1 V:7
value 9:  R:0 D:1 V:8
value 10: R:0 D:1 V:9

BINARY useragent
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 ***
value 1:  R:0 D:1 V:LeUserAgent
value 2:  R:0 D:1 V:LeUserAgent
value 3:  R:0 D:1 V:LeUserAgent
value 4:  R:0 D:1 V:LeUserAgent
value 5:  R:0 D:1 V:LeUserAgent
value 6:  R:0 D:1 V:LeUserAgent
value 7:  R:0 D:1 V:LeUserAgent
value 8:  R:0 D:1 V:LeUserAgent
value 9:  R:0 D:1 V:LeUserAgent
value 10: R:0 D:1 V:LeUserAgent
...

And finally the merge command (use the code example above to generate two input files; a sketch of the needed change follows):

java -jar /home/devil/git/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar merge --debug /tmp/parquet/data.parquet /tmp/parquet/data2.parquet /tmp/parquet/merge.parquet
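
To produce the second input file, the example program can simply be run again with a different output path. A sketch of the variation, reusing the avroSchema variable and record() helper from the program above:

// Variation of the earlier writer: same schema, different output path,
// so that merge has two input files to work with
String path2 = "file:///tmp/parquet/data2.parquet";
try (ParquetWriter<GenericRecord> writer2
             = AvroParquetWriter.<GenericRecord>builder(new Path(path2))
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withSchema(avroSchema).build()) {
    for (int i = 10; i < 20; i++) {
        writer2.write(record(avroSchema, i));
    }
}

Keep in mind that merge concatenates the row groups of the input files without rewriting them, so merging many small files does not produce bigger row groups.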

That’s all!

Sources:
https://en.wikipedia.org/wiki/Apache_Parquet
https://github.com/apache/parquet-mr/tree/master/parquet-tools
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
https://github.com/google/snappy

Credits:
“Wood” by linthesky is licensed under CC BY-NC 2.0 / Upscaled and resized
