Apache Hive + bzip2 txt file. Why?

Hi, folks! Several weeks ago I ran into some illogical behavior in Apache Hive and couldn't find any explanation for it, so I decided to write this blog post to share my observations, hoping somebody will benefit from them (I spent half a day debugging this) or at least can explain "why?". So here is the story.

I was tasked with creating a Hive table from a text (CSV) file with bzip2 compression. Everything was fine until I tried to run a query on the newly created table: no matter how simple or complex the query was, the job always ran with 1 map and 1 reduce task. This is clearly not what you would expect, especially when you have to process TBs of data and don't want to wait years for it to finish. I tried different TEZ/Hive settings and browsed the web for an answer with no luck. Finally, after much trial and error, I found the root cause of this behavior, and it was... the TBLPROPERTIES ("skip.header.line.count"="1") setting on the table. This setting was preventing the bzip2 text file from being processed in a distributed fashion.

Let me walk you through a practical example of this.

Here I'm going to use a GCP Dataproc cluster (1 master + 3 worker nodes), which has Hive installed out of the box and also exposes the YARN and TEZ web UIs. For more info on how to configure and use Hive on GCP Dataproc, see here.
The data I use in this example is from here.

I pulled a bigger sample, unzipped it, compressed it with bzip2, and uploaded it to a GCS bucket that Dataproc has access to.
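If you want to reproduce the prep step, a minimal sketch looks like this. The file contents and bucket path are placeholders I made up for illustration; the actual upload is left as a comment since it depends on your bucket and credentials:

```shell
# Create a tiny CSV sample with a header row (stand-in for the real ratings file).
printf 'userId,movieId,rating,timestamp\n1,1,4.0,964982703\n1,3,4.0,964981247\n' > ratings.csv

# Compress with bzip2; this produces ratings.csv.bz2 and removes the original.
bzip2 -f ratings.csv

# Sanity check: the compressed file decompresses back to valid CSV.
bzcat ratings.csv.bz2 | head -n 2

# Upload to a bucket your Dataproc cluster can read (placeholder bucket name):
# gsutil cp ratings.csv.bz2 gs://YOUR_BUCKET/test-data/
```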

Once you are logged in to Hive, create a database:

CREATE DATABASE testdb;

Next, set this database as the default:

USE testdb;

Create a Hive table with TBLPROPERTIES ("skip.header.line.count"="1"):

CREATE EXTERNAL TABLE ratings_csv (
  userId STRING,
  movieId STRING,
  rating STRING,
  timestamp_a STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'gs://dataproc-cea473b0-b991-4ae4-8b1b-855b58485241-us-central1/test-data/'
TBLPROPERTIES ("skip.header.line.count"="1");
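To double-check that the property actually landed on the table, you can inspect it with standard Hive statements:

```sql
SHOW TBLPROPERTIES ratings_csv;
-- or look for skip.header.line.count in the full table description:
DESCRIBE FORMATTED ratings_csv;
```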

Create a Hive table without TBLPROPERTIES ("skip.header.line.count"="1"):

CREATE EXTERNAL TABLE ratings_csv_2 (
  userId STRING,
  movieId STRING,
  rating STRING,
  timestamp_a STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'gs://dataproc-cea473b0-b991-4ae4-8b1b-855b58485241-us-central1/test-data/';

And now the fun part: if you run the same query against each table, you'll see a difference in parallelization/tasks.

select sum(rating) from ratings_csv where userid='1';

Now let's run the same query on the second table:

select sum(rating) from ratings_csv_2 where userid='1';

Although the number of reducers didn't increase, because our data is pretty small, the number of mappers is now 4 instead of 1. If you run this on a really big dataset, you will see many more mappers/reducers. This behavior dramatically impacts performance, so be aware of this "feature" in Hive and avoid falling into the trap. One way around it is to pay the single-mapper cost only once: create another Hive table in a splittable format such as ORC, then run an INSERT OVERWRITE from the header-skipping table, so the header row is dropped during the one-time load. Like below:

CREATE EXTERNAL TABLE ratings_orc (
  userId STRING,
  movieId STRING,
  rating STRING,
  timestamp_a STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

Populate data in new table:

INSERT OVERWRITE TABLE ratings_orc
SELECT userid, movieid, rating, timestamp_a
FROM ratings_csv;
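As a sanity check on the load, you can verify the CSV header row didn't slip into the ORC table as data (assuming the source table skips it):

```sql
-- Should return 0 if the header line was skipped during the load.
SELECT count(*) FROM ratings_orc WHERE userid = 'userId';
```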

Run your query against the new table and observe that it runs as fast as you would expect:

select sum(rating) from ratings_orc where userid='1';