Hive for High Fives.

History


Why Hive?

For example, word counting is a common first task when learning Hadoop. Written as a Java MapReduce job it takes at least 63 lines. In Hive, it takes 7.
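For reference, a Hive word count looks something like this. This is a minimal sketch: the table name docs, the column line and the input path are all hypothetical placeholders.

CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH 'path/to/input.txt' OVERWRITE INTO TABLE docs;
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

The split and explode functions turn each line into one row per word, and the GROUP BY does the counting.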

What is Hive?

How to Hive?

You can also use third-party programs/platforms. I have used Apache Zeppelin notebooks, which are essentially coding notebooks for Hadoop. What is a coding notebook? They are online notebooks that let you place text and code side by side, so you can write about what you want, explain all the functions, and have you or your viewers run code directly below the explanation. I couldn’t recommend them enough!

Data separation

We can create these databases as simply as CREATE DATABASE nameOfDataBase. We can also see what databases already exist with SHOW DATABASES. To delete a database we type DROP DATABASE IF EXISTS nameOfDataBase CASCADE. Note the CASCADE part: Hive will only drop an empty database, and CASCADE drops all the tables inside it first.
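Put together, with nameOfDataBase standing in for whatever you call yours, those commands look like this:

CREATE DATABASE nameOfDataBase;
SHOW DATABASES;
DROP DATABASE IF EXISTS nameOfDataBase CASCADE;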

All of these commands are used in the Hive CLI and are known as DDL commands (Data Definition Language). These commands are used to define, build and modify databases and tables.

DDL — Add data

CREATE TABLE myFirstTable
( id INT, name STRING, city STRING, items ARRAY<STRING> )
PARTITIONED BY (age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE

This code is readable but worth explaining. Our data file is delimited by | symbols, as opposed to the typical , in a CSV, for example. We also know that the items inside the array are separated by :. We can change these if our data is formatted differently. The part in brackets lays out the columns and data types for our table. Simple!

All simple except the partition. Here we pick the column we want to split the data by and its data type. Note that age isn’t listed in the column definitions because it is declared in the partition instead. That is it.
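To make the delimiters concrete, a hypothetical line of the data file for this table could look like the following (the age value isn’t in the file, it comes from the partition you load into):

1|Anna|London|bread:milk:eggs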

A call for help.

Next question, where are these tables stored?

Managed tables are fully controlled by Hive and hence are easy to navigate through.

The problem is that they can’t be shared outside of Hive, and as such it is necessary to make an external table if you want other applications to have access to the data. To do this, use CREATE EXTERNAL TABLE nameOfTable ... LOCATION 'location/of/db'.
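A minimal sketch of that, with the columns and the location path made up for illustration:

CREATE EXTERNAL TABLE nameOfTable
( id INT, name STRING, city STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION '/location/of/db';

Dropping an external table only removes Hive’s metadata; the files at that location stay where they are.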

We have finished setting things up. That means we have finished defining things. That means we have finished with Data Definition Language (DDL).

We have defined, now let’s manipulate

DML — Moving data in?

LOAD DATA LOCAL INPATH 'path/goes/here/data.dat' OVERWRITE INTO TABLE nameOfTable

OVERWRITE here is self-explanatory, but if you don’t specify it Hive will just append the data.

If you don’t write LOCAL, Hive assumes the file is already in HDFS and will look for it there.

If we have data that needs to go into a particular partition:

LOAD DATA LOCAL INPATH 'path/goes/here/data.dat' OVERWRITE INTO TABLE nameOfTable PARTITION (partitionColumn = 'valueOfPartition')

With both examples, the file will be loaded into the HDFS area which Hive can control.

DML — Moving data out?

INSERT OVERWRITE LOCAL DIRECTORY 'please/put/it/here/foldername' 
SELECT sale_id, product, date
FROM sales
WHERE date='2019-01-01'

If you are moving large datasets you should never append to files, as that cannot be done easily in HDFS. Use OVERWRITE instead to speed things up, but be careful not to overwrite things you want to keep.

DML — Give me the data

SELECT * FROM nameOfTable LIMIT 3

This will pull the first three rows of data in our table. We can be more specific too and ask for particular columns.

SELECT id, name FROM nameOfTable LIMIT 3

I want to be mooooorree specific though. Well you can also use WHERE.

SELECT id, name FROM nameOfTable 
WHERE name = 'acgoff' ;

We can set up a variety of if-esque conditions. The Hive documentation has more info, but in short it has everything you would expect and more, and Stack Overflow will help with any odd thing. LIKE and RLIKE are interesting ones to check out. It also lets you get information out of arrays (search for complex type operators). Oh, and you can also create your own operators (but you have to write them in Java).
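As a taste, a few examples against the table defined earlier (the patterns and the array index are made up):

SELECT id, name FROM nameOfTable WHERE name LIKE 'acg%';
SELECT id, name FROM nameOfTable WHERE name RLIKE '^acg.*';
SELECT id, items[0] FROM nameOfTable WHERE size(items) > 0;

LIKE uses simple % wildcards, RLIKE takes a regular expression, and items[0] pulls the first element out of the ARRAY column.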

You can also group your results using GROUP BY:

SELECT city, count(*) FROM nameOfTable
WHERE name = 'acgoff'
GROUP BY city;

Partitions finally show up here: filter on the partition column (age in our table) and Hive only has to read the matching partitions.

Join your data

SELECT c.ID, c.NAME, c.AGE, o.AMOUNT 
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

In this example the CUSTOMERS table is given the alias c and the ORDERS table the alias o. As quick as that you get multi-table results.

If the IDs don’t join cleanly there is a huge range of things you can do, but that is beyond what I am going into here. If you want to dig deeper, look at Map Joins or Bucket Map Joins.

You can also convert data types. For example, some numbers may come in as STRING but you want them as FLOAT. In the WHERE part of your query simply add WHERE cast(numberString AS FLOAT) = 1.5 (= 1.5 is just an example). If the conversion fails (say the string contains the word one instead of a number) it returns NULL.
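In full, that query might look like the following; numberString is a hypothetical STRING column, not one from the table defined earlier:

SELECT id, name
FROM nameOfTable
WHERE cast(numberString AS FLOAT) = 1.5;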

What is a View?
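In short, a view is a saved query that you can treat like a table when reading data. A minimal sketch using the CUSTOMERS and ORDERS tables from the join above (the view name and the amount threshold are made up):

CREATE VIEW bigOrders AS
SELECT c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID)
WHERE o.AMOUNT > 100;

Once created, SELECT * FROM bigOrders works just like querying any other table.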

Let’s Review

  • Allows you to build databases, tables, partitions and buckets easily
  • Allows you to move data in and out easily
  • Allows you to query the data easily
  • Means you don’t need to know Java to interact with Hadoop!

Focused on saving our time: everyhour.xyz for your life balance; tree-meals-a-day.earth for our food emissions and compairbnb.info/hello for booking airbnbs.