hive

Topics related to hive:

Getting started with hive

  • Hive is a data warehouse tool built on top of Hadoop.
  • It provides an SQL-like language to query data.
  • We can run almost all the SQL queries in Hive, the only difference, is that, it runs a map-reduce job at the backend to fetch result from Hadoop Cluster. Because of this Hive sometimes take more time to fetch the result-set.

Insert Statement

insert overwrite
An insert overwrite statement deletes any existing files in the target table or partition before adding new files based off of the select statement used. Note that when there are structure changes to a table or to the DML used to load the table that sometimes the old files are not deleted. When loading to a table using dynamic partitioning only partitions defined by the select statement will be overwritten. Any preexisting partitions in the target will remain and will not be deleted.

insert into
An insert into statement appends new data into a target table based off of the select statement used.

Create Database and Table Statement

When working with tables and databases in HIVE. Below points can be usefull.

  • We can switch database using use database; command
  • To know the current working database we can get using SELECT current_database()
  • To see the DDL used for create table statement we can use SHOW CREATE TABLE tablename
  • To see all columns of table use DESCRIBE tablename to show extended details like location serde used and others DESCRIBE FORMATTED tablename. DESCRIBE can also be abbrevated as DESC.

SELECT Statement

File formats in HIVE

Hive User Defined Functions (UDF's)

Table Creation Script with sample data

User Defined Aggregate Functions (UDAF)

Indexing

User Defined Table Functions (UDTF's)

Export Data in Hive

Hive Table Creation Through Sqoop

Importing data directly from RDBMS to HIVE can solve lots of time. Also we can run a freeform query(a join or some simple query) and populate it in a table of our choice directly into Hive.

--hive-import tells Sqoop that the final destination is Hive and not HDFS.

--hive-table option helps in importing the data to the table in hive chosen by us, otherwise it will be named as the source table being imported from RDBMS.