Architecture

Components

  • Impala Daemon (impalad): the core Impala component; plans and executes queries against data in HDFS, HBase, and Amazon S3
  • Statestore: checks the health of all Impala daemons in a cluster and continuously relays its findings to each of those daemons
  • Catalog Service: relays metadata changes from Impala SQL statements to all the Impala daemons in a cluster

Performance Tuning

Best Practices

  • Choose the appropriate file format for the data
  • Avoid data ingestion processes that produce many small files
  • Choose partitioning granularity based on actual data volume
  • Use smallest appropriate integer types for partition key columns
  • Choose an appropriate Parquet block size; 256 MB is the default Parquet block size in recent Impala releases
  • Gather statistics for all tables used in performance-critical or high-volume join queries
  • Minimize the overhead of transmitting results back to the client
  • Verify that your queries are planned in an efficient logical manner
  • Verify performance characteristics of queries
  • Use appropriate operating system settings
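
A minimal sketch combining several of these practices; the table and column names are hypothetical:

-- Partition on columns used in filters, at a granularity that yields large files;
-- use the smallest integer types that fit the partition key values.
CREATE TABLE sales (
  id BIGINT,
  amount DECIMAL(12,2)
)
PARTITIONED BY (year SMALLINT, month TINYINT)
STORED AS PARQUET;

-- After loading data, gather statistics so joins are planned efficiently.
COMPUTE STATS sales;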

Join Tuning

Join order example

  • Assume 4 tables: BIG, MEDIUM, SMALL, and TINY
  • Recommended join order: BIG, TINY, SMALL, MEDIUM (largest table first, then the remaining tables from smallest to largest; see the sketch below)
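
A hedged sketch of that ordering; the tables and join keys are hypothetical:

-- Largest table first, then the remaining tables from smallest to largest,
-- so each intermediate result stays as small as possible.
SELECT COUNT(*)
FROM big
JOIN tiny ON big.id = tiny.id
JOIN small ON big.id = small.id
JOIN medium ON big.id = medium.id;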

Distributed join types

  • BROADCAST: the smaller table is broadcast to all nodes containing the larger table. Effective when one table is significantly smaller and can fit in memory on every node.
  • PARTITIONED: both tables are partitioned on the join key, so rows with the same key values end up on the same node. Used when joining two large tables.
  • SHUFFLE: performed when there is no clear advantage to broadcasting or partitioning, or when it is hard to determine which table should be broadcast. Rows are redistributed across nodes based on the join key.
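
The join strategy can be forced per join with hints; a minimal sketch with hypothetical tables:

-- Replicate the smaller table to every node holding part of the larger one.
SELECT COUNT(*) FROM big JOIN [BROADCAST] tiny ON big.id = tiny.id;

-- Hash-partition both inputs on the join key instead.
SELECT COUNT(*) FROM big JOIN [SHUFFLE] medium ON big.id = medium.id;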

STRAIGHT_JOIN

With STRAIGHT_JOIN, Impala joins the tables in the order they appear in the query instead of reordering them based on statistics.

select straight_join x from medium
  join small
  join (select * from big where c1 < 10) as big
where medium.id = small.id and small.id = big.id;

Statistics

Table and Column Statistics

SHOW TABLE STATS table_name;
SHOW COLUMN STATS table_name;
SHOW PARTITIONS table_name;

-- Hive equivalent
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS;

-- Impala: compute table and column statistics in one step
COMPUTE STATS table_name;

Limits and tuning for COMPUTE STATS (see the sketch after this list)

  • Limit the number of columns for which statistics are computed
  • Use more threads via the MT_DOP query option
  • If roughly 30% of the rows have changed, recompute the statistics
  • COMPUTE STATS test_table TABLESAMPLE SYSTEM(10) scans only a 10% sample
  • COMPUTE INCREMENTAL STATS computes statistics per partition
  • COMPUTE INCREMENTAL STATS takes more time than COMPUTE STATS for the same volume of data
  • DROP INCREMENTAL STATS removes incremental statistics
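
A few hedged examples of these options; test_table and its columns are hypothetical:

-- Restrict statistics to the columns actually used in joins and filters (Impala 2.12+).
COMPUTE STATS test_table (id, category);

-- Scan only a 10% sample of the data.
COMPUTE STATS test_table TABLESAMPLE SYSTEM(10);

-- Compute statistics only for partitions that do not have them yet.
COMPUTE INCREMENTAL STATS test_table;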

Runtime Filtering

  • Build phase: collects the join-key values from the smaller (build-side) input of the join, then constructs a bloom filter or hash table containing those values.
  • Probe phase: the filter is sent to the nodes scanning the larger (probe-side) input, where rows whose join keys cannot match are filtered out, ensuring that only relevant rows are processed by the join.
  • Dynamic nature: runtime filters are constructed and applied during query execution.
  • Most effective with Parquet files
  • Wait interval before a scan proceeds without its filter: RUNTIME_FILTER_WAIT_TIME_MS

other options (see the example after this list)

  • RUNTIME_FILTER_MODE
  • MAX_NUM_RUNTIME_FILTERS
  • DISABLE_ROW_RUNTIME_FILTERING
  • RUNTIME_FILTER_MAX_SIZE
  • RUNTIME_FILTER_MIN_SIZE
  • RUNTIME_BLOOM_FILTER_SIZE
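
These are query options, set per session; a minimal sketch:

SET RUNTIME_FILTER_MODE=GLOBAL;
SET RUNTIME_FILTER_WAIT_TIME_MS=10000;
SET MAX_NUM_RUNTIME_FILTERS=10;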

Cache

HDFS cache

hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000

impala SQL

-- Cache the entire table (all partitions).
alter table census set cached in 'pool_name';

-- Remove the entire table from the cache.
alter table census set uncached;

-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
alter table census partition (year=1960) set cached in 'pool_name';

-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
alter table census partition (year=1970)
  set cached in 'pool_name' with replication = 4;

-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
show table stats census;
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
| 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| Total | -1    | 4      | 44B  | 11B          |        |
+-------+-------+--------+------+--------------+--------+

Detecting and Correcting HDFS Block Skew Conditions

  • Use SUMMARY, PROFILE, and the Impala debug web UI
  • Look for large gaps between the Avg Time and Max Time values in the summary output
  • Adjust PARQUET_FILE_SIZE downward to reduce the block size when writing new data files
  • Consider COMPRESSION_CODEC=NONE when writing the files
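
A hedged sketch of rewriting skewed data with smaller blocks; the table names are hypothetical:

-- Write new data files with smaller Parquet blocks (optionally uncompressed)
-- so scan work spreads more evenly across hosts.
SET PARQUET_FILE_SIZE=128m;
SET COMPRESSION_CODEC=NONE;
INSERT OVERWRITE TABLE sales_rebalanced SELECT * FROM sales;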

Data Cache for Remote Reads
Used when the Impala compute nodes and the storage are not co-located.

--data_cache=/data/0,/data/1:500GB

LRU (Least Recently Used; the default)
LIRS (Low Inter-reference Recency Set)
--data_cache_eviction_policy=policy

Troubleshooting

Fault Tolerance

  • RETRY_FAILED_QUERIES: transparently retry queries that fail because an executor node goes down (see the example below)
  • Node Blacklisting: coordinators temporarily stop scheduling fragments on unhealthy executors
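
RETRY_FAILED_QUERIES is a query option (Impala 4.0 and higher); a minimal example:

SET RETRY_FAILED_QUERIES=true;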

test I/O

# Drop the OS page cache so the read test measures disk, not memory.
sudo sysctl -w vm.drop_caches=3 vm.drop_caches=0
vm.drop_caches = 3
vm.drop_caches = 0

# Read 1 GB sequentially from the device to gauge raw throughput.
sudo dd if=/dev/sda bs=1M of=/dev/null count=1k

Symptoms, explanations, and recommendations:

Symptom: Impala takes a long time to start
Explanation: Impala instances with large numbers of tables, partitions, or data files take longer to start because the metadata for these objects is broadcast to all impalad nodes and cached.
Recommendation: Adjust timeout and synchronicity settings.

Symptom: Joins fail to complete
Explanation: There may be insufficient memory. During a join, data from the second, third, and subsequent sets to be joined is loaded into memory. If Impala chooses an inefficient join order or join mechanism, the query can exceed the total memory available.
Recommendation: Start by gathering statistics with the COMPUTE STATS statement for each table involved in the join. Consider specifying the [SHUFFLE] hint so that data from the joined tables is split up between nodes rather than broadcast to each node. If tuning at the SQL level is not sufficient, add more memory to your system or join smaller data sets.

Symptom: Queries return incorrect results
Explanation: Impala metadata may be outdated after changes are performed in Hive.
Recommendation: Where possible, use the appropriate Impala statement (INSERT, LOAD DATA, CREATE TABLE, ALTER TABLE, COMPUTE STATS, and so on) rather than switching back and forth between Impala and Hive. Impala automatically broadcasts the results of DDL and DML operations to all Impala nodes in the cluster, but does not automatically recognize when such changes are made through Hive. After inserting data, adding a partition, or performing another such operation in Hive, refresh the metadata in Impala (see the sketch below).

Symptom: Queries are slow to return results
Explanation: Some impalad instances may not have started. Using a browser, connect to the host running the Impala statestore.
Recommendation: Ensure Impala is installed on all DataNodes. Start any impalad instances that are not running.
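
A minimal sketch of making Impala pick up changes performed through Hive; table_name is a placeholder:

-- Reload the file and block metadata for one table changed in Hive.
REFRESH table_name;

-- Or discard and reload all cached metadata for that table.
INVALIDATE METADATA table_name;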

Breakpad Minidumps

Dedicated Coordinators

Deploy one coordinator for every 50 executors, and keep coordinator resource utilization, including CPU, the number of threads, the number of connections, and RPCs, under 80%.
Consider the following factors to determine the right number of coordinators in your cluster:

  • What is the number of concurrent queries?
  • What percentage of the workload is DDL?
  • What is the average query resource usage at the various stages (merge, runtime filter, result set size, etc.)?
  • How many Impala Daemons (impalad) are in the cluster?
  • Is there a high availability requirement?
  • Compute/storage capacity reduction factor

WEB UI

impalad, http://IP:25000

  • main page: version, hardware info, OS, process, cgroup
  • admission
  • backend
  • catalog: all databases and all tables
  • hadoop-varz
  • jmx
  • log_level
  • logs
  • memz
  • metrics
  • profile_docs
  • queries
  • rpcz
  • sessions
  • threadz
  • varz

statestored, http://IP:25010/

  • main page: version, hardware info, OS, process, cgroup
  • log_level
  • logs
  • memz
  • metrics
  • profile_docs
  • rpcz
  • subscribers
  • threadz
  • topics
  • varz

catalogd, http://IP:25020/

  • main page: version, hardware info, OS, process, cgroup
  • catalog: all databases and all tables
  • jmx
  • log_level
  • logs
  • memz
  • metrics
  • operations
  • profile_docs
  • rpcz
  • threadz
  • varz

Administration

configuration

Setting Timeout

  • Increasing the Statestore Timeout
  • Setting the Idle Query and Idle Session Timeouts for impalad
  • Setting Timeout and Retries for Thrift Connections to the Backend Client
  • Cancelling a Query
  • Setting a Time Limit on Query Execution
  • Interactively Cancelling a Query
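
Some timeouts can also be applied per session through query options; a minimal sketch:

-- Cancel queries that execute longer than one hour (Impala 2.12+).
SET EXEC_TIME_LIMIT_S=3600;

-- Cancel queries that sit idle for more than 10 minutes.
SET QUERY_TIMEOUT_S=600;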

Proxy

  • Choosing the Load-Balancing Algorithm
  • Using Kerberos
  • TLS/SSL Enabled

shell

impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com

Deployment Limits

  • 150 nodes in Impala 2.12 and higher
  • Number of Impalad Coordinators: 1 coordinator for at most every 50 executors
  • Number of roles: 10,000 for Ranger
  • Maximum number of columns in a query, included in a SELECT list, INSERT, and in an expression: no limit
  • Number of tables referenced: no limit
  • Number of plan nodes: no limit
  • Number of plan fragments: no limit
  • Depth of expression tree: 1000 hard limit
  • Width of expression tree: 10,000 hard limit
  • Codegen: very deeply nested expressions within queries can exceed internal Impala limits

Spilling Data

  • to Disk
  • to HDFS
  • to Ozone

example

--scratch_dirs="/dir1:200GB:0, /dir2:1024GB:1, /dir3:1024GB:1, /dir4:1024GB:1"
--disk_spill_compression_codec=LZ4
--disk_spill_punch_holes=true

--scratch_dirs=hdfs://hdfsnn/tmp:300G,/local_buffer_dir:512M,/local_dir:10G

Metadata Management

  • Metadata on-demand mode
  • Mixed mode
  • Automatic Invalidation of Metadata Cache
  • Automatic Invalidation/Refresh of Metadata
  • Configure HMS for Event Based Automatic Metadata Sync

Security

Data and Log Files
Hive Metastore Database
Impala Web User Interface
TLS/SSL for Impala
Apache Ranger

Authorization: Privilege Model

Server
    URI
    Database
        Table
            Column
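
A minimal sketch of granting a privilege at one level of this hierarchy; the role and table names are hypothetical:

GRANT SELECT ON TABLE sales_db.sales TO ROLE analyst_role;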

Authentication

  • Kerberos Authentication
  • LDAP Authentication
  • Multiple Authentication Methods
  • Impala Delegation for Clients

Format of the Audit Log Files

  • Client session state:
    • Session ID
    • User name
    • Network address of the client connection
  • SQL statement details:
    • Query ID
    • Statement Type - DML, DDL, and so on
    • SQL statement text
    • Execution start time, in local time
    • Execution Status - Details on any errors that were encountered
    • Target Catalog Objects:
      • Object Type - Table, View, or Database
      • Fully qualified object name
      • Privilege

Lineage Information

  • Column Lineage
  • Lineage Data for Impala
  • -lineage_event_log_dir for impalad

Admission Control

Parameters, their meaning, and where they are configured:

  • max memory: the cluster-wide maximum memory capacity of the resource pool; fair-scheduler.xml, <queue> -> <maxResources>
  • maximum query memory: the maximum memory for a query on a single node; impala.admission-control.max-query-mem-limit.root.default.regularPool
  • minimum query memory: the minimum memory for a query on a single node; impala.admission-control.min-query-mem-limit.root.default.regularPool
  • max running queries: the maximum number of concurrent queries allowed in the resource pool; llama.am.throttling.maximum.placed.reservations.#QUEUE#
  • max queued queries: the maximum length of the queue; queries beyond "max running queries" are placed in the queue; llama.am.throttling.maximum.queued.reservations.#QUEUE#
  • queue timeout: the maximum time a query waits in the queue; impala.admission-control.pool-queue-timeout-ms
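
A session can direct its queries to a specific pool with the REQUEST_POOL query option; a minimal sketch using the dev pool defined below:

SET REQUEST_POOL='root.dev';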

fair-scheduler.xml

<allocations>
    <queue name="root">
        <aclSubmitApps> </aclSubmitApps>
        <queue name="default">
            <!-- Equivalent to the "max memory" parameter: the cluster-wide maximum memory for this pool -->
            <maxResources>40000 mb, 0 vcores, 0 disks</maxResources>
            <aclSubmitApps>*</aclSubmitApps>
        </queue>
        <queue name="dev">
            <!-- Cluster-wide maximum memory for this pool: 80 GB -->
            <!-- The user/group mapping is also specified -->
            <maxResources>80000 mb, 0 vcores, 0 disks</maxResources>
            <aclSubmitApps>user1,user2 dev,tech,admin</aclSubmitApps>
        </queue>
    </queue>
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="default" />
    </queuePlacementPolicy>
</allocations>

llama-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>

<!--  Configuration of the first queue -->
<property>
    <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
    <value>100</value>
</property>
<property>
    <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
    <value>100</value>
</property>
<property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
    <value>5000</value>
</property>
<property>
    <name>impala.admission-control.pool-default-query-options.root.default</name>
    <value>mem_limit=2000m,query_timeout_s=20</value>
</property>

<!-- The configuration of another queue -->
<property>
    <name>llama.am.throttling.maximum.placed.reservations.root.dev</name>
    <value>50</value>
</property>
<property>
    <name>llama.am.throttling.maximum.queued.reservations.root.dev</name>
    <value>50</value>
</property>

<!-- Set the timeout period of the queue -->
<property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.dev</name>
    <value>5000</value>
</property>
<property>
    <name>impala.admission-control.pool-default-query-options.root.dev</name>
    <value>mem_limit=4000m, query_timeout_s=40</value>
</property>
</configuration>

SQL

Statements and Data Types

  • ARRAY Complex Type
  • BIGINT Data Type
  • BOOLEAN Data Type
  • CHAR Data Type
  • DATE Data Type
  • DECIMAL Data Type
  • DOUBLE Data Type
  • FLOAT Data Type
  • INT Data Type
  • MAP Complex Type
  • REAL Data Type
  • SMALLINT Data Type
  • STRING Data Type
  • STRUCT Complex Type
  • TIMESTAMP Data Type
  • TINYINT Data Type
  • VARCHAR Data Type

DDL Statements

  • ALTER TABLE
  • ALTER VIEW
  • COMPUTE STATS
  • CREATE DATABASE
  • CREATE FUNCTION
  • CREATE ROLE
  • CREATE TABLE
  • CREATE VIEW
  • DROP DATABASE
  • DROP FUNCTION
  • DROP ROLE
  • DROP TABLE
  • DROP VIEW
  • GRANT
  • REVOKE
  • TRUNCATE TABLE
  • SHUTDOWN

DML and Other Statements

  • DELETE
  • INSERT
  • LOAD DATA
  • UPDATE
  • UPSERT
  • SHOW
  • SELECT
  • REFRESH
  • REFRESH AUTHORIZATION
  • REFRESH FUNCTIONS

complex or nested types

-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
-- complex type (ARRAY, STRUCT, or MAP).
CREATE TABLE array_t
(
  id BIGINT,
  a1 ARRAY <STRING>,
  a2 ARRAY <BIGINT>,
  a3 ARRAY <TIMESTAMP>,
  a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
)
STORED AS PARQUET;

-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
-- a scalar type representing the key, and a scalar or complex type representing the value.
CREATE TABLE map_t
(
  id BIGINT,
  m1 MAP <STRING, STRING>,
  m2 MAP <STRING, BIGINT>,
  m3 MAP <BIGINT, STRING>,
  m4 MAP <BIGINT, BIGINT>,
  m5 MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;

-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
-- an ARRAY or a MAP rather than as a top-level column.
CREATE TABLE struct_t
(
  id BIGINT,
  s1 STRUCT <f1: STRING, f2: BIGINT>,
  s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
  s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
)
STORED AS PARQUET;

desc complex type

desc array_t;
Query: describe array_t
+------+------------------+---------+
| name | type             | comment |
+------+------------------+---------+
| id   | bigint           |         |
| a1   | array<string>    |         |
| a2   | array<bigint>    |         |
| a3   | array<timestamp> |         |
| a4   | array<struct<    |         |
|      |   f1:string,     |         |
|      |   f2:int,        |         |
|      |   f3:boolean     |         |
|      | >>               |         |
+------+------------------+---------+

desc map_t;
Query: describe map_t
+------+---------------------------+---------+
| name | type                      | comment |
+------+---------------------------+---------+
| id   | bigint                    |         |
| m1   | map<string,string>        |         |
| m2   | map<string,bigint>        |         |
| m3   | map<bigint,string>        |         |
| m4   | map<bigint,bigint>        |         |
| m5   | map<string,array<string>> |         |
+------+---------------------------+---------+

desc struct_t;
Query: describe struct_t
+------+----------------------+---------+
| name | type                 | comment |
+------+----------------------+---------+
| id   | bigint               |         |
| s1   | struct<              |         |
|      |   f1:string,         |         |
|      |   f2:bigint          |         |
|      | >                    |         |
| s2   | array<struct<        |         |
|      |   f1:int,            |         |
|      |   f2:timestamp       |         |
|      | >>                   |         |
| s3   | map<bigint,struct<   |         |
|      |   name:string,       |         |
|      |   birthday:timestamp |         |
|      | >>                   |         |
+------+----------------------+---------+

query complex type:

SELECT s1.f1, s1.f2, s2.f1, s2.f2, s3.key, 
    s3.value.name, s3.value.birthday     
FROM struct_t, struct_t.s2, struct_t.s3;

Functions

Built-In Functions

  • Mathematical Functions
  • Type Conversion Functions
  • Date and Time Functions
  • Conditional Functions
  • String Functions
  • Window Functions
  • Analytic Functions
  • Bit Functions
  • Miscellaneous Functions

User-Defined Functions (UDFs)

create function my_lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';

select my_lower('Some String NOT ALREADY LOWERCASE');
+----------------------------------------------------+
| udfs.my_lower('some string not already lowercase') |
+----------------------------------------------------+
| some string not already lowercase                  |
+----------------------------------------------------+

c++ code (udf-sample.h)

#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H

#include <impala_udf/udf.h>

using namespace impala_udf;

IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);

#endif

c++ code (implementation)

#include "udf-sample.h"

// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}

// Multiple UDFs can be defined in the same file

Another C++ example; build and deploy the shared library:

$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so

sql:

create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.so' symbol='HasVowels';

select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true                   |
+------------------------+

UDAF

create aggregate function sum_of_squares(bigint) returns bigint
  location '/user/hive/udfs/libudasample.so'
  init_fn='SumOfSquaresInit'
  update_fn='SumOfSquaresUpdate'
  merge_fn='SumOfSquaresMerge'
  finalize_fn='SumOfSquaresFinalize';

-- Compute by hand the value the UDA would produce for inputs 1 through 4:
select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30                            |
+-------------------------------+
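
A hedged usage sketch; the table t and column x are hypothetical:

-- Equivalent to the literal sum above if t.x holds the values 1 through 4.
select sum_of_squares(x) from t;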

Compatibility with Hive

The current release of Impala does not support the following SQL features that you might be familiar with from HiveQL

  • Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
  • The DATE data type.
  • XML functions.
  • Certain aggregate functions from HiveQL: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Impala Aggregate Functions and analytic functions listed in Impala Analytic Functions.
  • Sampling
  • Lateral views

Impala does not currently support these HiveQL statements

  • ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
  • DESCRIBE COLUMN
  • DESCRIBE DATABASE
  • EXPORT TABLE
  • IMPORT TABLE
  • SHOW TABLE EXTENDED
  • SHOW TBLPROPERTIES
  • SHOW INDEXES
  • SHOW COLUMNS
  • INSERT OVERWRITE DIRECTORY

Explain

explain

explain select count(*) from hudi01;
Query: explain select count(*) from hudi01
+------------------------------------------------------------+
| Explain String                                             |
+------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=8.00KB Threads=2 |
| Per-Host Resource Estimates: Memory=10MB                   |
| Codegen disabled by planner                                |
|                                                            |
| PLAN-ROOT SINK                                             |
| |                                                          |
| 01:AGGREGATE [FINALIZE]                                    |
| |  output: sum_init_zero(hudi_test.hudi01.stats: num_rows) |
| |  row-size=8B cardinality=1                               |
| |                                                          |
| 00:SCAN HDFS [hudi_test.hudi01]                            |
|    HDFS partitions=1/1 files=5 size=2.24KB                 |
|    row-size=8B cardinality=5                               |
+------------------------------------------------------------+

summary

summary;
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail           |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+
| F00:ROOT     | 1      | 1     | 0ns      | 0ns      |       |            | 0 B      | 0 B           |                  |
| 01:AGGREGATE | 1      | 1     | 1.00ms   | 1.00ms   | 1     | 1          | 20.00 KB | 16.00 KB      | FINALIZE         |
| 00:SCAN HDFS | 1      | 1     | 29.00ms  | 29.00ms  | 5     | 5          | 28.00 KB | 1.00 MB       | hudi_test.hudi01 |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+

show table stats

SHOW TABLE STATS hudi01;
Query: SHOW TABLE STATS hudi01
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                    | EC Policy |
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+
| 32    | 5      | 2.24KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://nameservice1/apps/hive/warehouse/hudi_test.db/hudi01 | NONE      |
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+

show column stats

SHOW COLUMN STATS hudi01;
Query: SHOW COLUMN STATS hudi01
+--------+--------+------------------+--------+----------+----------+--------+---------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses |
+--------+--------+------------------+--------+----------+----------+--------+---------+
| id     | INT    | -1               | -1     | 4        | 4.0      | -1     | -1      |
| name   | STRING | -1               | -1     | -1       | -1.0     | -1     | -1      |
+--------+--------+------------------+--------+----------+----------+--------+---------+

show partitions

SHOW PARTITIONS sales_data;
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+
| year  | month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | EC Policy |
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+
| Total |       | -1    | 0      | 0B   | 0B           |                   |        |                   |          |           |
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+

File Formats

File types, formats, compression codecs, and Impala CREATE/INSERT support:

  • Parquet: structured. Codecs: Snappy, gzip, zstd, lz4 (currently Snappy by default). CREATE: yes. INSERT: yes (CREATE TABLE, INSERT, LOAD DATA, and query).
  • ORC: structured. Codecs: gzip, Snappy, LZO, LZ4 (currently gzip by default). CREATE: yes, in Impala 2.12.0 and higher; ORC reads are enabled by default in Impala 3.4.0 and higher. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • Text: unstructured. Codecs: bzip2, deflate, gzip, LZO, Snappy, zstd. CREATE: yes; the default for CREATE TABLE with no STORED AS clause. INSERT: if LZO compression is used, you must create the table and load data in Hive.
  • Avro: structured. Codecs: Snappy, gzip, deflate. CREATE: yes, in Impala 1.4.0 and higher; in lower versions, create the table using Hive. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • Hudi: structured. Codecs: Snappy, gzip, zstd, lz4 (currently Snappy by default). CREATE: yes; support for Read Optimized Queries is experimental. INSERT: no; create an external table in Impala and set the table location to the Hudi table directory, or create the Hudi table in Hive.
  • RCFile: structured. Codecs: Snappy, gzip, deflate, bzip2. CREATE: yes. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • SequenceFile: structured. Codecs: Snappy, gzip, deflate, bzip2. CREATE: yes. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

Supported tables and storage

  • Kudu Tables
  • HBase Tables
  • Iceberg Tables
  • S3 Tables
  • Azure Data Lake Store
  • Isilon Storage
  • Ozone Storage

For Iceberg

CREATE EXTERNAL TABLE ice_hadoop_cat
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/path/to/catalog',
              'iceberg.table_identifier'='namespace.table');
