Architecture

Components

  • Impala Daemon (impalad): the core Impala component; plans and executes queries against data in HDFS, HBase, and Amazon S3
  • Statestore: checks the health of all Impala daemons in a cluster and continuously relays its findings to each of those daemons
  • Catalog Service: relays metadata changes from Impala SQL statements to all the Impala daemons in a cluster

Performance Tuning

Best Practices

  • Choose the appropriate file format for the data
  • Avoid data ingestion processes that produce many small files
  • Choose partitioning granularity based on actual data volume
  • Use smallest appropriate integer types for partition key columns
  • Choose an appropriate Parquet block size; 256 MB is the default Parquet block size in recent Impala releases
  • Gather statistics for all tables used in performance-critical or high-volume join queries
  • Minimize the overhead of transmitting results back to the client
  • Verify that your queries are planned in an efficient logical manner
  • Verify performance characteristics of queries
  • Use appropriate operating system settings
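
A minimal sketch combining several of these practices; the table and column names are hypothetical:

-- Partition on columns used in filters, at a granularity that yields large files;
-- use the smallest integer types that fit the partition key values.
CREATE TABLE sales (
  id BIGINT,
  amount DECIMAL(12,2)
)
PARTITIONED BY (year SMALLINT, month TINYINT)
STORED AS PARQUET;

-- After loading data, gather statistics so joins are planned efficiently.
COMPUTE STATS sales;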

Join Tuning

Join order example

  • Assume 4 tables: BIG, MEDIUM, SMALL, and TINY
  • Recommended join order: BIG, TINY, SMALL, MEDIUM (largest table first, then the remaining tables from smallest to largest; see the sketch below)
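
A hedged sketch of that ordering; the tables and join keys are hypothetical:

-- Largest table first, then the remaining tables from smallest to largest,
-- so each intermediate result stays as small as possible.
SELECT COUNT(*)
FROM big
JOIN tiny ON big.id = tiny.id
JOIN small ON big.id = small.id
JOIN medium ON big.id = medium.id;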

Distributed join types

  • BROADCAST: the smaller table is broadcast to all nodes containing the larger table. Effective when one table is significantly smaller and can fit in memory on every node.
  • PARTITIONED: both tables are partitioned on the join key, so rows with the same key values end up on the same node. Used when joining two large tables.
  • SHUFFLE: performed when there is no clear advantage to broadcasting or partitioning, or when it is hard to determine which table should be broadcast. Rows are redistributed across nodes based on the join key.
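
The join strategy can be forced per join with hints; a minimal sketch with hypothetical tables:

-- Replicate the smaller table to every node holding part of the larger one.
SELECT COUNT(*) FROM big JOIN [BROADCAST] tiny ON big.id = tiny.id;

-- Hash-partition both inputs on the join key instead.
SELECT COUNT(*) FROM big JOIN [SHUFFLE] medium ON big.id = medium.id;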

STRAIGHT_JOIN

With STRAIGHT_JOIN, Impala joins the tables in the order they appear in the query instead of reordering them based on statistics.

select straight_join x from medium
  join small
  join (select * from big where c1 < 10) as big
where medium.id = small.id and small.id = big.id;

Statistics

Table and Column Statistics

SHOW TABLE STATS table_name;
SHOW COLUMN STATS table_name;
SHOW PARTITIONS table_name;

-- Hive equivalent
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS;

-- Impala: compute table and column statistics in one step
COMPUTE STATS table_name;

Limits and tuning for COMPUTE STATS (see the sketch after this list)

  • Limit the number of columns for which statistics are computed
  • Use more threads via the MT_DOP query option
  • If roughly 30% of the rows have changed, recompute the statistics
  • COMPUTE STATS test_table TABLESAMPLE SYSTEM(10) scans only a 10% sample
  • COMPUTE INCREMENTAL STATS computes statistics per partition
  • COMPUTE INCREMENTAL STATS takes more time than COMPUTE STATS for the same volume of data
  • DROP INCREMENTAL STATS removes incremental statistics
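
A few hedged examples of these options; test_table and its columns are hypothetical:

-- Restrict statistics to the columns actually used in joins and filters (Impala 2.12+).
COMPUTE STATS test_table (id, category);

-- Scan only a 10% sample of the data.
COMPUTE STATS test_table TABLESAMPLE SYSTEM(10);

-- Compute statistics only for partitions that do not have them yet.
COMPUTE INCREMENTAL STATS test_table;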

Runtime Filtering

  • Build phase: collects the join-key values from the smaller (build-side) input of the join, then constructs a bloom filter or hash table containing those values.
  • Probe phase: the filter is sent to the nodes scanning the larger (probe-side) input, where rows whose join keys cannot match are filtered out, ensuring that only relevant rows are processed by the join.
  • Dynamic nature: runtime filters are constructed and applied during query execution.
  • Most effective with Parquet files
  • Wait interval before a scan proceeds without its filter: RUNTIME_FILTER_WAIT_TIME_MS

other options (see the example after this list)

  • RUNTIME_FILTER_MODE
  • MAX_NUM_RUNTIME_FILTERS
  • DISABLE_ROW_RUNTIME_FILTERING
  • RUNTIME_FILTER_MAX_SIZE
  • RUNTIME_FILTER_MIN_SIZE
  • RUNTIME_BLOOM_FILTER_SIZE
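
These are query options, set per session; a minimal sketch:

SET RUNTIME_FILTER_MODE=GLOBAL;
SET RUNTIME_FILTER_WAIT_TIME_MS=10000;
SET MAX_NUM_RUNTIME_FILTERS=10;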

Cache

HDFS cache

hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000

impala SQL

-- Cache the entire table (all partitions).
alter table census set cached in 'pool_name';

-- Remove the entire table from the cache.
alter table census set uncached;

-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
alter table census partition (year=1960) set cached in 'pool_name';

-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
alter table census partition (year=1970)
  set cached in 'pool_name' with replication = 4;

-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
show table stats census;
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
| 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| Total | -1    | 4      | 44B  | 11B          |        |
+-------+-------+--------+------+--------------+--------+

Detecting and Correcting HDFS Block Skew Conditions

  • Use SUMMARY, PROFILE, and the Impala debug web UI
  • Look for large gaps between the Avg Time and Max Time values in the summary output
  • Adjust PARQUET_FILE_SIZE downward to reduce the block size when writing new data files
  • Consider COMPRESSION_CODEC=NONE when writing the files
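
A hedged sketch of rewriting skewed data with smaller blocks; the table names are hypothetical:

-- Write new data files with smaller Parquet blocks (optionally uncompressed)
-- so scan work spreads more evenly across hosts.
SET PARQUET_FILE_SIZE=128m;
SET COMPRESSION_CODEC=NONE;
INSERT OVERWRITE TABLE sales_rebalanced SELECT * FROM sales;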

Data Cache for Remote Reads
Used when the Impala compute nodes and the storage are not co-located.

--data_cache=/data/0,/data/1:500GB

LRU (Least Recently Used; the default)
LIRS (Low Inter-reference Recency Set)
--data_cache_eviction_policy=policy

Troubleshooting

Fault Tolerance

  • RETRY_FAILED_QUERIES: transparently retry queries that fail because an executor node goes down (see the example below)
  • Node Blacklisting: coordinators temporarily stop scheduling fragments on unhealthy executors
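
RETRY_FAILED_QUERIES is a query option (Impala 4.0 and higher); a minimal example:

SET RETRY_FAILED_QUERIES=true;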

test I/O

# Drop the OS page cache so the read test measures disk, not memory.
sudo sysctl -w vm.drop_caches=3 vm.drop_caches=0
vm.drop_caches = 3
vm.drop_caches = 0

# Read 1 GB sequentially from the device to gauge raw throughput.
sudo dd if=/dev/sda bs=1M of=/dev/null count=1k

Symptoms, explanations, and recommendations:

Symptom: Impala takes a long time to start
Explanation: Impala instances with large numbers of tables, partitions, or data files take longer to start because the metadata for these objects is broadcast to all impalad nodes and cached.
Recommendation: Adjust timeout and synchronicity settings.

Symptom: Joins fail to complete
Explanation: There may be insufficient memory. During a join, data from the second, third, and subsequent sets to be joined is loaded into memory. If Impala chooses an inefficient join order or join mechanism, the query can exceed the total memory available.
Recommendation: Start by gathering statistics with the COMPUTE STATS statement for each table involved in the join. Consider specifying the [SHUFFLE] hint so that data from the joined tables is split up between nodes rather than broadcast to each node. If tuning at the SQL level is not sufficient, add more memory to your system or join smaller data sets.

Symptom: Queries return incorrect results
Explanation: Impala metadata may be outdated after changes are performed in Hive.
Recommendation: Where possible, use the appropriate Impala statement (INSERT, LOAD DATA, CREATE TABLE, ALTER TABLE, COMPUTE STATS, and so on) rather than switching back and forth between Impala and Hive. Impala automatically broadcasts the results of DDL and DML operations to all Impala nodes in the cluster, but does not automatically recognize when such changes are made through Hive. After inserting data, adding a partition, or performing another such operation in Hive, refresh the metadata in Impala (see the sketch below).

Symptom: Queries are slow to return results
Explanation: Some impalad instances may not have started. Using a browser, connect to the host running the Impala statestore.
Recommendation: Ensure Impala is installed on all DataNodes. Start any impalad instances that are not running.
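
A minimal sketch of making Impala pick up changes performed through Hive; table_name is a placeholder:

-- Reload the file and block metadata for one table changed in Hive.
REFRESH table_name;

-- Or discard and reload all cached metadata for that table.
INVALIDATE METADATA table_name;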

Breakpad Minidumps

Dedicated Coordinators

Deploy one coordinator for every 50 executors, and keep coordinator resource utilization, including CPU, the number of threads, the number of connections, and RPCs, under 80%.
Consider the following factors to determine the right number of coordinators in your cluster:

  • What is the number of concurrent queries?
  • What percentage of the workload is DDL?
  • What is the average query resource usage at the various stages (merge, runtime filter, result set size, etc.)?
  • How many Impala Daemons (impalad) are in the cluster?
  • Is there a high availability requirement?
  • Compute/storage capacity reduction factor

WEB UI

impalad, http://IP:25000

  • main page: version, hardware info, OS, process, cgroup
  • admission
  • backend
  • catalog: all databases and all tables
  • hadoop-varz
  • jmx
  • log_level
  • logs
  • memz
  • metrics
  • profile_docs
  • queries
  • rpcz
  • sessions
  • threadz
  • varz

statestored, http://IP:25010/

  • main page: version, hardware info, OS, process, cgroup
  • log_level
  • logs
  • memz
  • metrics
  • profile_docs
  • rpcz
  • subscribers
  • threadz
  • topics
  • varz

catalogd, http://IP:25020/

  • main page: version, hardware info, OS, process, cgroup
  • catalog: all databases and all tables
  • jmx
  • log_level
  • logs
  • memz
  • metrics
  • operations
  • profile_docs
  • rpcz
  • threadz
  • varz

Administration

configuration

Setting Timeout

  • Increasing the Statestore Timeout
  • Setting the Idle Query and Idle Session Timeouts for impalad
  • Setting Timeout and Retries for Thrift Connections to the Backend Client
  • Cancelling a Query
  • Setting a Time Limit on Query Execution
  • Interactively Cancelling a Query
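
Some timeouts can also be applied per session through query options; a minimal sketch:

-- Cancel queries that execute longer than one hour (Impala 2.12+).
SET EXEC_TIME_LIMIT_S=3600;

-- Cancel queries that sit idle for more than 10 minutes.
SET QUERY_TIMEOUT_S=600;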

Proxy

  • Choosing the Load-Balancing Algorithm
  • Using Kerberos
  • TLS/SSL Enabled

shell

impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com

Deployment Limits

  • 150 nodes in Impala 2.12 and higher
  • Number of Impalad Coordinators: 1 coordinator for at most every 50 executors
  • Number of roles: 10,000 for Ranger
  • Maximum number of columns in a query, included in a SELECT list, INSERT, and in an expression: no limit
  • Number of tables referenced: no limit
  • Number of plan nodes: no limit
  • Number of plan fragments: no limit
  • Depth of expression tree: 1000 hard limit
  • Width of expression tree: 10,000 hard limit
  • Codegen: very deeply nested expressions within queries can exceed internal Impala limits

Spilling Data

  • to Disk
  • to HDFS
  • to Ozone

example

--scratch_dirs="/dir1:200GB:0, /dir2:1024GB:1, /dir3:1024GB:1, /dir4:1024GB:1"
--disk_spill_compression_codec=LZ4
--disk_spill_punch_holes=true

--scratch_dirs=hdfs://hdfsnn/tmp:300G,/local_buffer_dir:512M,/local_dir:10G

Metadata Management

  • Metadata on-demand mode
  • Mixed mode
  • Automatic Invalidation of Metadata Cache
  • Automatic Invalidation/Refresh of Metadata
  • Configure HMS for Event Based Automatic Metadata Sync

Security

Data and Log Files
Hive Metastore Database
Impala Web User Interface
TLS/SSL for Impala
Apache Ranger

Authorization: Privilege Model

Server
    URI
    Database
        Table
            Column
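
A minimal sketch of granting a privilege at one level of this hierarchy; the role and table names are hypothetical:

GRANT SELECT ON TABLE sales_db.sales TO ROLE analyst_role;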

Authentication

  • Kerberos Authentication
  • LDAP Authentication
  • Multiple Authentication Methods
  • Impala Delegation for Clients

Format of the Audit Log Files

  • Client session state:
    • Session ID
    • User name
    • Network address of the client connection
  • SQL statement details:
    • Query ID
    • Statement Type - DML, DDL, and so on
    • SQL statement text
    • Execution start time, in local time
    • Execution Status - Details on any errors that were encountered
    • Target Catalog Objects:
      • Object Type - Table, View, or Database
      • Fully qualified object name
      • Privilege

Lineage Information

  • Column Lineage
  • Lineage Data for Impala
  • -lineage_event_log_dir for impalad

Admission Control

Parameters, their meaning, and where they are configured:

  • max memory: the cluster-wide maximum memory capacity of the resource pool; fair-scheduler.xml, <queue> -> <maxResources>
  • maximum query memory: the maximum memory for a query on a single node; impala.admission-control.max-query-mem-limit.root.default.regularPool
  • minimum query memory: the minimum memory for a query on a single node; impala.admission-control.min-query-mem-limit.root.default.regularPool
  • max running queries: the maximum number of concurrent queries allowed in the resource pool; llama.am.throttling.maximum.placed.reservations.#QUEUE#
  • max queued queries: the maximum length of the queue; queries beyond "max running queries" are placed in the queue; llama.am.throttling.maximum.queued.reservations.#QUEUE#
  • queue timeout: the maximum time a query waits in the queue; impala.admission-control.pool-queue-timeout-ms
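
A session can direct its queries to a specific pool with the REQUEST_POOL query option; a minimal sketch using the dev pool defined below:

SET REQUEST_POOL='root.dev';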

fair-scheduler.xml

<allocations>
    <queue name="root">
        <aclSubmitApps> </aclSubmitApps>
        <queue name="default">
            <!-- Equivalent to the "max memory" parameter: the cluster-wide maximum memory for this pool -->
            <maxResources>40000 mb, 0 vcores, 0 disks</maxResources>
            <aclSubmitApps>*</aclSubmitApps>
        </queue>
        <queue name="dev">
            <!-- Cluster-wide maximum memory for this pool: 80 GB -->
            <!-- The user/group mapping is also specified -->
            <maxResources>80000 mb, 0 vcores, 0 disks</maxResources>
            <aclSubmitApps>user1,user2 dev,tech,admin</aclSubmitApps>
        </queue>
    </queue>
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="default" />
    </queuePlacementPolicy>
</allocations>

llama-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>

<!--  Configuration of the first queue -->
<property>
    <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
    <value>100</value>
</property>
<property>
    <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
    <value>100</value>
</property>
<property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
    <value>5000</value>
</property>
<property>
    <name>impala.admission-control.pool-default-query-options.root.default</name>
    <value>mem_limit=2000m,query_timeout_s=20</value>
</property>

<!-- The configuration of another queue -->
<property>
    <name>llama.am.throttling.maximum.placed.reservations.root.dev</name>
    <value>50</value>
</property>
<property>
    <name>llama.am.throttling.maximum.queued.reservations.root.dev</name>
    <value>50</value>
</property>

<!-- Set the timeout period of the queue -->
<property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.dev</name>
    <value>5000</value>
</property>
<property>
    <name>impala.admission-control.pool-default-query-options.root.dev</name>
    <value>mem_limit=4000m, query_timeout_s=40</value>
</property>
</configuration>

SQL

Statements and Data Types

  • ARRAY Complex Type
  • BIGINT Data Type
  • BOOLEAN Data Type
  • CHAR Data Type
  • DATE Data Type
  • DECIMAL Data Type
  • DOUBLE Data Type
  • FLOAT Data Type
  • INT Data Type
  • MAP Complex Type
  • REAL Data Type
  • SMALLINT Data Type
  • STRING Data Type
  • STRUCT Complex Type
  • TIMESTAMP Data Type
  • TINYINT Data Type
  • VARCHAR Data Type

DDL Statements

  • ALTER TABLE
  • ALTER VIEW
  • COMPUTE STATS
  • CREATE DATABASE
  • CREATE FUNCTION
  • CREATE ROLE
  • CREATE TABLE
  • CREATE VIEW
  • DROP DATABASE
  • DROP FUNCTION
  • DROP ROLE
  • DROP TABLE
  • DROP VIEW
  • GRANT
  • REVOKE
  • TRUNCATE TABLE
  • SHUTDOWN

DML and Other Statements

  • DELETE
  • INSERT
  • LOAD DATA
  • UPDATE
  • UPSERT
  • SHOW
  • SELECT
  • REFRESH
  • REFRESH AUTHORIZATION
  • REFRESH FUNCTIONS

complex or nested types

-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
-- complex type (ARRAY, STRUCT, or MAP).
CREATE TABLE array_t
(
  id BIGINT,
  a1 ARRAY <STRING>,
  a2 ARRAY <BIGINT>,
  a3 ARRAY <TIMESTAMP>,
  a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
)
STORED AS PARQUET;

-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
-- a scalar type representing the key, and a scalar or complex type representing the value.
CREATE TABLE map_t
(
  id BIGINT,
  m1 MAP <STRING, STRING>,
  m2 MAP <STRING, BIGINT>,
  m3 MAP <BIGINT, STRING>,
  m4 MAP <BIGINT, BIGINT>,
  m5 MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;

-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
-- an ARRAY or a MAP rather than as a top-level column.
CREATE TABLE struct_t
(
  id BIGINT,
  s1 STRUCT <f1: STRING, f2: BIGINT>,
  s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
  s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
)
STORED AS PARQUET;

desc complex type

desc array_t;
Query: describe array_t
+------+------------------+---------+
| name | type             | comment |
+------+------------------+---------+
| id   | bigint           |         |
| a1   | array<string>    |         |
| a2   | array<bigint>    |         |
| a3   | array<timestamp> |         |
| a4   | array<struct<    |         |
|      |   f1:string,     |         |
|      |   f2:int,        |         |
|      |   f3:boolean     |         |
|      | >>               |         |
+------+------------------+---------+

desc map_t;
Query: describe map_t
+------+---------------------------+---------+
| name | type                      | comment |
+------+---------------------------+---------+
| id   | bigint                    |         |
| m1   | map<string,string>        |         |
| m2   | map<string,bigint>        |         |
| m3   | map<bigint,string>        |         |
| m4   | map<bigint,bigint>        |         |
| m5   | map<string,array<string>> |         |
+------+---------------------------+---------+

desc struct_t;
Query: describe struct_t
+------+----------------------+---------+
| name | type                 | comment |
+------+----------------------+---------+
| id   | bigint               |         |
| s1   | struct<              |         |
|      |   f1:string,         |         |
|      |   f2:bigint          |         |
|      | >                    |         |
| s2   | array<struct<        |         |
|      |   f1:int,            |         |
|      |   f2:timestamp       |         |
|      | >>                   |         |
| s3   | map<bigint,struct<   |         |
|      |   name:string,       |         |
|      |   birthday:timestamp |         |
|      | >>                   |         |
+------+----------------------+---------+

query complex type:

SELECT s1.f1, s1.f2, s2.f1, s2.f2, s3.key, 
    s3.value.name, s3.value.birthday     
FROM struct_t, struct_t.s2, struct_t.s3;

Functions

Built-In Functions

  • Mathematical Functions
  • Type Conversion Functions
  • Date and Time Functions
  • Conditional Functions
  • String Functions
  • Window Functions
  • Analytic Functions
  • Bit Functions
  • Miscellaneous Functions

User-Defined Functions (UDFs)

create function my_lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';

select my_lower('Some String NOT ALREADY LOWERCASE');
+----------------------------------------------------+
| udfs.my_lower('some string not already lowercase') |
+----------------------------------------------------+
| some string not already lowercase                  |
+----------------------------------------------------+

c++ code (udf-sample.h)

#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H

#include <impala_udf/udf.h>

using namespace impala_udf;

IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);

#endif

c++ code (implementation)

#include "udf-sample.h"

// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}

// Multiple UDFs can be defined in the same file

Another C++ example; build and deploy the shared library:

$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so

sql:

create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.so' symbol='HasVowels';

select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true                   |
+------------------------+

UDAF

create aggregate function sum_of_squares(bigint) returns bigint
  location '/user/hive/udfs/libudasample.so'
  init_fn='SumOfSquaresInit'
  update_fn='SumOfSquaresUpdate'
  merge_fn='SumOfSquaresMerge'
  finalize_fn='SumOfSquaresFinalize';

-- Compute by hand the value the UDA would produce for inputs 1 through 4:
select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30                            |
+-------------------------------+
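
A hedged usage sketch; the table t and column x are hypothetical:

-- Equivalent to the literal sum above if t.x holds the values 1 through 4.
select sum_of_squares(x) from t;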

Compatibility with Hive

The current release of Impala does not support the following SQL features that you might be familiar with from HiveQL

  • Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
  • The DATE data type.
  • XML functions.
  • Certain aggregate functions from HiveQL: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Impala Aggregate Functions and analytic functions listed in Impala Analytic Functions.
  • Sampling
  • Lateral views

Impala does not currently support these HiveQL statements

  • ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
  • DESCRIBE COLUMN
  • DESCRIBE DATABASE
  • EXPORT TABLE
  • IMPORT TABLE
  • SHOW TABLE EXTENDED
  • SHOW TBLPROPERTIES
  • SHOW INDEXES
  • SHOW COLUMNS
  • INSERT OVERWRITE DIRECTORY

Explain

explain

explain select count(*) from hudi01;
Query: explain select count(*) from hudi01
+------------------------------------------------------------+
| Explain String                                             |
+------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=8.00KB Threads=2 |
| Per-Host Resource Estimates: Memory=10MB                   |
| Codegen disabled by planner                                |
|                                                            |
| PLAN-ROOT SINK                                             |
| |                                                          |
| 01:AGGREGATE [FINALIZE]                                    |
| |  output: sum_init_zero(hudi_test.hudi01.stats: num_rows) |
| |  row-size=8B cardinality=1                               |
| |                                                          |
| 00:SCAN HDFS [hudi_test.hudi01]                            |
|    HDFS partitions=1/1 files=5 size=2.24KB                 |
|    row-size=8B cardinality=5                               |
+------------------------------------------------------------+

summary

summary;
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail           |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+
| F00:ROOT     | 1      | 1     | 0ns      | 0ns      |       |            | 0 B      | 0 B           |                  |
| 01:AGGREGATE | 1      | 1     | 1.00ms   | 1.00ms   | 1     | 1          | 20.00 KB | 16.00 KB      | FINALIZE         |
| 00:SCAN HDFS | 1      | 1     | 29.00ms  | 29.00ms  | 5     | 5          | 28.00 KB | 1.00 MB       | hudi_test.hudi01 |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+------------------+

show table stats

SHOW TABLE STATS hudi01;
Query: SHOW TABLE STATS hudi01
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                    | EC Policy |
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+
| 32    | 5      | 2.24KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://nameservice1/apps/hive/warehouse/hudi_test.db/hudi01 | NONE      |
+-------+--------+--------+--------------+-------------------+---------+-------------------+-------------------------------------------------------------+-----------+

show column stats

SHOW COLUMN STATS hudi01;
Query: SHOW COLUMN STATS hudi01
+--------+--------+------------------+--------+----------+----------+--------+---------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses |
+--------+--------+------------------+--------+----------+----------+--------+---------+
| id     | INT    | -1               | -1     | 4        | 4.0      | -1     | -1      |
| name   | STRING | -1               | -1     | -1       | -1.0     | -1     | -1      |
+--------+--------+------------------+--------+----------+----------+--------+---------+

show partitions

SHOW PARTITIONS sales_data;
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+
| year  | month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | EC Policy |
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+
| Total |       | -1    | 0      | 0B   | 0B           |                   |        |                   |          |           |
+-------+-------+-------+--------+------+--------------+-------------------+--------+-------------------+----------+-----------+

File Formats

File types, formats, compression codecs, and Impala CREATE/INSERT support:

  • Parquet: structured. Codecs: Snappy, gzip, zstd, lz4 (currently Snappy by default). CREATE: yes. INSERT: yes (CREATE TABLE, INSERT, LOAD DATA, and query).
  • ORC: structured. Codecs: gzip, Snappy, LZO, LZ4 (currently gzip by default). CREATE: yes, in Impala 2.12.0 and higher; ORC reads are enabled by default in Impala 3.4.0 and higher. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • Text: unstructured. Codecs: bzip2, deflate, gzip, LZO, Snappy, zstd. CREATE: yes; the default for CREATE TABLE with no STORED AS clause. INSERT: if LZO compression is used, you must create the table and load data in Hive.
  • Avro: structured. Codecs: Snappy, gzip, deflate. CREATE: yes, in Impala 1.4.0 and higher; in lower versions, create the table using Hive. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • Hudi: structured. Codecs: Snappy, gzip, zstd, lz4 (currently Snappy by default). CREATE: yes; support for Read Optimized Queries is experimental. INSERT: no; create an external table in Impala and set the table location to the Hudi table directory, or create the Hudi table in Hive.
  • RCFile: structured. Codecs: Snappy, gzip, deflate, bzip2. CREATE: yes. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.
  • SequenceFile: structured. Codecs: Snappy, gzip, deflate, bzip2. CREATE: yes. INSERT: no; import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

Supported tables and storage

  • Kudu Tables
  • HBase Tables
  • Iceberg Tables
  • S3 Tables
  • Azure Data Lake Store
  • Isilon Storage
  • Ozone Storage

For Iceberg

CREATE EXTERNAL TABLE ice_hadoop_cat
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/path/to/catalog',
              'iceberg.table_identifier'='namespace.table');
