When you need an exact row count, not one row less or more, COUNT(*) is the only way. But sometimes you don't need the exact number, just a rough estimate of the table size: for example, to check that a table is not empty, or to roughly estimate the volume of data to be migrated. For such tasks there is a faster way than COUNT(*). We can use Hive statistics.
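For comparison, the exact count typically scans the whole table and runs a full distributed job. A minimal example on the flights table used later in this article:

-- exact but slow: normally launches a MapReduce/Tez job over all the data
SELECT COUNT(*) FROM flights;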
Hive statistics
Hive can better optimize complex or multi-table queries if it has access to statistics about the amount of data and how the values are distributed. This information includes physical characteristics such as number of rows, number of data files, the total size of data files, and file format. For tables with partitions, the numbers are calculated for each partition and aggregated for the whole table.
This metadata is stored in the Hive metastore database and can be easily updated. If any statistic is unavailable, a value of -1
is used as a placeholder. Some numbers, such as the number and total size of data files, are always kept up to date because they can be computed cheaply as part of the HDFS block metadata collection.
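If you want to see which statistics are currently stored for a table (the flights table from the example below is used here), DESCRIBE FORMATTED prints them; look for numFiles, numRows, rawDataSize and totalSize in the Table Parameters section:

-- show table metadata, including stored statistics
DESCRIBE FORMATTED flights;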
There are many ways to use such statistics information:
- As already mentioned, Hive's cost-based optimizer uses statistics to generate a query plan for complex or multi-table queries;
- Users can also get data distribution details such as the top 10 products sold, the age distribution in the customers table, etc.;
- Users can quickly get answers to some of their queries by referring only to stored statistics instead of running time-consuming execution plans.
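As a concrete illustration of the last point: when hive.compute.query.using.stats is enabled and the stored statistics are marked accurate, Hive can answer a simple COUNT(*) straight from the metastore without launching a job. A minimal sketch, using the flights table from the example below:

-- answer simple aggregates from metastore statistics (requires accurate stats)
SET hive.compute.query.using.stats=true;
SELECT COUNT(*) FROM flights;  -- no MapReduce job if the statistics can be used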
There are two ways Hive table statistics are computed.
- Autogathering statistics. For newly created tables and/or partitions, statistics are computed automatically by default. The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into the Hive metastore.
- Collecting statistics for existing tables with the ANALYZE command. For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into the Hive metastore. Both options are sketched below.
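A minimal sketch of both options, using the flights table from the example below (the year partition column is hypothetical):

-- disable automatic statistics collection for newly created tables/partitions
SET hive.stats.autogather=false;

-- gather table-level statistics for an existing table
ANALYZE TABLE flights COMPUTE STATISTICS;

-- gather column-level statistics as well
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;

-- for a partitioned table, gather statistics for a single partition
-- (year is a hypothetical partition column)
ANALYZE TABLE flights PARTITION (year=2024) COMPUTE STATISTICS;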
When working with data in S3, ADLS, or WASB, the steps for analyzing tables are the same as when working with data in HDFS.
Fast Hive count
If you are looking for a quick way to get COUNT(*) for a complete table or some of its partitions, just put EXPLAIN directly before the query. HUE users can click the Explain button next to the Execute button. No MapReduce job is launched, so you get the count within seconds.
Keep in mind that this row count is taken from the Hive table statistics and is NOT guaranteed to be accurate.
hive> EXPLAIN SELECT COUNT(*) FROM flights;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: flights
            Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.294 seconds, Fetched: 21 row(s)
Look at the explain output and locate the Num rows statistic for the TableScan operator. In the explain output above, the table statistics show that the row count for the table is 100000 and the table size is 5100000 bytes.
This will not work for queries other than a simple COUNT(*) over the table. It may also work when you want a record count for certain partitions, but only if you filter on partition columns. In that case you can add WHERE partition_column1 = value1 AND partition_column2 = value2 and see a similar result.
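For example, assuming the flights table is partitioned by hypothetical year and month columns:

-- row count estimate for specific partitions, taken from partition statistics
EXPLAIN SELECT COUNT(*) FROM flights WHERE year = 2024 AND month = 1;
-- look for Num rows in the TableScan operator's Statistics line, as above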