To get an exact row count — when the answer cannot be off by even one — `COUNT(*)` is the only way. But sometimes you don't need an exact number, only a rough estimate of the table size: for example, to check that a table is not empty, or to roughly estimate the volume of data to be migrated. For such tasks there is a faster way than `COUNT(*)`: Hive statistics.
Hive can better optimize complex or multi-table queries if it has access to statistics about the amount of data and how the values are distributed. This information includes physical characteristics such as number of rows, number of data files, the total size of data files, and file format. For tables with partitions, the numbers are calculated for each partition and aggregated for the whole table.
This metadata is stored in the Hive metastore database and can be easily updated. If a statistic is unavailable, a placeholder value of `-1` is used. Some numbers, such as the number and total size of data files, are always kept up to date because they can be computed cheaply as part of HDFS block metadata collection.
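You can inspect these stored numbers directly with `DESCRIBE FORMATTED` (shown here against the `flights` table used in the examples in this article):

```sql
-- Print table metadata; the stored statistics appear under
-- "Table Parameters" as numFiles, numRows, rawDataSize, totalSize
DESCRIBE FORMATTED flights;
```

If the table has never been analyzed and autogathering is off, `numRows` may show the `-1` placeholder mentioned above.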
This statistics information can be used in several ways:
- As already mentioned, Hive's cost-based optimizer uses statistics to generate query plans for complex or multi-table queries;
- Users can also get data-distribution details, such as the top 10 products sold or the age distribution in the customers table;
- Users can quickly get answers to some of their queries by referring only to stored statistics instead of running time-consuming execution plans.
There are two ways Hive table statistics are computed.
Autogathering statistics. For newly created tables and/or partitions, statistics are computed automatically by default. The user has to explicitly set the boolean variable `hive.stats.autogather` to `false` so that statistics are not automatically computed and stored in the Hive metastore.
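For reference, the variable controlling this behavior is `hive.stats.autogather`, and it can be toggled per session:

```sql
-- Disable automatic statistics gathering for
-- newly created tables/partitions in this session
SET hive.stats.autogather=false;
```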
Collecting statistics for existing tables using the `ANALYZE` command. For existing tables and/or partitions, the user can issue the `ANALYZE` command to gather statistics and write them into the Hive metastore.
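A minimal sketch of the `ANALYZE` command, using the `flights` table from the example below; the `year` partition column is a hypothetical assumption for illustration:

```sql
-- Gather table-level statistics (row count, data size)
ANALYZE TABLE flights COMPUTE STATISTICS;

-- Gather statistics for a single partition
-- (assumes flights is partitioned by a year column)
ANALYZE TABLE flights PARTITION (year=2009) COMPUTE STATISTICS;

-- Optionally gather per-column statistics as well
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```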
When working with data in S3, ADLS, or WASB, the steps for analyzing tables are the same as when working with data in HDFS.
Fast Hive count
If you are looking for a quick way to get `COUNT(*)` for a complete table or some of its partitions, just put `EXPLAIN` directly before the query. HUE users can click the Explain button next to the Execute button. No map-reduce job is launched, so you get the count within seconds.
Note that this row count is taken from Hive table statistics and is NOT guaranteed to be accurate.
```
hive> EXPLAIN SELECT COUNT(*) FROM flights;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: flights
            Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.294 seconds, Fetched: 21 row(s)
```
Look at the explain output and locate the Num rows statistics for the `TableScan` operator. In the explain output above, the table statistics show a row count of 100000 and a table size of 5100000 bytes.
This will not work for queries other than a simple `COUNT(*)` over the table. It may work if you want a record count for certain partitions, but only when filtering on partition columns. In that case, you can append `WHERE partition_column1 = value1 AND partition_column2 = value2` and see a similar result.
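For instance, assuming `flights` were partitioned by a hypothetical `year` column, the partition-level row count could be read from the plan the same way:

```sql
-- Row count for a single partition, taken from statistics;
-- year is a hypothetical partition column for illustration
EXPLAIN SELECT COUNT(*) FROM flights WHERE year = 2009;
```

The `Statistics: Num rows:` line of the `TableScan` operator will then reflect only the selected partition.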