To get an accurate count of the amount of data, when you can't have any less or one more, COUNT(*) is the only way. But there are times when you don't need an exact number, but you need a rough estimate of the table size, for example, to understand that the table is not empty, or to roughly estimate the size of the data to be migrated. There is a faster way than COUNT(*) for such tasks. We can use Hive statistics.

Hive statistics

Hive can better optimize complex or multi-table queries if it has access to statistics about the amount of data and how the values are distributed. This information includes physical characteristics such as number of rows, number of data files, the total size of data files, and file format. For tables with partitions, the numbers are calculated for each partition and aggregated for the whole table.

This metadata is stored in the Hive metastore database and can be easily updated. If any statistic is unavailable, a value of -1 is used as a placeholder. Some numbers, such as the number and total size of data files, are always kept up to date because they can be computed cheaply as part of the HDFS block metadata collection.

There are many ways to use such statistics information:

  • As already been mentioned Hive's cost-based optimizer uses statistics to generate a query plan for complex or multi-table queries;
  • Users can also get data distribution details such as top 10 products sold, the age distribution in the customer's table, etc;
  • Users can quickly get answers to some of their queries by referring only to stored statistics instead of running time-consuming execution plans.

There are two ways Hive table statistics are computed.

  • Autogathering statistics. For newly created tables and/or partitions, automatically computed by default. The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into Hive metastore.

  • Collect statistics for existing tables using ANALYZE command. For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive metastore.

When working with data in S3, ADLS, or WASB, the steps for analyzing tables are the same as when working with data in HDFS.

Fast Hive count

If you are looking for a quick way to get COUNT(*) for a complete table or some of its partitions, just use EXPLAIN directly before the query. HUE users can click on Explain button besides Execute button. It will not perform map-reduce and you can get counts within seconds.

This row count is taken from Hive table statistics and is NOT accurate.

hive > EXPLAIN SELECT COUNT(*) FROM flights;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: flights
            Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              Statistics: Num rows: 100000 Data size: 5100000 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order: 
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.294 seconds, Fetched: 21 row(s)

Look at the explain output and locate Num rows statistics row for the TableScan operator. In the above explain output, table statistics shows the row count for the table is 100000 and table size in bytes is 5100000.

This will not work for queries other than simple COUNT(*) from the table. It may work sometimes if you want to get a record count for certain partitions, but it will only work with partition columns.

In this case, you can add WHERE partition_column1 = value1 AND partition_column2 = value2 and see similar result.

Last updated Mon Jun 07 2021