Summary
Hive's table partitioning gives the user better query performance, as it avoids costly scans over data that is not relevant to the query. Partitioning is reflected in the data layout in HDFS: each partition maps to a directory.
A very common use case for partitioning is time series data, which can be modeled like this:
Our TimeSeries entity
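The original entity definition is not reproduced here; a minimal sketch in Scala of what it might look like, where everything except the partition columns (and their TinyInt sizing, mentioned in the note below) is assumed:

```scala
// Hypothetical sketch of the TimeSeries entity. Only the partition
// columns (year, month, day) come from the article; the remaining
// column names and types are assumptions for illustration.
case class TimeSeries(
  sensorId: String, // assumed business key
  value: Double,    // assumed measurement
  year: Int,        // partition column
  month: Byte,      // partition column (TINYINT in Hive)
  day: Byte         // partition column (TINYINT in Hive)
)
```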
Note: the timestamp has been decomposed into several columns that will form the partitions. Month and day are TinyInt in order to save space (we might end up having millions of records).
This can easily be partitioned by using a matching folder structure in HDFS. If we take a look, we will see directories like these (data shown for our table partitioned by year and month):
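The original listing is not reproduced here, but under Hive's standard `column=value` directory convention the layout would look roughly like this (the table location and the year/month values are purely illustrative):

```
/user/hive/warehouse/timeseries/year=2016/month=11/
/user/hive/warehouse/timeseries/year=2016/month=12/
/user/hive/warehouse/timeseries/year=2017/month=1/
```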
As per the Hive documentation, the optimal partition size is 2 GB. What happens if the partitioning scheme you defined ends up in production with partitions that are too big or too small? Here are some examples of how you can adjust your partitions using Spark.