3 Steps to prepare big data for plotting using Hive’s Histogram.
Hive’s histogram_numeric function computes an approximate histogram of a numerical column using a user-specified number of bins. The output is an array of (x, y) pairs, returned as Hive struct objects, that represent the histogram’s bin centers (x values) and bin heights (y values).
Although this function creates a histogram with non-uniform bin widths, the result is to some extent comparable to the histogram produced by the R statistical language.
Sample Table: TESTHIST
1. Start hive
2. Execute the HISTOGRAM_NUMERIC function for the above table.
hive> select histogram_numeric(value,5) from TESTHIST;
3. Check the output results
Now the data is available, and it’s up to you to plot it in gnuplot, Excel, D3, MATLAB, and so on.
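Most plotting tools expect one row per point rather than a single array of structs. The query below is a minimal sketch of one way to flatten the result, assuming the TESTHIST table and value column used above; the aliases h, t, exploded, hist, bin_center and bin_height are illustrative names, not part of the original example.

```sql
-- Sketch: turn the array returned by histogram_numeric into one row per bin.
-- Assumes the TESTHIST table and its numeric column "value" from step 2.
SELECT hist.x AS bin_center,   -- x: center of the bin
       hist.y AS bin_height    -- y: height (approximate count) of the bin
FROM (
  SELECT histogram_numeric(value, 5) AS h
  FROM TESTHIST
) t
LATERAL VIEW explode(t.h) exploded AS hist;
```

Each output row then holds one (x, y) pair, which is usually easier to export as delimited text and load into a plotting tool.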
Some Use cases
- Estimating the frequency distribution of a column, possibly grouped by other attributes.
- Choosing discretization points in a continuous-valued column.
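As a rough illustration of the first use case, histogram_numeric is an aggregate function, so it can be combined with GROUP BY to get one histogram per group. The sketch below assumes TESTHIST also has a grouping column; the name category is hypothetical and does not appear in the sample table.

```sql
-- Sketch: one approximate 5-bin histogram per group.
-- "category" is a hypothetical grouping column, used here for illustration only.
SELECT category,
       histogram_numeric(value, 5) AS hist
FROM TESTHIST
GROUP BY category;
```

For the second use case, the x values of the returned bins could serve as candidate discretization points, although how well they fit depends on the data and the number of bins chosen.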