Friday, January 25, 2013

Histograms with Google charts

As of today (Jan 2013) there seems to be no histogram chart in Google charts. Continuing my experiments with column charts from the previous post, I decided to look at this particular problem.

Before looking at charts, let's first look at what we have. In a very general sense we have

  • a series of data
  • a list of bins.
    The data points are distributed into the bins according to some given criterion, and the number of entries per bin (called the frequency) is counted.

    While this sounds very abstract, a couple of examples will make it clearer.

    • Example 1: a numerical data series, e.g. some kind of measurements like 1.5, 4.03, 2.6 etc. The bins are disjoint intervals, e.g. 0-2, 2-4 etc., and the data points are distributed by numerical comparison (greater or less than the interval endpoints).
    • Example 2: an ordinal series, e.g. the list of grades of a class A, C, A, F, B, B, C, A, A etc. The bins are the grades A, B, C, D, F, and the criterion is simply equality with the grade.

    The histogram is then the graphical display of bin values vs. frequency as adjacent bars.

    Since Google charts can do column charts, we are halfway there. What is missing, and what I did, is generating the two-dimensional array that the Google column chart needs as input.

    My example covers only the simple numerical case. I start with a data series, some bins and of course a title which explains what has actually been measured.

    var series = [   1, 3, 5, 7, 2.5, 3.1, 0.45, 5.1, 8.3, 4, 5.11, 3.9 ];
    var seriesTitle = "Length";
    
    var bins      = [   2,    4,    6 ];
    // There should be one for each bin plus an extra for larger values
    var binTitles = ['0-2','2-4','4-6','more'];  
    
    
    The bins should be interpreted as the endpoints of intervals, i.e. everything up to and including 2, above 2 up to and including 4, and above 4 up to and including 6. Values larger than 6 go into one extra bin, labelled 'more' here. The histogram looks like this:

    Starting with these data, a two-dimensional array called histo is created with var histo = new Array(); and will eventually look like this:

    [ [ 'Length', 'freq' ],
      [ '0-2',  2 ],
      [ '2-4',  5 ],
      [ '4-6',  3 ],
      [ 'more', 2 ] ]
    

    First there is a function to initialize the histo array

    var histo = new Array();   // the table that will be handed to the column chart
    
    // The rows are labelled with the bin titles, so those are passed in
    function initHisto(title, binTitles) {
      // header line
      histo.push([title, "freq"]);
    
      // create one row for each bin title, with the frequency starting at 0
      for(var b=0; b<binTitles.length; b++ ) {
        histo.push([""+binTitles[b], 0]);
      }
    }
    
    initHisto( seriesTitle, binTitles );
    
    The following function, called frequency, counts the entries per bin and stores each count in the corresponding histo cell.

    var maxFreq     = 0;    // Necessary to set the maximum y-value
    
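    function frequency( series, bins ) {
      var d, b, h;   // declared locally so they do not leak into the global scope
      for(d=0; d<series.length; d++ ) {
        // first bin: everything up to and including the first endpoint
        if( series[d]<=bins[0] ) {
            histo[1][1]++;
            continue;
        }
        // last extra bin: everything above the last endpoint
        if( bins[bins.length-1]<series[d] ) {
            histo[bins.length+1][1]++;
            continue;
        }
        // any bin in between: bins[b] < value <= bins[b+1]
        for(b=0; b<bins.length-1; b++ ) {
           if( bins[b]<series[d] && series[d]<=bins[b+1] ) {
              histo[b+2][1]++;
              break;   // a value belongs to exactly one bin
           }
        }
      }
    
      // remember the largest frequency for scaling the vertical axis
      for(h=1; h<histo.length; h++ ) {
        if( maxFreq<histo[h][1] ) {
          maxFreq   = histo[h][1];
        }
      }
    }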
    
    frequency( series, bins );
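
    As a variation on the two functions above, the same counting can be sketched as one self-contained function that takes everything as parameters and returns the table instead of filling the global histo array. This is only a minimal sketch: the names binIndex and buildHistogram are made up for illustration and are not part of Google charts. It assumes that bins holds ascending upper endpoints and that binTitles has exactly one more entry than bins.

```javascript
// Hypothetical helper: index of the bin a value falls into.
// bins holds ascending upper endpoints; each upper endpoint is inclusive.
// Values above the last endpoint get index bins.length (the extra bin).
function binIndex(value, bins) {
  for (var b = 0; b < bins.length; b++) {
    if (value <= bins[b]) {
      return b;
    }
  }
  return bins.length;
}

// Hypothetical helper: builds the header row plus one row per bin title
// and counts the series into it, without touching any global variable.
function buildHistogram(series, bins, binTitles, title) {
  var table = [[title, 'freq']];
  var b, d;
  for (b = 0; b < binTitles.length; b++) {
    table.push(['' + binTitles[b], 0]);
  }
  for (d = 0; d < series.length; d++) {
    // +1 skips the header row
    table[binIndex(series[d], bins) + 1][1]++;
  }
  var maxFreq = 0;
  for (b = 1; b < table.length; b++) {
    maxFreq = Math.max(maxFreq, table[b][1]);
  }
  return { table: table, maxFreq: maxFreq };
}

var result = buildHistogram(
  [1, 3, 5, 7, 2.5, 3.1, 0.45, 5.1, 8.3, 4, 5.11, 3.9],
  [2, 4, 6],
  ['0-2', '2-4', '4-6', 'more'],
  'Length'
);
```

    With the example data, result.table should have the same content as the histo array shown earlier, and result.maxFreq takes the place of the global maxFreq.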
    

    Now that the histo array has been constructed, it can be fed to Google charts with google.visualization.arrayToDataTable( histo ). The chart needs some histogram-specific tweaking, which I'll explain.

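    // Assumes the Google charts library with its 'corechart' package has been loaded before this is called
    function drawChart1() {
      var data = google.visualization.arrayToDataTable( histo );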
    
      var numGrids;
      //  if maxFreq is odd we make it even
      if( maxFreq%2 == 1 ) {
        maxFreq++;
      }
      //  the grid lines should be every even number
      numGrids        = maxFreq/2 +1;
    
      var options = {
        title:  'Histogram',
        legend: { position: 'none' },     // no legend
        bar:    { groupWidth: '99%' },    // wide bars with only a little space in between
        vAxis:  { title: histo[0][1], minValue: 0, maxValue: maxFreq, gridlines: { count: numGrids } },
        hAxis:  { title: histo[0][0] },
        backgroundColor: { strokeWidth: 2 }   // to get a nice box
      };
    
      var chart = new google.visualization.ColumnChart(document.getElementById('chart_divH'));
      chart.draw(data, options);
    }
    

    I put all of the above into one section enclosed by <script>...</script> tags, but the histo calculation could just as well be kept separate from the chart code.
    Unfortunately the chart options are not quite independent of the data. For example, the number of grid lines has to be reduced for higher frequencies in order to display nicely. The groupWidth needs to be adjusted when more bins are displayed so that a little distinction between the bars stays visible, and it probably also depends on the final width and height of the chart. The width of the chart needs to increase if a larger number of bins is to be displayed nicely.
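
    One possible way to keep the number of grid lines small for higher frequencies is to derive both the axis maximum and the grid line count from maxFreq. The following is only a sketch under my own assumptions: the function name gridForMaxFreq and the step sequence 1, 2, 5, 10, 20, ... are made up for illustration, and the result is meant to be plugged into the maxValue and gridlines count options already used in drawChart1.

```javascript
// Hypothetical helper: picks a grid step from 1, 2, 5, 10, 20, 50, ... so that
// the vertical axis needs at most maxLines grid lines (including the one at 0),
// and rounds the axis maximum up to a multiple of that step.
function gridForMaxFreq(maxFreq, maxLines) {
  var steps = [1, 2, 5];
  var factor = 1;
  var limit = Math.max(2, maxLines);   // fewer than 2 lines is not possible
  while (true) {
    for (var i = 0; i < steps.length; i++) {
      var step = steps[i] * factor;
      // smallest multiple of step that is >= maxFreq, but at least one step
      var top = Math.max(step, Math.ceil(maxFreq / step) * step);
      var count = top / step + 1;
      if (count <= limit) {
        return { maxValue: top, count: count };
      }
    }
    factor *= 10;
  }
}

var grid = gridForMaxFreq(37, 6);
// Intended use inside the chart options (names as in drawChart1):
// vAxis: { minValue: 0, maxValue: grid.maxValue, gridlines: { count: grid.count } }
```

    For the example data (maxFreq of 5) and a limit of 4 lines this gives the same values as the odd/even logic in drawChart1, namely a maximum of 6 with 4 grid lines.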

    The chart is then displayed in the following element in the HTML body:

    <div id="chart_divH" style="width: 300px; height: 300px;"></div>
    

    If you want to use other types of data series, you need to change the frequency function: instead of mathematical greater/less comparisons, write the appropriate code for your case. For the 'grades' example given above, the comparison could be something like if( series[d]==bins[b] ) { histo[b+1][1]++; }
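
    To make the grades case concrete, here is a minimal sketch of such a counting function. The name frequencyByCategory is made up for illustration, the bins are used directly as row titles, and indexOf (which compares with strict equality rather than ==) replaces the explicit loop over the bins. Entries that match no bin are simply skipped, which is an assumption on my part.

```javascript
// Hypothetical helper: counts how often each category occurs in the series.
// bins lists the categories in display order, e.g. the grades.
function frequencyByCategory(series, bins, title) {
  var table = [[title, 'freq']];
  var b, d;
  for (b = 0; b < bins.length; b++) {
    table.push(['' + bins[b], 0]);
  }
  for (d = 0; d < series.length; d++) {
    b = bins.indexOf(series[d]);
    if (b !== -1) {            // entries matching no bin are ignored
      table[b + 1][1]++;       // +1 skips the header row
    }
  }
  return table;
}

var gradeTable = frequencyByCategory(
  ['A', 'C', 'A', 'F', 'B', 'B', 'C', 'A', 'A'],
  ['A', 'B', 'C', 'D', 'F'],
  'Grade'
);
```

    The returned table has the same layout as histo, so it can be passed to google.visualization.arrayToDataTable in the same way.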
