# Thread: Intelligently calculate y-axis values on a graph

1. Web Hosting Master
Join Date
May 2009
Posts
766

## Intelligently calculate y-axis values on a graph

Hi all,

I'm building a graph. My x values are known (months), but my y-values can be all over the place (within the integer family, at least ).

Here are my requirements:

1. The y-axis must have 10 labeled steps (bottom, top and 8 intermediary)
2. The step values must all be "nice round numbers"
3. The top of the graph must be >= my highest value and no less than zero
4. The bottom of the graph must be <= my lowest value and no more than zero (ie, zero must be on the graph somewhere!)

Here's my current solution:
PHP Code:
``` \$min = 0; // min will never be more than zero\$max = 0; // max will never be less than zerofor (\$i = 0; \$i < 12; ++\$i){  \$min = min(\$min, \$values[\$i]);  \$max = max(\$max, \$values[\$i]);}// at this point, i have my zero-adjusted min/max data points// this is where I'm not too happy...\$min_abs = abs(\$min);\$max_abs = abs(\$max);\$block = pow(10, intval(log(max(\$min_abs, \$max_abs), 10)));// block is now equal to 10 * the number of digits in the value// min/max which is farthest from zero...\$firstLabel = 0; // graph bottom labelwhile (\$firstLabel < \$min_abs) \$firstLabel += \$block;// now my bottom label is the first factor of block > my min value\$lastLabel = 0; // graph top labelwhile (\$lastLabel < \$max_abs) \$lastLabel += \$block;// now my top label is the first factor of block > my max value// the bottom label will always be negative or zero, so:\$firstLabel = 0 - \$firstLabel;// the min value for the top label is zero, so don't need to worry about sign// there is still the possibility that both labels = 0, so force a rangeif (\$firstLabel == \$lastLabel) \$lastLabel += 10;// set my label step value\$step = (\$lastLabel - \$firstLabel) / 10;  ```
This has been a mostly good solution. I don't get any bad results, but I get some irritating ones...

data(min, max) graph(min, max, step):

I'm good with results like these:

data(2,570, 294,801) graph (0, 300,000, 30,000)
data(7,307, -7,307) graph (-8,000, 8,000, 1,600)

However. some are like this:

data(-20,160, 253,410) graph(-100,000, 300,000, 40,000)

Here are the problems with the above data set:

1. With a step of 40,000, the max value is not between the highest (300,000) and second highest (260,000) steps...
2. The low value is less than one full step below zero, yet the graph goes down to -100,000.
3. Zero doesn't line up with any steps (jumps from -20,000 to 20,000)

So, basically, I get a "correct" graph which follows my requirements, but it is weird to not have a label for zero and to have so much "white-space" on either side of the min/max values. If I were drawing this by hand, I would have probably chosen a graph range of -30,000 to 270,000 with a step of 30,000.

My theoretical worst case scenario looks like this:

data(-1, 1,000,001) graph(-1,000,000, 2,000,000, 300,000)

Has anyone implemented a solution for this kind of problem that you've been happy with?

TIA,
Matt
Last edited by mattle; 10-30-2009 at 05:23 PM.

2. Web Hosting Master
Join Date
Mar 2009
Posts
2,218
Originally Posted by mattle

I'm good with results like these:

data(2,570, 294,801) graph (0, 300,000, 30,000)
data(7,307, -7,307) graph (-8,000, 8,000, 1,600)

However. some are like this:

data(-20,160, 253,410) graph(-100,000, 300,000, 40,000)

Here are the problems with the above data set:

1. With a step of 40,000, the max value is not between the highest (300,000) and second highest (260,000) steps...
2. The low value is less than one full step below zero, yet the graph goes down to -100,000.
3. Zero doesn't line up with any steps (jumps from -20,000 to 20,000)
So what do you want the program to produce as the step size, minimum and maximum for that set of data?

As far as I can see, the stepsize has to be greater than or equal to (max - min) / 9.

And because you want "nice round numbers", then for some sets of data,

(max - min) < 8 * stepsize

You can always make zero line up with a step, but it may require increasing the step size.

3. Web Hosting Master
Join Date
May 2009
Posts
766
Like I said...

I would have probably chosen a graph range of -30,000 to 270,000 with a step of 30,000.
It's hard to say what the step size ought to be. I don't think there should be a whole step outside of the min/max data points (ie, if my highest data point is 85k, I wouldn't want to see a graph that went up to 0-100k with a step size of 10k, however if it went from -100k - 100k with a step size of 20k, that would be ok, provided I had data where -80k > x > -100k)

You can always make zero line up with a step, but it may require increasing the step size.
From my example above, actually reducing the step size (and range, accordingly) not only fit the data better but gave my an incident of zero in my step.

4. Web Hosting Master
Join Date
May 2009
Posts
766
Originally Posted by mattle
Hi all,

1. The y-axis must have 10 labeled steps (bottom, top and 8 intermediary)
Correction...that should be 11 steps, e.g:

0 1 2 3 4 5 6 7 8 9 10
Originally Posted by mattle
// block is now equal to 10 * the number of digits in the value
And that should read 10 ^ one less than the number of digits...e.g.

PHP Code:
``` echo pow(10, intval(log(\$x, 10));// if 99 < \$x < 1000, then log(\$x) = 2.something and 10^(intval(2.something)) = 100// so, if value is 3 digits, block = 100, etc.  ```
Last edited by mattle; 11-01-2009 at 07:41 PM.

5. Web Hosting Master
Join Date
Mar 2009
Posts
2,218
But that requires 11 steps not 10.

The data (-20,160, 253,410) has a range of 273,570, which is less than 10*30,000, and can be reasonably well represented by the 11 steps -30,000 ... 270,000.

But what if the data had a range of 301,000? You'd need to increase the step size. And if you increase it to 40,000, there will be a range of 400,000, so there will be "white space" of 99,000, which is more than two steps.

I suppose what you could do is have an ordered list of "acceptable" steps, calculate the number of positive and negative steps for the data range, and pick the first one where the sum of numbers of positive and negative steps is acceptable.

So the list of steps might be

1,1.25,1.666,2,2.5,3,3.333,4,5,10,12.5,16.666,20,25,30,33.333, and so on.

Some of those are not "round numbers" but it's reasonably common to divide, say, a range of 10 into 3, or 6.

6. Web Hosting Master
Join Date
May 2009
Posts
766
Originally Posted by tim2718281
But that requires 11 steps not 10.

The data (-20,160, 253,410) has a range of 273,570, which is less than 10*30,000, and can be reasonably well represented by the 11 steps -30,000 ... 270,000.

Originally Posted by tim2718281
But what if the data had a range of 301,000? You'd need to increase the step size. And if you increase it to 40,000, there will be a range of 400,000, so there will be "white space" of 99,000, which is more than two steps.
It's interesting that you focus on the step size. I approached the problem from a different perspective, using factors of 10 to figure out the upper and lower limits first, thus guaranteeing a clean division by ten for the step size.

Originally Posted by tim2718281
I suppose what you could do is have an ordered list of "acceptable" steps, calculate the number of positive and negative steps for the data range, and pick the first one where the sum of numbers of positive and negative steps is acceptable.

So the list of steps might be

1,1.25,1.666,2,2.5,3,3.333,4,5,10,12.5,16.666,20,25,30,33.333, and so on.

Some of those are not "round numbers" but it's reasonably common to divide, say, a range of 10 into 3, or 6.
I've been avoiding an ordered list because my data sets may literally range from a set with all zeroes (forced into a 0-10 range) to sets with up to 9 digits. I would really like to avoid pre-defined "acceptable" steps.

Your solution would probably help cases like this:

data(7,307, -7,307) graph (-8,000, 8,000, 1,600)

Really, 1,500 would be a more optimal step size here...would yield a range of -7500 - 7500. However, if I had pre-defined step sizes, I may have jumped from 1500 to 2000. A data set of (-7,895, 7,895) would get bumped up to a graph range of (-10,000 to 10,000), in which case, my original algorithm would have yielded a better result.

Also, keep in mind that what I have now meets my specifications...I'm really interested in this for the mental exercise--how do you design an algorithm to choose your y-axis values in an intuitive manner? I'm not really looking for methods that take intelligence out of the calculations...this isn't one of those "just get it done" scenarios

I am interested in exploring the idea of calculating the step size first. Perhaps a method of calculating a range of potential step sizes and then scoring them with weighting on closeness to the data range and factor-ability.

Of course, that opens up a new incarnation of the same problem. It would certainly be wasteful to test every candidate, especially in a situation where we've decided the step size will be between 30k and 40k. At that point, the "real" candidates are probably 30k, 35k and 40k. It's ridiculous to audition 10,000 integers. But again, we're facing with the problem of choosing logical increments to encompass a range of numbers. Perhaps the solution is recursive...

7. Web Hosting Master
Join Date
May 2009
Posts
766
For what it's worth, here's my progress so far (in Perl, just for kicks):

PHP Code:
``` use strict;# recursively assess "prettiness" based on factorabilitysub factorScore{    my \$testvalue = shift;    my @preferred_factors = (10, 5);    foreach (@preferred_factors)  {        return 1 if (\$testvalue < \$_);         if (\$testvalue % \$_ == 0)        {            return \$_ + factorScore(\$testvalue / \$_);        }  }  return 1;}# determine if a given step yields a range (including zero-crossing)# that fits data pointssub fitsRange{    my \$l = shift;    my \$h = shift;    my \$step = shift;        my \$factor = sprintf("%d", \$h / \$step + (\$h % \$step != 0));    my \$hv = \$step * \$factor;    my \$lv = \$step * (\$factor - 10);        #printf("high: %d, low: %d\n", \$hv, \$lv);        return (\$h <= \$hv && \$l >= \$lv);}my \$low = \$ARGV[0];my \$high = \$ARGV[1];# find the first step that has a zero-crossing and fits the data rangemy \$minStep = sprintf("%d", (\$high - \$low) / 10);++\$minStep while (!fitsRange(\$low, \$high, \$minStep));# arbitrarily entertain steps up to 10% bigger# (higher % = higher likelihood of a prettier step number#  lower % = higher likelihood of a tigher data range)my \$maxStep = sprintf("%d", \$minStep * 1.1);# Calculated min/max stepsprintf("Min %d, Max %d\n", \$minStep, \$maxStep);my (\$bestScore, \$bestStep);# this part still sucks...can be hundreds of thousands of iterations# I'm thinking I need a way to make the whole process recursive# so that it steps through this secondary range in a logical manner# ie, if \$minStep = 29567 and \$maxStep = 32523, we should pick maybe# ten values to audition:#   (30000, 30250, 30500 ... 32500)# (ie, we would run the range 29567-32523 through the same process,# searching for 12 steps and then audition the inner 10...)for (my \$i = \$minStep; \$i <= \$maxStep; ++\$i){  my \$rangeDiff = (\$high - \$low) / (10 * \$i);  my \$fs = factorScore(\$i);  my \$score = \$fs * \$rangeDiff;  (\$bestScore = \$score) && (\$bestStep = \$i)    if (\$score > \$bestScore && fitsRange(\$low, \$high, \$i));}# who won?printf("%4d : %4f\n", \$bestStep, \$bestScore);# draw y-axismy \$factor = sprintf("%d", \$high / \$bestStep + (\$high % \$bestStep != 0));for (my \$i = 0; \$i < 11; ++\$i){  printf("%10d |\n", \$factor * \$bestStep);  --\$factor;}  ```
And some results based on previously mentioned data ranges:

[email protected]:~# ./test.pl 0 294801
Min 29481, Max 32429
30000 : 40.289470
300000 |
270000 |
240000 |
210000 |
180000 |
150000 |
120000 |
90000 |
60000 |
30000 |
0 |
[email protected]:~# ./test.pl -7307 7307
Min 1462, Max 1608
1500 : 25.330933
7500 |
6000 |
4500 |
3000 |
1500 |
0 |
-1500 |
-3000 |
-4500 |
-6000 |
-7500 |
[email protected]:~# ./test.pl -20160 253410
Min 28157, Max 30972
30000 : 37.387900
270000 |
240000 |
210000 |
180000 |
150000 |
120000 |
90000 |
60000 |
30000 |
0 |
-30000 |

Notice here it picks 30,000 as the step that I would have chosen "intuitively" in the OP

[email protected]:~# ./test.pl -1 1000001
Min 111112, Max 122223
120000 : 34.166735
1080000 |
960000 |
840000 |
720000 |
600000 |
480000 |
360000 |
240000 |
120000 |
0 |
-120000 |

And this is certainly better than the 300,000 step my previous implementation would yield.

So now, I'm only facing the remaining problem of making the auditioning process presuppose likely candidates. Knowing that some data sets will get into the 10^9 range, that means we can be looking at step values in the 10^8 range. With a 10% difference between lowest and highest candidate values, that means we could be performing 10^7 auditions.

Here's what I'm thinking for a recursive implementation:

Original data range: 10^9
Range of viable steps: 10^7

Original data range: 10^7
Range of viable steps: 10^5

Original data range: 10^5
Range of viable steps: 10^3

Original data range: 10^3
Range of viable steps: 10^1

If we're testing 10 candidates at each stage, we end up with 10*4 auditions for a 10^9 data range as opposed to 10^7 auditions. A 25,000 times more efficient algorithm!

I think it's really just a question of implementation from here.

Of course, I'd love to get community input on this.

Matt

@tim: thanks again for opening my eyes to the concept of calculating the step first...I think that approach really yielded much more usable results.

8. Web Hosting Master
Join Date
Mar 2009
Posts
2,218
It's an interesting problem

I'm not sure there are that many candidates.

It seems to me that the numbers you'd use may be of the form

s * u / v

where

s is a scale factor, such as 10000
u and v are who numbers from 1 to 10

A couple of minutes with a spreadsheet yielded the following set of possible choices

Code:
```  	1       	2	       3	       4	       5	       6	       7	       8	       9	        10
1	10000	20000	30000	40000	50000	60000	70000	80000	90000	100000
2	5000	10000	15000	20000	25000	30000	35000	40000	45000	50000
3	3333.33	6666.67	10000	13333.33	16666.67	20000	23333.33	26666.67	30000	33333.33
4	2500	5000	7500	10000	12500	15000	17500	20000	22500	25000
5	2000	4000	6000	8000	10000	12000	14000	16000	18000	20000
6	1666.67	3333.33	5000	6666.67	8333.33	10000	11666.67	13333.33	15000	16666.67
7	1428.57	2857.14	4285.71	5714.29	7142.86	8571.43	10000	11428.57	12857.14	14285.71
8	1250	2500	3750	5000	6250	7500	8750	10000	11250	12500
9	1111.11	2222.22	3333.33	4444.44	5555.56	6666.67	7777.78	8888.89	10000	11111.11
10	1000	2000	3000	4000	5000	6000	7000	8000	9000	10000```
I wonder, are there any numbers you'd use that cannot be formed by multiplying a number on the list times a power of ten?

9. Web Hosting Master
Join Date
May 2009
Posts
766
Tim...that's an interesting concept...I'm curious how you'd implement it given the inputs of only the endpoints of the data range.

It seems that s would have to be determined by the data range, as it limits the maximum high point on the graph--which again yields the same troublesome question...how do you dynamically pick a logical value for s? Recall that using a factor of ten was part of the original process that yielded graph ranges that were too wide (the -1 to 1,000,001 case).

I also don't see there ever being a case where I'd want a y-axis value that isn't an integer, so if u represents the individual step values, the only valid values of v would be those that are factors of s. Restricting s to 10^n (n > 0) then restricts v to (1, 2, 5, 10) ( U (4) where n > 1 ) ( U (8) where n > 2 )

Also, the data sets produced here would need to be "shiftable" to accommodate negative numbers in the graph range. This gets tricky with zero-crossings. Say I've got a data range of -10 to 9990. Assuming we use 10,000 for s and 10 for v (as the row in the chart that has an exact matching range). We can shift the range by ten to come up with a set of points that encompass the data:

-10 990 1990 2990 3990 4990 5990 6990 7990 8990 9990

However, I wouldn't argue that that is the best set of values for our data. Assuming that we don't want fractional y-axis points, we need to keep moving up the chart to find a graph range that suits all of our needs. The first candidate is v=8, shifted by 1250:

-1250 0 1250 2500 3750 5000 6250 7500 8750 10000 11250

Which is not bad. By comparison, my implementation yields:

-1200 0 1200 2400 3600 4800 6000 7200 8400 9600 10800

(although with a minimum step that yields a zero-crossing calculated at 1110 and a maximum allowable step at 10% greater being 1221, it did not get as far as to even audition a step of 1250)

It is of course possible that we could arrive at the same set of numbers if s can be a number that is not a power of ten, or is we expand the range of possible values for v.

Matt

10. Web Hosting Master
Join Date
Mar 2009
Posts
2,218
Originally Posted by mattle
Tim...that's an interesting concept...I'm curious how you'd implement it given the inputs of only the endpoints of the data range.
OK

1) take input min_value, max_value, number_of_steps

2) calculate range, = max(max_value, 0 - min_value, max_value - min_value)

(Both min_value and max_value can have the same sign, and the range must include 0)

3) calculate the minimum and maximum stepsize; the maximum is simply the range, and the minimum is the range divided by the number of steps.

4) Now the tricky bit; I suggested candidates might be of the form (u/v)*10^p; and you said you only wanted integers.

So have three nested loops, with u and v ranging from 1 to 10, and p from 1 to some maximum ... eg 10.

Calculate the result from the formula (u/v)*10^p, and if it's an integer between minum and maximum stepsize, add it to the list of candidates.

5) Sort the list of candidates and print them or whatever.

Here's quick and dirty PHP code to explain that:

Code:
```<?php
\$lower = \$argv[1];
\$upper = \$argv[2];
\$steps = \$argv[3];
\$maxpower = 10;

print "lower is " . \$lower . "\n" ;
print "upper is " . \$upper . "\n" ;
print "steps is " . \$steps . "\n" ;

\$range = \$upper - \$lower ;
print "range is " . \$range . "\n" ;

\$minstep = \$range / \$steps;
\$maxstep = max(\$range, abs(\$upper) , abs(\$lower)) ;

print "minimum step is " . \$minstep . "\n" ;
print "maximum step is " . \$maxstep . "\n" ;

print "calculating ...";

\$maxval = 10;
\$ncandidates = 0;

for (\$u = 1;  \$u <= 10;  \$u++) {
for (\$v = 1;  \$v <= 10;  \$v++) {
\$baseval = \$u / \$v ;
for (\$p = 0; \$p <= \$maxpower; \$p++) {
\$factor = pow(10 , \$p);
\$t = \$baseval * \$factor ;
if ((floor(\$t) == \$t) & \$t >= \$minstep & \$t <= \$maxstep) {
\$candidates[\$ncandidates++] = \$t;
}
}
}
}
sort(\$candidates);
var_dump(array_unique(\$candidates));

?>```

And here is the result of

php graph.php -10 9990 10

[[email protected] programming]\$ php graph.php -10 9990 10
lower is -10
upper is 9990
steps is 10
range is 10000
minimum step is 1000
maximum step is 10000
calculating ...array(26) {
[0]=>
int(1000)
[12]=>
float(1125)
[13]=>
float(1200)
[14]=>
float(1250)
[17]=>
float(1400)
[18]=>
float(1500)
[21]=>
float(1600)
[22]=>
float(1750)
[23]=>
float(1800)
[24]=>
int(2000)
[31]=>
float(2250)
[32]=>
float(2500)
[36]=>
int(3000)
[40]=>
float(3500)
[41]=>
float(3750)
[42]=>
float(4000)
[46]=>
float(4500)
[47]=>
float(5000)
[54]=>
float(6000)
[57]=>
float(6250)
[58]=>
float(7000)
[60]=>
float(7500)
[62]=>
float(8000)
[65]=>
float(8750)
[66]=>
int(9000)
[68]=>
int(10000)
}

Next, you'd write code to calculate the sum of the number of positive steps and the number of negative steps.

So for -10 to 9990 and candidate 1000, there would be 10 positive steps and 1 negative step, for a total of 11, which is more than allowed. So 1000 would be excluded.

Next is 1125, which would give 9 positive steps and one negative, which is 10 steps; so that's a possibility.

But 1100 is missing! That's because with u and v ranging from 1 to 10, and multiplying by powers of 10, there's so way to generate multiples of 11. (Sure, 1100 would be excluded anyway from -10 to 9990, but it ought to be a candidate. Candidates depend on range, not specific values.)

So a quick update to the code, to have u and v range from say 1 to 16.

And now we get

[0]=>
int(1000)
[18]=>
float(1100)
[20]=>
float(1125)
[21]=>
float(1200)
[24]=>
float(1250)
[29]=>
int(1300)
[31]=>
float(1375)
[32]=>
float(1400)
[35]=>
float(1500)
[41]=>
float(1600)
[44]=>
float(1625)
[45]=>
float(1750)
[47]=>
float(1800)
[48]=>
float(1875)
[50]=>
int(2000)
...
Last edited by tim2718281; 11-12-2009 at 08:07 AM.

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•