
|
View Full Version : Intelligently calculate y-axis values on a graph
mattle 10-30-2009, 05:18 PM Hi all,
I'm building a graph. My x values are known (months), but my y-values can be all over the place (within the integer family, at least :rolleyes: ).
Here are my requirements:
The y-axis must have 10 labeled steps (bottom, top and 8 intermediary)
The step values must all be "nice round numbers"
The top of the graph must be >= my highest value and no less than zero
The bottom of the graph must be <= my lowest value and no more than zero (ie, zero must be on the graph somewhere!)
Here's my current solution:
$min = 0; // min will never be more than zero
$max = 0; // max will never be less than zero
for ($i = 0; $i < 12; ++$i)
{
$min = min($min, $values[$i]);
$max = max($max, $values[$i]);
}
// at this point, i have my zero-adjusted min/max data points
// this is where I'm not too happy...
$min_abs = abs($min);
$max_abs = abs($max);
$block = pow(10, intval(log(max($min_abs, $max_abs), 10)));
// block is now equal to 10 * the number of digits in the value
// min/max which is farthest from zero...
$firstLabel = 0; // graph bottom label
while ($firstLabel < $min_abs) $firstLabel += $block;
// now my bottom label is the first factor of block > my min value
$lastLabel = 0; // graph top label
while ($lastLabel < $max_abs) $lastLabel += $block;
// now my top label is the first factor of block > my max value
// the bottom label will always be negative or zero, so:
$firstLabel = 0 - $firstLabel;
// the min value for the top label is zero, so don't need to worry about sign
// there is still the possibility that both labels = 0, so force a range
if ($firstLabel == $lastLabel) $lastLabel += 10;
// set my label step value
$step = ($lastLabel - $firstLabel) / 10;
This has been a mostly good solution. I don't get any bad results, but I get some irritating ones...
data(min, max) graph(min, max, step):
I'm good with results like these:
data(2,570, 294,801) graph (0, 300,000, 30,000)
data(7,307, -7,307) graph (-8,000, 8,000, 1,600)
However. some are like this:
data(-20,160, 253,410) graph(-100,000, 300,000, 40,000)
Here are the problems with the above data set:
With a step of 40,000, the max value is not between the highest (300,000) and second highest (260,000) steps...
The low value is less than one full step below zero, yet the graph goes down to -100,000.
Zero doesn't line up with any steps (jumps from -20,000 to 20,000)
So, basically, I get a "correct" graph which follows my requirements, but it is weird to not have a label for zero and to have so much "white-space" on either side of the min/max values. If I were drawing this by hand, I would have probably chosen a graph range of -30,000 to 270,000 with a step of 30,000.
My theoretical worst case scenario looks like this:
data(-1, 1,000,001) graph(-1,000,000, 2,000,000, 300,000)
Has anyone implemented a solution for this kind of problem that you've been happy with?
TIA,
Matt
tim2718281 11-01-2009, 07:00 PM I'm good with results like these:
data(2,570, 294,801) graph (0, 300,000, 30,000)
data(7,307, -7,307) graph (-8,000, 8,000, 1,600)
However. some are like this:
data(-20,160, 253,410) graph(-100,000, 300,000, 40,000)
Here are the problems with the above data set:
With a step of 40,000, the max value is not between the highest (300,000) and second highest (260,000) steps...
The low value is less than one full step below zero, yet the graph goes down to -100,000.
Zero doesn't line up with any steps (jumps from -20,000 to 20,000)
So what do you want the program to produce as the step size, minimum and maximum for that set of data?
As far as I can see, the stepsize has to be greater than or equal to (max - min) / 9.
And because you want "nice round numbers", then for some sets of data,
(max - min) < 8 * stepsize
You can always make zero line up with a step, but it may require increasing the step size.
mattle 11-01-2009, 07:33 PM Like I said...
I would have probably chosen a graph range of -30,000 to 270,000 with a step of 30,000.
It's hard to say what the step size ought to be. I don't think there should be a whole step outside of the min/max data points (ie, if my highest data point is 85k, I wouldn't want to see a graph that went up to 0-100k with a step size of 10k, however if it went from -100k - 100k with a step size of 20k, that would be ok, provided I had data where -80k > x > -100k)
You can always make zero line up with a step, but it may require increasing the step size.
From my example above, actually reducing the step size (and range, accordingly) not only fit the data better but gave my an incident of zero in my step.
mattle 11-01-2009, 07:36 PM Hi all,
The y-axis must have 10 labeled steps (bottom, top and 8 intermediary)
Correction...that should be 11 steps, e.g:
0 1 2 3 4 5 6 7 8 9 10
// block is now equal to 10 * the number of digits in the value
And that should read 10 ^ one less than the number of digits...e.g.
echo pow(10, intval(log($x, 10));
// if 99 < $x < 1000, then log($x) = 2.something and 10^(intval(2.something)) = 100
// so, if value is 3 digits, block = 100, etc.
tim2718281 11-01-2009, 09:11 PM But that requires 11 steps not 10.
The data (-20,160, 253,410) has a range of 273,570, which is less than 10*30,000, and can be reasonably well represented by the 11 steps -30,000 ... 270,000.
But what if the data had a range of 301,000? You'd need to increase the step size. And if you increase it to 40,000, there will be a range of 400,000, so there will be "white space" of 99,000, which is more than two steps.
I suppose what you could do is have an ordered list of "acceptable" steps, calculate the number of positive and negative steps for the data range, and pick the first one where the sum of numbers of positive and negative steps is acceptable.
So the list of steps might be
1,1.25,1.666,2,2.5,3,3.333,4,5,10,12.5,16.666,20,25,30,33.333, and so on.
Some of those are not "round numbers" but it's reasonably common to divide, say, a range of 10 into 3, or 6.
mattle 11-01-2009, 11:30 PM But that requires 11 steps not 10.
The data (-20,160, 253,410) has a range of 273,570, which is less than 10*30,000, and can be reasonably well represented by the 11 steps -30,000 ... 270,000.
Yes, please note my correction.
But what if the data had a range of 301,000? You'd need to increase the step size. And if you increase it to 40,000, there will be a range of 400,000, so there will be "white space" of 99,000, which is more than two steps.
It's interesting that you focus on the step size. I approached the problem from a different perspective, using factors of 10 to figure out the upper and lower limits first, thus guaranteeing a clean division by ten for the step size.
I suppose what you could do is have an ordered list of "acceptable" steps, calculate the number of positive and negative steps for the data range, and pick the first one where the sum of numbers of positive and negative steps is acceptable.
So the list of steps might be
1,1.25,1.666,2,2.5,3,3.333,4,5,10,12.5,16.666,20,25,30,33.333, and so on.
Some of those are not "round numbers" but it's reasonably common to divide, say, a range of 10 into 3, or 6.
I've been avoiding an ordered list because my data sets may literally range from a set with all zeroes (forced into a 0-10 range) to sets with up to 9 digits. I would really like to avoid pre-defined "acceptable" steps.
Your solution would probably help cases like this:
data(7,307, -7,307) graph (-8,000, 8,000, 1,600)
Really, 1,500 would be a more optimal step size here...would yield a range of -7500 - 7500. However, if I had pre-defined step sizes, I may have jumped from 1500 to 2000. A data set of (-7,895, 7,895) would get bumped up to a graph range of (-10,000 to 10,000), in which case, my original algorithm would have yielded a better result.
Also, keep in mind that what I have now meets my specifications...I'm really interested in this for the mental exercise--how do you design an algorithm to choose your y-axis values in an intuitive manner? I'm not really looking for methods that take intelligence out of the calculations...this isn't one of those "just get it done" scenarios :crap:
I am interested in exploring the idea of calculating the step size first. Perhaps a method of calculating a range of potential step sizes and then scoring them with weighting on closeness to the data range and factor-ability.
Of course, that opens up a new incarnation of the same problem. It would certainly be wasteful to test every candidate, especially in a situation where we've decided the step size will be between 30k and 40k. At that point, the "real" candidates are probably 30k, 35k and 40k. It's ridiculous to audition 10,000 integers. But again, we're facing with the problem of choosing logical increments to encompass a range of numbers. Perhaps the solution is recursive...
mattle 11-10-2009, 06:45 PM For what it's worth, here's my progress so far (in Perl, just for kicks):
use strict;
# recursively assess "prettiness" based on factorability
sub factorScore
{
my $testvalue = shift;
my @preferred_factors = (10, 5);
foreach (@preferred_factors)
{
return 1 if ($testvalue < $_);
if ($testvalue % $_ == 0)
{
return $_ + factorScore($testvalue / $_);
}
}
return 1;
}
# determine if a given step yields a range (including zero-crossing)
# that fits data points
sub fitsRange
{
my $l = shift;
my $h = shift;
my $step = shift;
my $factor = sprintf("%d", $h / $step + ($h % $step != 0));
my $hv = $step * $factor;
my $lv = $step * ($factor - 10);
#printf("high: %d, low: %d\n", $hv, $lv);
return ($h <= $hv && $l >= $lv);
}
my $low = $ARGV[0];
my $high = $ARGV[1];
# find the first step that has a zero-crossing and fits the data range
my $minStep = sprintf("%d", ($high - $low) / 10);
++$minStep while (!fitsRange($low, $high, $minStep));
# arbitrarily entertain steps up to 10% bigger
# (higher % = higher likelihood of a prettier step number
# lower % = higher likelihood of a tigher data range)
my $maxStep = sprintf("%d", $minStep * 1.1);
# Calculated min/max steps
printf("Min %d, Max %d\n", $minStep, $maxStep);
my ($bestScore, $bestStep);
# this part still sucks...can be hundreds of thousands of iterations
# I'm thinking I need a way to make the whole process recursive
# so that it steps through this secondary range in a logical manner
# ie, if $minStep = 29567 and $maxStep = 32523, we should pick maybe
# ten values to audition:
# (30000, 30250, 30500 ... 32500)
# (ie, we would run the range 29567-32523 through the same process,
# searching for 12 steps and then audition the inner 10...)
for (my $i = $minStep; $i <= $maxStep; ++$i)
{
my $rangeDiff = ($high - $low) / (10 * $i);
my $fs = factorScore($i);
my $score = $fs * $rangeDiff;
($bestScore = $score) && ($bestStep = $i)
if ($score > $bestScore && fitsRange($low, $high, $i));
}
# who won?
printf("%4d : %4f\n", $bestStep, $bestScore);
# draw y-axis
my $factor = sprintf("%d", $high / $bestStep + ($high % $bestStep != 0));
for (my $i = 0; $i < 11; ++$i)
{
printf("%10d |\n", $factor * $bestStep);
--$factor;
}
And some results based on previously mentioned data ranges:
root@host:~# ./test.pl 0 294801
Min 29481, Max 32429
30000 : 40.289470
300000 |
270000 |
240000 |
210000 |
180000 |
150000 |
120000 |
90000 |
60000 |
30000 |
0 |
root@host:~# ./test.pl -7307 7307
Min 1462, Max 1608
1500 : 25.330933
7500 |
6000 |
4500 |
3000 |
1500 |
0 |
-1500 |
-3000 |
-4500 |
-6000 |
-7500 |
root@host:~# ./test.pl -20160 253410
Min 28157, Max 30972
30000 : 37.387900
270000 |
240000 |
210000 |
180000 |
150000 |
120000 |
90000 |
60000 |
30000 |
0 |
-30000 |
Notice here it picks 30,000 as the step that I would have chosen "intuitively" in the OP
root@host:~# ./test.pl -1 1000001
Min 111112, Max 122223
120000 : 34.166735
1080000 |
960000 |
840000 |
720000 |
600000 |
480000 |
360000 |
240000 |
120000 |
0 |
-120000 |
And this is certainly better than the 300,000 step my previous implementation would yield.
So now, I'm only facing the remaining problem of making the auditioning process presuppose likely candidates. Knowing that some data sets will get into the 10^9 range, that means we can be looking at step values in the 10^8 range. With a 10% difference between lowest and highest candidate values, that means we could be performing 10^7 auditions.
Here's what I'm thinking for a recursive implementation:
Original data range: 10^9
Range of viable steps: 10^7
Original data range: 10^7
Range of viable steps: 10^5
Original data range: 10^5
Range of viable steps: 10^3
Original data range: 10^3
Range of viable steps: 10^1
If we're testing 10 candidates at each stage, we end up with 10*4 auditions for a 10^9 data range as opposed to 10^7 auditions. A 25,000 times more efficient algorithm!
I think it's really just a question of implementation from here.
Of course, I'd love to get community input on this.
Matt
@tim: thanks again for opening my eyes to the concept of calculating the step first...I think that approach really yielded much more usable results.
tim2718281 11-10-2009, 08:26 PM It's an interesting problem
I'm not sure there are that many candidates.
It seems to me that the numbers you'd use may be of the form
s * u / v
where
s is a scale factor, such as 10000
u and v are who numbers from 1 to 10
A couple of minutes with a spreadsheet yielded the following set of possible choices
1 2 3 4 5 6 7 8 9 10
1 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
2 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000
3 3333.33 6666.67 10000 13333.33 16666.67 20000 23333.33 26666.67 30000 33333.33
4 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000
5 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
6 1666.67 3333.33 5000 6666.67 8333.33 10000 11666.67 13333.33 15000 16666.67
7 1428.57 2857.14 4285.71 5714.29 7142.86 8571.43 10000 11428.57 12857.14 14285.71
8 1250 2500 3750 5000 6250 7500 8750 10000 11250 12500
9 1111.11 2222.22 3333.33 4444.44 5555.56 6666.67 7777.78 8888.89 10000 11111.11
10 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
I wonder, are there any numbers you'd use that cannot be formed by multiplying a number on the list times a power of ten?
mattle 11-11-2009, 10:41 AM Tim...that's an interesting concept...I'm curious how you'd implement it given the inputs of only the endpoints of the data range.
It seems that s would have to be determined by the data range, as it limits the maximum high point on the graph--which again yields the same troublesome question...how do you dynamically pick a logical value for s? Recall that using a factor of ten was part of the original process that yielded graph ranges that were too wide (the -1 to 1,000,001 case).
I also don't see there ever being a case where I'd want a y-axis value that isn't an integer, so if u represents the individual step values, the only valid values of v would be those that are factors of s. Restricting s to 10^n (n > 0) then restricts v to (1, 2, 5, 10) ( U (4) where n > 1 ) ( U (8) where n > 2 )
Also, the data sets produced here would need to be "shiftable" to accommodate negative numbers in the graph range. This gets tricky with zero-crossings. Say I've got a data range of -10 to 9990. Assuming we use 10,000 for s and 10 for v (as the row in the chart that has an exact matching range). We can shift the range by ten to come up with a set of points that encompass the data:
-10 990 1990 2990 3990 4990 5990 6990 7990 8990 9990
However, I wouldn't argue that that is the best set of values for our data. Assuming that we don't want fractional y-axis points, we need to keep moving up the chart to find a graph range that suits all of our needs. The first candidate is v=8, shifted by 1250:
-1250 0 1250 2500 3750 5000 6250 7500 8750 10000 11250
Which is not bad. By comparison, my implementation yields:
-1200 0 1200 2400 3600 4800 6000 7200 8400 9600 10800
(although with a minimum step that yields a zero-crossing calculated at 1110 and a maximum allowable step at 10% greater being 1221, it did not get as far as to even audition a step of 1250)
It is of course possible that we could arrive at the same set of numbers if s can be a number that is not a power of ten, or is we expand the range of possible values for v.
Matt
tim2718281 11-12-2009, 08:02 AM Tim...that's an interesting concept...I'm curious how you'd implement it given the inputs of only the endpoints of the data range.
OK
1) take input min_value, max_value, number_of_steps
2) calculate range, = max(max_value, 0 - min_value, max_value - min_value)
(Both min_value and max_value can have the same sign, and the range must include 0)
3) calculate the minimum and maximum stepsize; the maximum is simply the range, and the minimum is the range divided by the number of steps.
4) Now the tricky bit; I suggested candidates might be of the form (u/v)*10^p; and you said you only wanted integers.
So have three nested loops, with u and v ranging from 1 to 10, and p from 1 to some maximum ... eg 10.
Calculate the result from the formula (u/v)*10^p, and if it's an integer between minum and maximum stepsize, add it to the list of candidates.
5) Sort the list of candidates and print them or whatever.
Here's quick and dirty PHP code to explain that:
<?php
$lower = $argv[1];
$upper = $argv[2];
$steps = $argv[3];
$maxpower = 10;
print "lower is " . $lower . "\n" ;
print "upper is " . $upper . "\n" ;
print "steps is " . $steps . "\n" ;
$range = $upper - $lower ;
print "range is " . $range . "\n" ;
$minstep = $range / $steps;
$maxstep = max($range, abs($upper) , abs($lower)) ;
print "minimum step is " . $minstep . "\n" ;
print "maximum step is " . $maxstep . "\n" ;
print "calculating ...";
$maxval = 10;
$ncandidates = 0;
for ($u = 1; $u <= 10; $u++) {
for ($v = 1; $v <= 10; $v++) {
$baseval = $u / $v ;
for ($p = 0; $p <= $maxpower; $p++) {
$factor = pow(10 , $p);
$t = $baseval * $factor ;
if ((floor($t) == $t) & $t >= $minstep & $t <= $maxstep) {
$candidates[$ncandidates++] = $t;
}
}
}
}
sort($candidates);
var_dump(array_unique($candidates));
?>
And here is the result of
php graph.php -10 9990 10
[tim@localhost programming]$ php graph.php -10 9990 10
lower is -10
upper is 9990
steps is 10
range is 10000
minimum step is 1000
maximum step is 10000
calculating ...array(26) {
[0]=>
int(1000)
[12]=>
float(1125)
[13]=>
float(1200)
[14]=>
float(1250)
[17]=>
float(1400)
[18]=>
float(1500)
[21]=>
float(1600)
[22]=>
float(1750)
[23]=>
float(1800)
[24]=>
int(2000)
[31]=>
float(2250)
[32]=>
float(2500)
[36]=>
int(3000)
[40]=>
float(3500)
[41]=>
float(3750)
[42]=>
float(4000)
[46]=>
float(4500)
[47]=>
float(5000)
[54]=>
float(6000)
[57]=>
float(6250)
[58]=>
float(7000)
[60]=>
float(7500)
[62]=>
float(8000)
[65]=>
float(8750)
[66]=>
int(9000)
[68]=>
int(10000)
}
Next, you'd write code to calculate the sum of the number of positive steps and the number of negative steps.
So for -10 to 9990 and candidate 1000, there would be 10 positive steps and 1 negative step, for a total of 11, which is more than allowed. So 1000 would be excluded.
Next is 1125, which would give 9 positive steps and one negative, which is 10 steps; so that's a possibility.
But 1100 is missing! That's because with u and v ranging from 1 to 10, and multiplying by powers of 10, there's so way to generate multiples of 11. (Sure, 1100 would be excluded anyway from -10 to 9990, but it ought to be a candidate. Candidates depend on range, not specific values.)
So a quick update to the code, to have u and v range from say 1 to 16.
And now we get
[0]=>
int(1000)
[18]=>
float(1100)
[20]=>
float(1125)
[21]=>
float(1200)
[24]=>
float(1250)
[29]=>
int(1300)
[31]=>
float(1375)
[32]=>
float(1400)
[35]=>
float(1500)
[41]=>
float(1600)
[44]=>
float(1625)
[45]=>
float(1750)
[47]=>
float(1800)
[48]=>
float(1875)
[50]=>
int(2000)
...
|