Scaling using Perfect Models

This document shows what plots of different scaling metrics look like for perfect models with well-understood properties. The plots illustrate the metrics outlined in "Formal Metrics for Large-Scale Parallel Performance" by Moreland and Oldfield.

In [1]:
plot_width = 400
plot_height = 300

Creating the Data

First, the preliminaries. Here are the Python modules we depend on.

In [2]:
import numpy
import toyplot.pdf

Pick some problem size.

In [3]:
data_size = 10000

Create an array representing the number of processing elements. We arbitrarily pick the range from 2^4 to 2^12.

In [4]:
def create_pe_array(low, high):
    # Sample processing element counts from low to high (inclusive) in
    # steps of low/2, returned as a floating-point array.
    return numpy.asfarray(numpy.array(xrange(low, high+1, low/2)))

pe = create_pe_array(pow(2,4), pow(2,12))
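
As a quick sanity check (illustrative only; nothing below depends on it), here is what the function produces for a small range and the extent of the array just created:

small_pe = create_pe_array(16, 64)
print(small_pe)   # -> 16., 24., 32., ..., 64. (steps of low/2)
print(len(pe))    # 511 samples running from 2^4 = 16 up to 2^12 = 4096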

Create times for perfect scaling.

In [5]:
def create_perfect_times(data_size, pe):
    return data_size/pe

perfect_times = create_perfect_times(data_size, pe)

Create times for scaling with a logarithmic overhead with respect to the number of processing elements.

In [6]:
def create_log_pe_times(data_size, pe, scale=0.1):
    return create_perfect_times(data_size, pe) + scale*numpy.log2(pe)

log_pe_times = create_log_pe_times(data_size, pe)

Create times for scaling with an overhead that is the square of the number of processing elements (perhaps from all-to-all communication).

In [7]:
def create_sqr_pe_times(data_size, pe, scale=0.0000001):
    return create_perfect_times(data_size, pe) + scale*numpy.power(pe, 2)

sqr_pe_times = create_sqr_pe_times(data_size, pe)

Create times for scaling with an overhead that is linear with respect to the number of processing elements.

In [8]:
def create_linear_pe_times(data_size, pe, scale=0.0007):
    return create_perfect_times(data_size, pe) + scale*pe

linear_pe_times = create_linear_pe_times(data_size, pe)

Create times for scaling behavior governed by Amdahl's law, in which a fixed serial fraction of the work cannot be parallelized.

In [9]:
def create_amdahl_times(data_size, pe, serial_fraction=0.01):
    return data_size*(serial_fraction + (1-serial_fraction)/pe)

amdahl_times = create_amdahl_times(data_size, pe, 10.0/data_size)
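
To get a feel for how strongly even this small serial fraction (10/data_size = 0.001) limits scaling, here is a rough comparison at the largest processing element count (illustrative only; the numbers follow directly from the formulas above):

# The serial term alone costs 10 time units, which dominates the parallel
# term at large PE counts.
print(perfect_times[-1])   # ~2.44  (10000/4096)
print(amdahl_times[-1])    # ~12.44 (10 + 0.999*10000/4096)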

Standard Time Plots

Here we plot these theoretical scaling times using the standard overall time metric.

The first plot shows the time of a perfectly scaling algorithm with linear scaling on both axes; it is saved as TimeLinearPerfect.pdf. The shape is a hyperbolic curve.

In [11]:
start_scale = numpy.searchsorted(pe, 256)

canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Time')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([0, perfect_times[start_scale]], ['',''])
axes.y.domain.max = perfect_times[start_scale]

axes.plot(pe, perfect_times)

toyplot.pdf.render(canvas, 'TimeLinearPerfect.pdf')    
canvas
Out[11]:
[Plot: Time vs. Number of Processing Elements]

This plot again shows the time of the perfectly scaling algorithm, this time with log-log scaling on the axes. It is saved as TimeLogPerfect.pdf. The shape is a straight line with a negative slope, since for the perfect model log(time) = log(data_size) - log(pe).

In [12]:
axes.x.scale = 'log2'
axes.y.scale = 'log10'
axes.y.domain.max = perfect_times[0]
axes.x.ticks.locator = \
    toyplot.locator.Explicit([pe[0], pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([perfect_times[-1], perfect_times[start_scale]], ['',''])

toyplot.pdf.render(canvas, 'TimeLogPerfect.pdf')    
canvas
Out[12]:
[Plot: Time vs. Number of Processing Elements]

Now we repeat the plots with some theoretical measurements from an algorithm that scales less than perfectly. This first model has an overhead that grows logarithmically with respect to the number of processing elements. That is, every time you double the number of processing elements, the overhead increases by a constant amount (0.1 time units with the scale factor used above). This is a noticeable overhead that is generally tolerated but should be documented. First we plot the time response using linear scales. The resulting plot, stored in TimeLinearLog.pdf, is almost indistinguishable from the perfect response.
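
As a quick numeric check of that constant-increment claim (illustrative only; not one of the saved plots):

# The overhead term scale*log2(pe) grows by a constant 0.1 every time the
# number of processing elements doubles.
for p in (256.0, 512.0, 1024.0):
    overhead = create_log_pe_times(data_size, p) - create_perfect_times(data_size, p)
    print(overhead)   # ~0.8, ~0.9, ~1.0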

In [13]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Time')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([0, log_pe_times[start_scale]], ['',''])
axes.y.domain.max = log_pe_times[start_scale]

axes.plot(pe, log_pe_times)

axes.plot(pe, log_pe_times[0]*pe[0]/pe,
          style={'stroke':'gray', 'stroke-width':0.5, 'stroke-dasharray':'5,5'})

toyplot.pdf.render(canvas, 'TimeLinearLog.pdf')
canvas
Out[13]:
[Plot: Time vs. Number of Processing Elements]

Repeat the same time response plot on the same data, except this time use log-log scaling on the plot's axes. The resulting plot, stored in TimeLogLog.pdf, is nearly the same straight line as in the perfect case, with only a slight uptick at the lower right.

In [14]:
axes.x.scale = 'log2'
axes.y.scale = 'log10'
axes.y.domain.max = log_pe_times[0]
axes.x.ticks.locator = \
    toyplot.locator.Explicit([pe[0], pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([log_pe_times[-1], log_pe_times[start_scale]], ['',''])

toyplot.pdf.render(canvas, 'TimeLogLog.pdf')
canvas
Out[14]:
[Plot: Time vs. Number of Processing Elements]

Now let us look at an algorithm that does not scale well. This theoretical timing has an overhead proportional to the number of processing elements, which is poor scaling behavior. Even so, the time response, shown in TimeLinearLinear.pdf, looks about the same as the perfect response.
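
To put a number on how much the linear overhead costs (illustrative only; not one of the saved plots):

# At the largest PE count the linear overhead has already more than doubled
# the run time, even though the linear-scale time plot below largely hides it.
print(perfect_times[-1])     # ~2.44
print(linear_pe_times[-1])   # ~5.31  (2.44 + 0.0007*4096)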

In [15]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Time')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([0, linear_pe_times[start_scale]], ['',''])
axes.y.domain.max = linear_pe_times[start_scale]

axes.plot(pe, linear_pe_times)

axes.plot(pe, linear_pe_times[0]*pe[0]/pe,
          style={'stroke':'gray', 'stroke-width':0.5, 'stroke-dasharray':'5,5'})

toyplot.pdf.render(canvas, 'TimeLinearLinear.pdf')
canvas
Out[15]:
[Plot: Time vs. Number of Processing Elements]

Repeat the same time response plot on the same data, except this time use log-log scaling on the plot's axes. The resulting plot, stored in TimeLogLinear.pdf, at least has a noticeable uptick in the lower right, but not nearly enough to represent the very poor scaling of the algorithm.

In [16]:
axes.x.scale = 'log2'
axes.y.scale = 'log10'
axes.y.domain.max = linear_pe_times[0]
axes.x.ticks.locator = \
    toyplot.locator.Explicit([pe[0], pe[-1]], ['',''])
axes.y.ticks.locator = \
    toyplot.locator.Explicit([linear_pe_times[-1], linear_pe_times[start_scale]], ['',''])

toyplot.pdf.render(canvas, 'TimeLogLinear.pdf')
canvas
Out[16]:
[Plot: Time vs. Number of Processing Elements]

Rate Plots

Repeat the plots for the theoretical time measurements above, except this time plot the rate metric.
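
The rate used here is simply the amount of data processed per unit time; the cells below compute it inline as data_size/times. An equivalent helper would look like the following sketch (compute_rate is a hypothetical convenience, not used elsewhere in this notebook):

def compute_rate(data_size, times):
    # Data processed per unit time. For the perfect model, where
    # times = data_size/pe, this reduces to exactly pe, which is why the
    # perfect rate curve below is a straight line through the origin.
    return data_size / times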

This is the rate response of the perfectly scaling algorithm. Shown in RatePerfect.pdf, it is a straight line with a positive slope.

In [17]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Rate')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = toyplot.locator.Explicit([0], [''])

axes.plot(pe, data_size/perfect_times)

toyplot.pdf.render(canvas, 'RatePerfect.pdf')
canvas
Out[17]:
[Plot: Rate vs. Number of Processing Elements]

Here is the rate response of the algorithm with an overhead that is logarithmic with respect to the number of processing elements (noticeable but tolerable). Shown in RateLog.pdf, we can easily see the rate falling away from the ideal (dashed) line.

In [18]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Rate')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = toyplot.locator.Explicit([0], [''])

axes.plot(pe, data_size/log_pe_times)

axes.plot(pe, pe*data_size/(log_pe_times[0]*pe[0]),
          style={'stroke':'gray', 'stroke-width':0.5, 'stroke-dasharray':'5,5'})

toyplot.pdf.render(canvas, 'RateLog.pdf')
canvas
Out[18]:
[Plot: Rate vs. Number of Processing Elements]

Here is the rate response of the algorithm with an overhead that is linear with respect to the number of processing elements (usually too high to be practical for large scales). Shown in RateLinear.pdf, we can easily see how the performance flatlines.
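
A quick check of where the flattening happens (illustrative only): the time data_size/pe + scale*pe is minimized, and the rate therefore maximized, near pe = sqrt(data_size/scale), which is roughly 3780 for these parameters.

# Locate the peak of the rate curve; it sits near sqrt(data_size/scale).
print(pe[numpy.argmax(data_size/linear_pe_times)])   # ~3780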

In [19]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Rate')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = toyplot.locator.Explicit([0], [''])

axes.plot(pe, data_size/linear_pe_times)

axes.plot(pe, pe*data_size/(linear_pe_times[0]*pe[0]),
          style={'stroke':'gray', 'stroke-width':0.5, 'stroke-dasharray':'5,5'})

toyplot.pdf.render(canvas, 'RateLinear.pdf')
canvas
Out[19]:
[Plot: Rate vs. Number of Processing Elements]

Efficiency Plots

Repeat the plots for the theoretical time measurements above, except this time plot the efficiency metric.

In [20]:
def compute_efficiency_single_series(pe, times):
    costs = pe*times
    optimal_cost = numpy.min(costs)
    return optimal_cost/costs
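
As a quick check (illustrative only): for the perfectly scaling times the cost pe*times is the same at every scale, so the efficiency should be 1 everywhere.

# Perfect scaling has constant cost, hence an efficiency of 1 at every scale.
print(numpy.allclose(compute_efficiency_single_series(pe, perfect_times), 1.0))   # True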

This is the efficiency response of the perfectly scaling algorithm. Shown in EfficiencyPerfect.pdf, it is by definition always 1.

In [21]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Efficiency')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([numpy.min(pe), numpy.max(pe)], ['',''])
axes.y.domain.min = 0
axes.y.domain.max = 1
axes.y.ticks.show = True

axes.plot(pe, compute_efficiency_single_series(pe, perfect_times))

toyplot.pdf.render(canvas, 'EfficiencyPerfect.pdf')
canvas
Out[21]:
[Plot: Efficiency (0.0 to 1.0) vs. Number of Processing Elements]

Here is the efficiency response of the algorithm with an overhead that is logarithmic with respect to the number of processing elements (noticeable but tolerable). Shown in EfficiencyLog.pdf, we can easily see the drop in efficiency.

In [22]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Efficiency')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([numpy.min(pe), numpy.max(pe)], ['',''])
axes.y.domain.min = 0
axes.y.domain.max = 1
axes.y.ticks.show = True

axes.plot(pe, compute_efficiency_single_series(pe, log_pe_times))

toyplot.pdf.render(canvas, 'EfficiencyLog.pdf')
canvas
Out[22]:
[Plot: Efficiency (0.0 to 1.0) vs. Number of Processing Elements]

Here is the efficiency response of the algorithm with an overhead that is linear with respect to the number of processing elements (usually too high to be practical for large scales). Shown in EfficiencyLinear.pdf, we can easily see how the performance drops.

In [23]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Efficiency')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([numpy.min(pe), numpy.max(pe)], ['',''])
axes.y.domain.min = 0
axes.y.domain.max = 1
axes.y.ticks.show = True

axes.plot(pe, compute_efficiency_single_series(pe, linear_pe_times))

toyplot.pdf.render(canvas, 'EfficiencyLinear.pdf')
canvas
Out[23]:
[Plot: Efficiency (0.0 to 1.0) vs. Number of Processing Elements]

Combined Scale Plots

A nice feature of the rate and efficiency metrics as we defined them is that you can plot measurements from runs on different data sizes and compare them on the same axes. Below are examples of doing this for rate (CombinedRate.pdf) and efficiency (CombinedEfficiency.pdf).
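
The normalization that makes this comparison possible is the cost per unit of data, which the loop below computes inline as (pe*times)/data_size; the smallest cost per unit observed across all runs then serves as the shared ideal baseline for both plots. A minimal sketch of that idea (compute_cost_per_unit is a hypothetical helper; the cells below use the inline expression):

def compute_cost_per_unit(pe, times, data_size):
    # Total PE-time spent per unit of data processed. Dividing by data_size
    # lets runs with different problem sizes share one ideal baseline.
    return (pe * times) / data_size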

In [24]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Rate')
axes.x.ticks.locator = \
    toyplot.locator.Explicit([0, pe[-1]], ['',''])
axes.y.ticks.locator = toyplot.locator.Explicit([0], [''])

min_cost_per_unit = numpy.Inf
for scale in xrange(0, 4):
    data_size = 10000*pow(2,scale)
    pe = create_pe_array(pow(2,8+scale), pow(2,12+scale))
    times = create_log_pe_times(data_size, pe)
    rates = data_size/times
    axes.plot(pe, rates)
    axes.text(pe[-1], rates[-1], '%d TB' % (data_size/10000),
              style={'text-anchor':'start',
                     '-toyplot-anchor-shift':'5px'})
    costs_per_unit = (pe*times)/data_size
    min_cost_per_unit = min(min_cost_per_unit, numpy.min(costs_per_unit))
    
pe = create_pe_array(pow(2,8), pow(2,14)+pow(2,12))
ideal_rate = pe/min_cost_per_unit
axes.plot(pe, ideal_rate,
          style={'stroke':'gray', 'stroke-width':0.5, 'stroke-dasharray':'5,5'})
axes.text(pe[-1], ideal_rate[-1], 'Ideal',
          style={'fill':'lightgray', 'baseline-shift':'50%'})

toyplot.pdf.render(canvas, 'CombinedRate.pdf')
canvas
Out[24]:
[Plot: Rate vs. Number of Processing Elements for the 1 TB, 2 TB, 4 TB, and 8 TB runs, with the ideal rate shown as a dashed line]
In [25]:
canvas = toyplot.Canvas(plot_width, plot_height)
axes = canvas.axes(xlabel='Number of Processing Elements',
                   ylabel='Efficiency')
axes.y.domain.min = 0
axes.y.domain.max = 1
axes.x.domain.min = 0
axes.x.ticks.locator = \
    toyplot.locator.Explicit([numpy.min(pe), numpy.max(pe)], ['',''])

for scale in xrange(0, 4):
    data_size = 10000*pow(2,scale)
    pe = create_pe_array(pow(2,8+scale), pow(2,12+scale))
    times = create_log_pe_times(data_size, pe)
    costs_per_unit = (pe*times)/data_size
    efficiencies = min_cost_per_unit/costs_per_unit
    axes.plot(pe, efficiencies)
    axes.text(pe[-1], efficiencies[-1], '%d TB' % (data_size/10000),
              style={'baseline-shift':'-65%',
                     'text-anchor':'middle'})

toyplot.pdf.render(canvas, 'CombinedEfficiency.pdf')
canvas
Out[25]:
[Plot: Efficiency (0.0 to 1.0) vs. Number of Processing Elements for the 1 TB, 2 TB, 4 TB, and 8 TB runs]