By Dr. Cara Touretzky, Dr. Robert Luce, and Dr. David Torres Sanchez
For those of us who like to focus on the art of modeling, it’s easy to take for granted the influence of different hardware options on optimization performance. And with a reliable solver like Gurobi, modelers don’t have to spend time contemplating the algorithms being deployed under the hood.
But hardware options have been evolving, and so have the algorithms used in the optimization space. While the simplex and barrier methods for solving LPs are well established (and continuing to improve each year), the Primal-Dual Hybrid Gradient method (PDHG, also known as PDLP when enhanced and adapted specifically to Linear Programming) has recently received significant attention because it is amenable to a GPU-based implementation.
With new GPU-enabled algorithms for barrier and PDHG, the average optimization user benefits from understanding the impact of hardware and algorithm choice when deploying optimization models—especially if they are interested in finding the combination that results in the fastest possible solution times.
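To make this concrete, here is a minimal gurobipy sketch of how one might try both LP algorithms on the same model and compare solve times. The model file name is a placeholder, and the Method value used for PDHG (6) is an assumption based on recent Gurobi releases; check the reference manual for your version.

```python
import gurobipy as gp
from gurobipy import GRB

# "giant_model.mps" is a placeholder for your own (large) LP file.
base = gp.read("giant_model.mps")

# Method=2 selects barrier; Method=6 is assumed to select PDHG in recent
# Gurobi releases (check the reference manual for your version).
for method, name in [(2, "barrier"), (6, "PDHG")]:
    m = base.copy()
    m.Params.Method = method
    m.optimize()
    if m.Status == GRB.OPTIMAL:
        print(f"{name}: {m.Runtime:.1f}s, objective {m.ObjVal:.6g}")
```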
Spoiler alert: there is no silver bullet when it comes to hardware and algorithm selection. The classic advice of “it depends on your model” still applies. At Gurobi, we have done a thorough comparison of these algorithms and hardware options. We want to share our methodology and industry-proven benchmark results with the optimization community to help navigate these choices.
This article will demystify the applicable algorithms and reveal new opportunities for the field of optimization. Specifically, we have observed that giant LPs (and possibly MIPs that rely on giant LP relaxations) show the most immediate promise for experiencing a significant speed boost from GPU-enabled LP algorithms.
Gurobi has a proven track record of taking innovations in the optimization space and turning them into practical tools for industry applications. Historically, this has been achieved through close collaboration with Gurobi customers, who are pushing the frontier of our solver's capabilities. Now that Gurobi has established a benchmark methodology for comparing the various hardware and algorithm choices, we are ready and waiting to test your real-world models. Please reach out through "How do I get support for using Gurobi?" if you are interested in testing your models' performance with a GPU-accelerated solver.
We have investigated multiple setups for solving giant LPs, with a combination of new hardware and algorithm options.
Hardware options we've investigated: two CPU machines (referred to below as CPU 1 and CPU 2, with CPU 2 being the faster of the two) and a GPU.
For the CPU options, all benchmarks were run using 64 cores rather than the total available: the returns from using more than 64 cores diminished on both machines, and fixing the core count for all tests enabled a simple comparison.
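For anyone reproducing this kind of comparison, the core count can be pinned with Gurobi's Threads parameter; a short sketch (again with a placeholder model file):

```python
import gurobipy as gp

m = gp.read("giant_model.mps")  # placeholder model file
m.Params.Threads = 64           # cap CPU parallelism at 64 cores, as in our benchmarks
m.optimize()
```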
Gurobi's algorithmic options for giant LPs are (1) the barrier algorithm and (2) PDHG, a first-order method. The two differ significantly in their computational footprint: barrier performs a relatively small number of expensive iterations, each requiring the factorization of a large sparse linear system, whereas PDHG performs many cheap iterations that need only sparse matrix-vector products, which is exactly what makes it so amenable to a GPU implementation.
If you’re wondering why the primal and dual simplex methods are not part of this discussion, it is because these algorithms are already difficult to parallelize on CPU hardware, and even more so on the GPU.
The two algorithms also converge very differently. Barrier methods have the benefit of locally quadratic convergence: once the method gets close to a solution, it shrinks the residual to a small value very quickly. PDHG, on the other hand, converges at best linearly, so it can take much longer to reach a tight tolerance threshold.
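The practical consequence is easy to see with a toy calculation (plain Python, not solver code) that counts how many idealized iterations each convergence rate needs to push a residual below 1e-6; the contraction factor of 0.9 is an arbitrary illustrative choice.

```python
# Toy illustration: iterations needed to drive a residual below 1e-6
# under idealized linear vs. locally quadratic convergence.
def iterations_to_tol(update, r0=1.0, tol=1e-6, max_iter=10_000):
    r, k = r0, 0
    while r > tol and k < max_iter:
        r = update(r)
        k += 1
    return k

linear = lambda r: 0.9 * r                 # linear: shrink by a constant factor
quadratic = lambda r: min(0.9 * r, r * r)  # quadratic once r < 1: residual is squared

print("linear convergence:   ", iterations_to_tol(linear), "iterations")
print("quadratic convergence:", iterations_to_tol(quadratic), "iterations")
```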
The tolerance used to define convergence can vary, but industry practice generally agrees that residuals below 1e-6 are acceptable for a quality solution. So what is the risk of loosening the convergence tolerance when you want a solution as fast as possible? To answer that question, we have to talk about crossover and how it can add extra time on top of the barrier and PDHG algorithms to improve their results.
Crossover is an essential but often forgotten step when using a non-simplex method for LPs. In short, crossover translates the LP solution into a "clean basic solution." Solutions from an interior-point algorithm (or a first-order method) typically contain many noisy values close to zero (making them dense) and, more importantly, may carry small violations of primal feasibility, dual feasibility, and complementary slackness. Crossover is very effective at cleaning up these violations and delivering a clean, sparse basic solution.
Most of our customers regard a basic solution as essential to interpretability. Therefore, when evaluating these new algorithms, we feel it is necessary to also account for the time needed to complete crossover. The importance of this will be made even clearer in the results below, because for some algorithms, there is a clear tradeoff between accuracy and the time needed to complete crossover.
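If a basic solution is not needed, crossover can be switched off explicitly; the sketch below (again with a placeholder model file) uses Gurobi's Crossover parameter and then inspects the interior solution for the small nonzero values discussed above.

```python
import gurobipy as gp

m = gp.read("giant_model.mps")   # placeholder model file
m.Params.Method = 2              # barrier
m.Params.Crossover = 0           # skip crossover: faster, but no basic solution
m.optimize()

if m.SolCount > 0:
    # Without crossover the solution is an interior point: expect many tiny,
    # "noisy" nonzero values instead of a clean, sparse basic solution.
    tiny = sum(1 for v in m.getVars() if 0 < abs(v.X) < 1e-9)
    print(f"variables with tiny nonzero values: {tiny}")
```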
Of course, one would hope that the “best” or “fastest” machine would solve all optimization problems equally fast. In practice, however, one can observe significant variation across the different hardware options and the two algorithms mentioned above.
We give two insightful examples from the MIPLIB public library of models and conclude with a comment on the results from our internal benchmark set. The MIPLIB examples were selected because they show the variation in solution times that the combination of model, algorithm, and hardware can produce. For a more in-depth analysis of these results and additional examples, see "New Options for Solving Giant LPs" on the Gurobi website.
Key takeaways from our experiments: no single hardware and algorithm combination wins across all models; the GPU versions of barrier and PDHG can deliver meaningful speedups over their CPU counterparts; and crossover can dominate the total runtime, especially when PDHG is run with a loose tolerance.
Example Model 1:
The first model ("rwth-timetable") is the LP relaxation of a timetabling problem whose large number of rows and columns results in high computational effort for the barrier algorithm. The statistics are as follows:
rwth-timetable: 440,134 rows, 923,564 columns, and 4,510,786 nonzeros
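These statistics can be reproduced from the MIPLIB instance itself; a minimal gurobipy sketch, assuming the instance file has been downloaded to the working directory:

```python
import gurobipy as gp

# Read the MIPLIB instance and take its LP relaxation.
mip = gp.read("rwth-timetable.mps.gz")
lp = mip.relax()
print(f"rows: {lp.NumConstrs:,}  columns: {lp.NumVars:,}  nonzeros: {lp.NumNZs:,}")
```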
We visualize the comparison of the different hardware and algorithmic options explained above by means of a bar plot:
Each bar of the plot corresponds to a different hardware and algorithm combination; note that PDHG appears several times, each bar with a different relative tolerance as its termination criterion. Each bar is split into two colors: time spent performing iterations of the respective algorithm (red) and time spent in crossover (green). Of the hardware listed above, the first four bars correspond to CPU 1, the following bar corresponds to CPU 2, and the last two bars correspond to the GPU.
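A simplified harness for producing this kind of comparison on your own hardware might look like the sketch below. It records only the total runtime per configuration (separating out crossover time, as our plot does, would require parsing the solver log), and the Method value for PDHG is again an assumption to be checked against your Gurobi version.

```python
import gurobipy as gp
import matplotlib.pyplot as plt

# Configurations to compare. Method=6 for PDHG is an assumption; our plot
# additionally varies the PDHG termination tolerance, which is omitted here.
configs = [
    ("barrier", {"Method": 2}),
    ("PDHG",    {"Method": 6}),
]

runtimes = []
for label, params in configs:
    m = gp.read("rwth-timetable.mps.gz").relax()
    for name, value in params.items():
        m.setParam(name, value)
    m.optimize()
    runtimes.append(m.Runtime)  # total runtime, including any crossover

plt.bar([label for label, _ in configs], runtimes)
plt.ylabel("runtime (s)")
plt.title("rwth-timetable LP relaxation")
plt.savefig("runtime_comparison.png")
```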
In this first example, PDHG with a tight convergence tolerance (1e-6) is the clear winner, regardless of the hardware, CPU 1 or GPU. As shown by the last two bars, both algorithms running on the GPU improve on their CPU counterparts: PDHG on the GPU shows a 62.71% improvement over CPU 1, while barrier on the GPU shows a 50% improvement over CPU 1 and only a modest 7% improvement over CPU 2.
Another interesting takeaway is that PDHG with looser tolerances, as expected, spends less time performing iterations. Unfortunately, this does not reduce the total solution time, because crossover then accounts for a large share of it: crossover benefits greatly from being handed a higher-quality solution.
Example Model 2:
The second problem is the LP relaxation of an open-pit mining problem (“rmine25”) that has a huge number of rows:
rmine25: 2,953,849 rows, 326,599 columns, 7,182,744 nonzeros
In the rmine25 results, we can see that barrier is the algorithm of choice across all hardware options, with an even greater advantage on the faster CPU 2. Both the CPU 2 and GPU versions of barrier show a speedup of nearly 40% over CPU 1. Interestingly, the time spent in crossover dominates the PDHG runtime, even for a convergence tolerance as tight as 1e-6.
With results like these, it is hard to predict which algorithm and hardware combination will be best for a given model, even before considering the effect of the termination tolerance on crossover. Put differently, it is safe to say there is no one-size-fits-all combination that is set to win! And we haven't even factored in a range of other interesting phenomena: for example, although the barrier runtimes of CPU 2 and the GPU are largely comparable on these two models, the throughput of the factorization performed by cuDSS can be much higher than the throughput we achieve on the CPU.
Whether you identify as an “optimization person” who wanted to learn more about your hardware options, or a “high-performance computing person” who wanted to learn more about optimization, we hope this article guided you to the same place of appreciation for the latest initiatives in the field of GPU-enabled optimization algorithms.
As was stated earlier, there is no silver bullet when it comes to hardware and algorithm selection for optimization. We still recommend treating every model as a unique creation that must be tested to see which combination is the best fit. The two examples shown earlier reflect the impact that these choices can have, given the currently available state-of-the-art hardware options.
Because the hardware and related libraries (recall, we leveraged NVIDIA’s cuDSS and cuSparse libraries for our GPU-enabled algorithms) are continuously evolving, we expect the results of our benchmark to change rapidly—and it’s possible many more use cases will benefit from barrier and PDHG on GPU in the future.
At this point you may be wondering why we are only sharing results for models from the MIPLIB database when Gurobi has many real-world case studies. Our preliminary internal benchmarks show only 72 out of 2,633 models where PDHG beats our default settings (and only 17 of those show more than a factor-of-2 improvement). But by the nature of our collection, this set is heavily biased towards LPs that we can already solve with either simplex or interior-point methods. Why does our test set not include giant LPs that we cannot solve? That is almost a philosophical question: why would someone build a giant LP if they expect that current solver technology cannot handle it?
This is an active research area, and perception of what is “giant” is always shifting. With the new GPU-enabled algorithms, our initial impression is that they will offer the most benefit to models so large that people have not yet dared to build them. Of course, we’re also exploring models that may not be “giant,” but have a level of complexity and heavy compute needs that will warrant using the new algorithms and GPU hardware.
Interested in qualifying your model for GPU-accelerated optimization? Please reach out through "How do I get support for using Gurobi?"