By Dr. Cara Touretzky, Dr. Robert Luce, and Dr. David Torres Sanchez

Why Pay Attention to Hardware and Algorithm Choice? 

For those of us who like to focus on the art of modeling, it’s easy to overlook the influence of different hardware options on optimization performance. And with a reliable solver like Gurobi, modelers don’t have to spend time contemplating the algorithms being deployed under the hood.

But hardware options have been evolving, and so have the algorithms used in the optimization space. While the simplex and barrier methods for solving LPs are well established (and continuing to improve each year), the Primal-Dual Hybrid Gradient method (PDHG, also known as PDLP when enhanced and adapted specifically to Linear Programming) has recently received significant attention because it is amenable to a GPU-based implementation.  

With new GPU-enabled algorithms for barrier and PDHG, the average optimization user benefits from understanding the impact of hardware and algorithm choice when deploying optimization models—especially if they are interested in finding the combination that results in the fastest possible solution times.   

Spoiler alert: there is no silver bullet when it comes to hardware and algorithm selection. The classic advice of “it depends on your model” still applies. At Gurobi, we have done a thorough comparison of these algorithms and hardware options. We want to share our methodology and industry-proven benchmark results with the optimization community to help navigate these choices.  

This article will demystify the applicable algorithms and reveal new opportunities for the field of optimization. Specifically, we have observed that giant LPs (and possibly MIPs that rely on giant LP relaxations) show the most immediate promise for a significant speed boost from GPU-enabled LP algorithms.

Gurobi has a proven track record of taking new innovations in the optimization space and turning them into practical tools for industry applications. Historically, this has been achieved through close collaboration with Gurobi customers, who are pushing the frontier of our solver’s capabilities. Now that Gurobi has established a benchmark methodology for comparing the various hardware and algorithm choices, we are ready and waiting to test your real-world models. Please reach out through How do I get support for using Gurobi? if you are interested in testing your models’ performance with a GPU-accelerated solver.   

Demystifying the Optimization and GPU Space 

We have investigated multiple setups for solving giant LPs, with a combination of new hardware and algorithm options. 

Hardware options we’ve investigated: 

  1. CPU 1: The CPU of the NVIDIA GH200 Grace Hopper Superchip, with 72 cores
  2. CPU 2: AMD’s EPYC 9655, with 96 cores
  3. GPU: The GPU of the same GH200 superchip

For the CPU options, all benchmarks are run using 64 cores rather than the total available. Using more than 64 cores on either machine yielded diminishing returns, and fixing the same core count for all tests enabled a simple comparison.

Gurobi’s algorithmic options for giant LPs are (1) the barrier algorithm, and (2) PDHG, a first-order method. These two algorithms differ significantly in their computational footprint. We’ll summarize the most important properties: 

  1. For the barrier algorithm, the heaviest computational step in each iteration is the solution of a large, sparse system of linear equations. On the GPU, we are using NVIDIA’s cuDSS library to perform this operation, while on the CPU we’re using our standard, in-house, multi-threaded algorithm.
  2. In PDHG, the most significant algorithmic operation per iteration is the multiplication of a sparse matrix (or its transpose) with a dense vector; a minimal sketch of this iteration follows below. On the GPU, we are using NVIDIA’s cuSPARSE library to execute this operation and keep all data resident in the GPU memory, so that no expensive memory transfers are needed. On the CPU, such matrix-vector products are computed using our standard, in-house, multi-threaded algorithm.
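
To make PDHG’s computational footprint concrete, here is a minimal, illustrative sketch of the iteration for an LP of the form min c^T x subject to A x = b, x >= 0, written with NumPy/SciPy sparse matrix-vector products. This is not Gurobi’s PDLP implementation, which adds restarts, adaptive step sizes, scaling, and other enhancements; it only shows why products with A and its transpose dominate each iteration.

```python
import numpy as np
import scipy.sparse as sp

def pdhg_lp(A, b, c, tau, sigma, iters=10000, tol=1e-6):
    """Bare-bones PDHG sketch for:  min c^T x  s.t.  A x = b,  x >= 0.

    Illustrative only. Convergence requires tau * sigma * ||A||_2^2 <= 1;
    real implementations (like Gurobi's PDLP) add restarts, adaptive step
    sizes, scaling, and richer termination tests.
    """
    m, n = A.shape
    x = np.zeros(n)  # primal iterate
    y = np.zeros(m)  # dual iterate
    for k in range(iters):
        # Primal step: one sparse mat-vec with A^T, then projection
        # onto the nonnegative orthant.
        x_new = np.maximum(0.0, x - tau * (c - A.T @ y))
        # Dual step: one sparse mat-vec with A at the extrapolated point.
        y = y + sigma * (b - A @ (2.0 * x_new - x))
        # Simplified termination test on the primal residual only.
        if np.linalg.norm(A @ x_new - b) <= tol * (1.0 + np.linalg.norm(b)):
            return x_new, y
        x = x_new
    return x, y
```

Every iteration touches the constraint matrix only through the products A x and A^T y, which is exactly the operation that cuSPARSE accelerates on the GPU while the data stays resident in GPU memory.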

If you’re wondering why the primal and dual simplex methods are not part of this discussion, it is because these algorithms are already difficult to parallelize on CPU hardware, and even more so on the GPU. 

A Comment on Convergence Tolerance and Crossover 

Convergence behavior differs between the main LP algorithms. Barrier methods have the benefit of displaying locally quadratic convergence, so once the method gets close to a solution, it is very quick to shrink the residual to a small value. PDHG, on the other hand, displays linear convergence at best, so it can take much longer to reach a very tight tolerance.
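
As a toy illustration of what this difference means in practice (idealized rates, not measurements from any solver run), consider how many iterations it takes to drive a residual from 1e-1 down to 1e-6 under linear versus quadratic convergence:

```python
# Toy illustration: iterations needed to push a residual from 1e-1
# down to 1e-6 under idealized convergence rates.

def linear_iters(r0=1e-1, tol=1e-6, rho=0.9):
    """Linear convergence: r_{k+1} <= rho * r_k."""
    r, k = r0, 0
    while r > tol:
        r, k = rho * r, k + 1
    return k

def quadratic_iters(r0=1e-1, tol=1e-6):
    """Quadratic convergence: r_{k+1} <= r_k ** 2 (once close enough)."""
    r, k = r0, 0
    while r > tol:
        r, k = r * r, k + 1
    return k

print(linear_iters())     # ~110 iterations at a contraction factor of 0.9
print(quadratic_iters())  # 3 iterations: 1e-1 -> 1e-2 -> 1e-4 -> 1e-8
```

At a contraction factor of 0.9, the linearly convergent method needs roughly 110 iterations, while the quadratically convergent one needs only 3. Of course, a PDHG iteration (a sparse matrix-vector product) is far cheaper than a barrier iteration (a sparse factorization), which is why the two methods remain competitive with each other.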

The tolerance used to define convergence can vary, but industry practice generally agrees that residuals below 1e-6 are acceptable for a quality solution. What is the risk of loosening (increasing) the convergence tolerance when you are trying to get a solution as fast as possible? To answer that question, we have to talk about crossover and how it can add extra time spent polishing the result of the barrier and PDHG algorithms.

Crossover is an essential but often overlooked step when using a non-simplex method for LPs. In short, crossover translates the LP solution into a clean basic solution. Solutions from an interior-point algorithm (or a first-order method) typically contain many small, noisy values close to zero (making the solution dense) and, more importantly, may still violate various optimality measures (such as primal feasibility, dual feasibility, and complementary slackness). Crossover is very effective at cleaning up these violations and providing a clean, sparse basic solution.

Most of our customers regard a basic solution as essential to interpretability. Therefore, when evaluating these new algorithms, we feel it is necessary to also account for the time needed to complete crossover. The importance of this will be made even clearer in the results below, because for some algorithms, there is a clear tradeoff between accuracy and the time needed to complete crossover. 
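
If you want to explore this tradeoff on your own models, the sketch below shows one possible experiment using gurobipy and standard Gurobi parameters (Threads, Method, BarConvTol, Crossover). The model file name and the parameter values are placeholders to experiment with, not recommendations, and selecting PDHG is not shown here; as mentioned above, please reach out to us if you want to test it on your models.

```python
import gurobipy as gp

# Placeholder file name; substitute your own giant LP.
model = gp.read("giant_lp.mps.gz")

model.Params.Threads = 64       # fix the core count, as in our benchmarks
model.Params.Method = 2         # 2 = barrier
model.Params.BarConvTol = 1e-6  # barrier convergence tolerance
model.Params.Crossover = 0      # 0 disables crossover; -1 (default) lets Gurobi decide

model.optimize()
print(f"status={model.Status}, runtime={model.Runtime:.1f}s")
```

Re-running with Crossover left at its default and with different BarConvTol values makes the accuracy-versus-crossover-time tradeoff visible directly in the solver log.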

Computational Results 

Of course, one would hope that the “best” or “fastest” machine would be the fastest choice for every optimization problem. In practice, however, one can observe significant variation across the different hardware options and the two algorithms mentioned above.

We give two insightful examples from the public MIPLIB model library and conclude with a comment on the results from our internal benchmark set. The MIPLIB examples were selected because they show the possible variation in solution times due to the combination of model, algorithm, and hardware selection. For a more in-depth analysis of these results and additional examples, see New Options for Solving Giant LPs – Gurobi Optimization.

Key takeaways from our experiments: 

  • GPUs typically perform PDHG and barrier iterations much faster than CPUs, but that doesn’t always translate directly into big overall reductions in runtime. 
  • PDHG may be the algorithm of choice for some models, even on a CPU.  
  • There are still cases where barrier is the algorithm of choice, even on the CPU. The improvement from running barrier on the GPU can be significant, but it is not as consistent as for PDHG. 
  • Loosening the convergence tolerance makes iterations faster, especially for PDHG, but crossover may then dominate the total time needed to achieve a quality solution. That is a new tradeoff for practitioners.

Example Model 1:   

The first model (“rwth-timetable”) is an LP relaxation of a timetabling problem with a large number of rows and columns that make the barrier algorithm computationally expensive. The statistics are as follows:

rwth-timetable: 440,134 rows, 923,564 columns and 4,510,786 nonzeros 

We visualize the comparison of the different hardware and algorithmic options explained above by means of a bar plot:  


Each bar of the plot corresponds to a different hardware and algorithm combination; note that PDHG has several bars, each with a different relative tolerance used as the termination criterion. Each bar is split into two colors: time spent performing iterations in the respective algorithm (red) and time spent in crossover (green). Of the hardware options listed above, the first four bars correspond to CPU 1, the following bar corresponds to CPU 2, and the last two bars correspond to the GPU.

In this first example, we can see that PDHG with a tight convergence tolerance (1e-6) is the clear winner, regardless of whether it runs on CPU 1 or the GPU. As shown by the last two bars, both algorithms running on the GPU bring some improvement over their CPU counterparts. Specifically, for PDHG on the GPU, we see an improvement of 62.71% over CPU 1. For barrier on the GPU, we see a 50% improvement over CPU 1, and only a modest 7% improvement over CPU 2.

Another interesting takeaway is that PDHG with looser tolerances, as expected, spends less time performing iterations. Unfortunately, this does not improve the total solution time, because crossover then takes up a large proportion of it: crossover benefits greatly from a higher-quality starting solution.

Example Model 2: 

The second problem is the LP relaxation of an open-pit mining problem (“rmine25”) that has a huge number of rows: 

rmine25: 2,953,849 rows, 326,599 columns, 7,182,744 nonzeros 

In the rmine25 results, we can see that barrier is the algorithm of choice across all hardware options, with a further advantage on the faster CPU 2. Both CPU 2 and the GPU version of barrier show a speedup of nearly 40% over CPU 1. Interestingly, the time spent in crossover dominates the PDHG runtime, even for a convergence tolerance as tight as 1e-6.

With results like these, it is hard to predict which algorithm and hardware choice is best for a given model, to say nothing of the effect of the termination tolerance on crossover. Put differently, it is safe to say there is no one-size-fits-all combination that is set to win! And we have not even factored in a range of other interesting phenomena: for example, although on these two models the barrier runtimes on CPU 2 and the GPU are largely comparable, we have seen that the factorization throughput cuDSS achieves can be much higher than the throughput we achieve on the CPU.

Our Research Continues

Whether you identify as an “optimization person” who wanted to learn more about your hardware options, or a “high-performance computing person” who wanted to learn more about optimization, we hope this article has given you a shared appreciation for the latest initiatives in the field of GPU-enabled optimization algorithms.

As was stated earlier, there is no silver bullet when it comes to hardware and algorithm selection for optimization. We still recommend treating every model as a unique creation that must be tested to see which combination is the best fit. The two examples shown earlier reflect the impact that these choices can have, given the currently available state-of-the-art hardware options.  

Because the hardware and related libraries (recall, we leveraged NVIDIA’s cuDSS and cuSPARSE libraries for our GPU-enabled algorithms) are continuously evolving, we expect our benchmark results to change rapidly, and many more use cases may benefit from barrier and PDHG on the GPU in the future.

At this point you may be wondering why we are only sharing results for models from the MIPLIB database when Gurobi has many real-world case studies. Our preliminary internal benchmarks show only 72 out of 2,633 models where PDHG beats our default settings (and only 17 of those show more than a factor-of-2 improvement). But by the nature of our collection, this set is highly biased towards LPs that we can already solve with either simplex or interior-point methods. Why does our test set not include giant LPs that we cannot solve? This is more of a philosophical question: why would someone build a giant LP if they expect that current solver technology cannot handle it?

This is an active research area, and the perception of what counts as “giant” is always shifting. With the new GPU-enabled algorithms, our initial impression is that they will offer the most benefit to models so large that people have not yet dared to build them. Of course, we are also exploring models that may not be “giant” but have a level of complexity and compute demand that warrants the new algorithms and GPU hardware.

Interested in qualifying your model for GPU-accelerated optimization? Please reach out through How do I get support for using Gurobi? 
