Pipeshift Reduces GPU Usage for AI Inference by 75% with Modular Inference Engine

DeepSeek's R1, released this week, marked a turning point for AI. Few would have expected a Chinese startup to be the first to ship a reasoning model that matches OpenAI's o1 and to release it as open source (in line with OpenAI's original mission).

Companies can easily download the R1 weights via Hugging Face, but access has never been the issue – more than 80% of teams already use open models or plan to. The real culprit is deployment. If you choose a hyperscaler service like Vertex AI, you are locked into a specific cloud; if you go it alone and build in-house, you run into resource constraints, since you need to set up a dozen different components just to get started, never mind optimizing or scaling them downstream.

To address this challenge, Y Combinator- and SenseAI-backed Pipeshift is launching an end-to-end platform that lets enterprises train, deploy and scale open-source generative AI models – LLMs, vision models, audio models and image models – on any cloud or on-premises GPUs. The company competes in a fast-growing category that includes Baseten, Domino Data Lab, Together AI and Simplismart.

The most important value proposition? Pipeshift uses a modular inference engine that can be quickly optimized for speed and efficiency. This allows teams to not only deploy 30x faster, but also achieve more with the same infrastructure, resulting in cost savings of up to 60%.

Imagine using just one GPU to handle inference that would otherwise require four.

The orchestration bottleneck

If you need to work with different models, stitching together a functional MLOps stack in-house – from accessing compute, training and fine-tuning all the way to production-grade deployment and monitoring – becomes a problem. You have to set up 10 different inference components and instances just to get things running, and then sink thousands of engineering hours into even the smallest optimizations.

“An inference engine is made up of multiple components,” Arko Chattopadhyay, co-founder and CEO of Pipeshift, told VentureBeat. “Each combination of these components creates its own engine with different performance for the same workload. Determining the optimal combination to maximize ROI requires weeks of repeated experimentation and fine-tuning of settings. In most cases, it can take years for internal teams to develop pipelines that enable infrastructure flexibility and modularization, which pushes companies back in the market while amassing massive technology debt.”

While there are startups offering platforms for deploying open models in cloud or on-premises environments, most of them are GPU brokers offering one-size-fits-all inference solutions, according to Chattopadhyay. As a result, they manage separate GPU instances for different LLMs, which does not help teams that want to save costs and optimize performance.

To address this problem, Chattopadhyay started Pipeshift and developed a framework called Modular Architecture for GPU-based Inference Clusters (MAGIC), which breaks the inference stack down into plug-and-play parts. The result is a Lego-like system that lets teams configure the right inference stack for their workloads without the hassle of infrastructure engineering.

This allows a team to quickly add or swap out different inference components to assemble a custom inference engine that can squeeze more out of existing infrastructure to meet cost, throughput, or even scalability expectations.
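Pipeshift has not published MAGIC's internal interfaces, but the idea maps onto a familiar software pattern: put each inference component behind a small interface and compose an engine from whichever implementations suit the workload. The sketch below is a minimal, hypothetical illustration of that composition pattern in Python; the names (SimpleBatcher, StubRuntime, InferenceEngine) are invented for this example and are not Pipeshift's APIs.

```python
# Hypothetical sketch of a "Lego-like" modular inference stack.
# All names are illustrative; Pipeshift has not disclosed MAGIC's interfaces.
from dataclasses import dataclass
from typing import Protocol


class Batcher(Protocol):
    def schedule(self, requests: list[str]) -> list[list[str]]: ...


class Runtime(Protocol):
    def run(self, batch: list[str]) -> list[str]: ...


@dataclass
class SimpleBatcher:
    max_batch_size: int = 16

    def schedule(self, requests: list[str]) -> list[list[str]]:
        # Chunk incoming requests into fixed-size batches for the GPU.
        return [requests[i:i + self.max_batch_size]
                for i in range(0, len(requests), self.max_batch_size)]


@dataclass
class StubRuntime:
    model_id: str

    def run(self, batch: list[str]) -> list[str]:
        # Stand-in for the actual GPU forward pass.
        return [f"[{self.model_id}] response to: {prompt}" for prompt in batch]


@dataclass
class InferenceEngine:
    batcher: Batcher
    runtime: Runtime

    def serve(self, requests: list[str]) -> list[str]:
        return [output
                for batch in self.batcher.schedule(requests)
                for output in self.runtime.run(batch)]


# Swapping one component (a different batcher, runtime, cache manager, etc.)
# yields a differently tuned engine without rebuilding the rest of the stack.
engine = InferenceEngine(batcher=SimpleBatcher(max_batch_size=8),
                         runtime=StubRuntime("llama-3.1-8b"))
print(engine.serve(["summarize this support ticket", "classify this document"]))
```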

For example, a team could set up a unified inference system where multiple hot-swapped domain-specific LLMs could run on a single GPU and make optimal use of it.
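Pipeshift has not detailed how this hot-swapping works under the hood; one common way to get several domain-specific models onto a single GPU is to serve lightweight fine-tuned adapters over a shared base model. The sketch below shows that general pattern using vLLM's multi-LoRA support as a stand-in – the model name, adapter names and paths are placeholders, and this is not Pipeshift's implementation.

```python
# Serving several domain-specific fine-tunes on one GPU by sharing base weights
# and swapping LoRA adapters per request (vLLM used as a generic stand-in;
# model and adapter paths are placeholders, not Pipeshift's stack).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # base weights loaded once
    enable_lora=True,
    max_loras=4,  # keep up to four adapters resident alongside the base model
)
params = SamplingParams(max_tokens=128)

# Each request is routed to a different fine-tuned adapter over the same base.
support_out = llm.generate(
    ["A customer reports a failed refund on order #1234."],
    params,
    lora_request=LoRARequest("support-adapter", 1, "/adapters/support"),
)
docs_out = llm.generate(
    ["Extract the invoice total from the following text: ..."],
    params,
    lora_request=LoRARequest("docs-adapter", 2, "/adapters/docs"),
)
```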

Run four GPU workloads on one

Since claiming to offer a modular inference solution is one thing and implementing that solution is quite another, Pipeshift’s founder was quick to point out the benefits of the company’s offering.

“In terms of operational costs… MAGIC allows you to run LLMs like Llama 3.1 8B at >500 tokens/sec. on a given set of Nvidia GPUs without model quantization or compression,” he said. “This results in a massive reduction in scaling costs as the GPUs can now handle workloads on the order of 20 to 30 times what they were originally able to achieve with the cloud providers’ native platforms.”
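Throughput claims like this are straightforward to sanity-check. Assuming the model sits behind an OpenAI-compatible endpoint, which many inference stacks expose (the URL, model name and prompt below are placeholders), a rough single-stream measurement looks like the following; note it understates aggregate server throughput, which also benefits from batching concurrent requests.

```python
# Rough single-request tokens/sec check against an OpenAI-compatible endpoint.
# URL, API key and model name are placeholders; batched, concurrent load would
# be needed to approach the aggregate throughput figures quoted above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a 300-word product summary."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.0f} tokens/sec")
```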

The CEO noted that the company is already working with 30 companies on an annual license-based model.

One is a Fortune 500 retailer that initially leveraged four independent GPU instances to run four open, fine-tuned models for its automated support and document processing workflows. Each of these GPU clusters scaled independently, resulting in huge cost increases.

“Large-scale fine-tuning was not possible as datasets grew larger, and all pipelines only supported single-GPU workloads while requiring all the data to be uploaded at once. In addition, there was no autoscaling support in tools like AWS SageMaker, making it difficult to ensure optimal utilization of the infrastructure and forcing the company to pre-approve quotas and reserve capacity in advance for theoretical peaks that were reached only 5% of the time,” Chattopadhyay remarked.

Interestingly, after switching to Pipeshift’s modular architecture, all four fine-tuned models were consolidated onto a single GPU instance that serves them in parallel, without memory partitioning or model degradation. That cut the requirement for these workloads from four GPUs to just one.
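This consolidation is also where the headline figure comes from – going from four GPU instances to one is a 75% reduction in GPUs for those workloads:

```python
# Back-of-the-envelope check: consolidating four GPU instances onto one.
gpus_before, gpus_after = 4, 1
print(f"{(gpus_before - gpus_after) / gpus_before:.0%} fewer GPUs")  # 75%
```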

“Without additional optimizations, we were able to scale the GPU’s capabilities to the point where it delivered tokens five times faster for inference and could handle four times the scale,” the CEO added. Overall, he said, the company has seen 30x faster deployment times and a 60% reduction in infrastructure costs.

With its modular architecture, Pipeshift aims to position itself as the platform of choice for deploying all cutting-edge open-source AI models, including DeepSeek R1.

However, it will not be an easy path as competitors are constantly evolving their offerings.

Simplismart, for example, which raised $7 million a few months ago, takes a similar software-optimized approach to inference. Cloud service providers such as Google Cloud and Microsoft Azure are also strengthening their respective offerings, although Chattopadhyay expects these CSPs to be partners rather than competitors in the long run.

“We are a platform for delivering tools and orchestrating AI workloads, just as Databricks was for data intelligence,” he explained. “In most scenarios, most cloud service providers become growth-stage GTM partners because their customers can benefit from Pipeshift on their AWS/GCP/Azure clouds.”

In the coming months, Pipeshift will introduce model evaluation and testing as well as tools to help teams build and scale their datasets. This exponentially accelerates the experimentation and data preparation cycle, allowing customers to use orchestration more efficiently.


