
Scaling AI Workflows

AI / Agents / LLMs / Semantic Search

Over the past year or so I have been involved in a couple of projects creating high-throughput, low-latency AI workflows, integrating foundation and open-source LLMs, vector databases, and third-party data providers into complex data generation pipelines.

Below I describe some of the common challenges and solutions I have encountered while scaling AI workflows for various companies. If your company is building a product or service that relies on AI and you are facing scaling challenges, feel free to reach out to discuss how I can help.

Here is a high-level overview of the project areas I have been helping companies with:

  • Orchestration of AI workflows across multiple data pipelines and LLM providers and models: OpenAI, Google, Llama, self-hosted models from HuggingFace, etc.
  • Custom data processing and data generation pipelines.
    • Content generation.
    • Embeddings for semantic search (see the sketch after this list).
    • Raw web data extraction, cleaning, validation, and analytics.
  • High availability and scaling strategies for AI workloads.
  • Integration with existing systems: databases, internal and 3rd party APIs.
  • Cost optimization for high-volume AI workloads.
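
As an example of the embeddings work mentioned above, here is a minimal sketch that generates embeddings with OpenAI and upserts them into Pinecone (both appear in my toolset below). The index name, sample documents, and metadata fields are assumptions for illustration:

```python
# Sketch: generating embeddings and storing them for semantic search.
# The index name, sample documents, and metadata fields are hypothetical.
import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("documents")  # assumes a pre-created index

docs = {
    "doc-1": "Quarterly revenue grew 12% year over year.",
    "doc-2": "The new onboarding flow shipped to all users.",
}

# Embed all documents in a single batch request.
embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=list(docs.values()),
)

# Upsert (id, vector, metadata) triples; result order matches input order.
index.upsert(vectors=[
    (doc_id, item.embedding, {"text": text})
    for (doc_id, text), item in zip(docs.items(), embeddings.data)
])
```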

Need for scalability

With the current rush by companies to integrate AI into their products and services, scalability of implementations is often left until later phases. This can lead to significant rework and delays down the line as usage increases and systems struggle to keep up.

From my conversations with founders and engineering leaders, I have found a clear need for scalable AI infrastructure capable of sustaining high throughput and low latency. As data pipelines grow more complex and integrations across models, databases, cloud data sources, and APIs proliferate, complexity rises, quality suffers, and throughput declines.


Solutions

After working on a few projects in this space, I have developed expertise in setting up scalable AI workflows that meet high-throughput and low-latency requirements. This includes the use of several open-source and commercial tools and platforms, as well as efficiently priced services from a few companies.

It helps to categorize problems and solutions into a few key areas:

Model request routing and failover

Several use cases require routing requests to different models based on criteria such as cost, response latency, or model capabilities. This can be achieved either through custom routing logic or through existing third-party platforms that provide this functionality out of the box, for example OpenRouter, PortKey, or services like LiteLLM.
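
As an illustration, here is a minimal failover sketch using LiteLLM's unified completion API. The model names and priority order are assumptions for illustration; in practice the list is driven by cost, latency, or capability requirements:

```python
# Minimal failover sketch over a priority list of models via LiteLLM.
# Model names and their ordering are illustrative assumptions.
import litellm

# Models to try, in priority order (e.g. cheapest/fastest first).
MODEL_PRIORITY = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3-5-haiku-20241022",
    "huggingface/meta-llama/Meta-Llama-3-8B-Instruct",
]

def complete_with_failover(prompt: str) -> str:
    """Route the request down the priority list until one model succeeds."""
    last_error = None
    for model in MODEL_PRIORITY:
        try:
            response = litellm.completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,  # fail fast so the next model gets a chance
            )
            return response.choices[0].message.content
        except Exception as exc:  # rate limits, timeouts, provider outages
            last_error = exc
    raise RuntimeError("All models in the priority list failed") from last_error
```

Hosted routers such as OpenRouter implement the same idea server-side, which keeps the retry and priority logic out of your application code.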

Scaling LLM hosting

While major foundational model providers such as OpenAI, Anthropic, and Cohere offer robust APIs, there are scenarios where self-hosting models is preferable due to cost, data privacy, or customization needs. In such cases, platforms like DeepInfra provide managed hosting for open source models with built-in scalability features. Alternatively, setting up custom hosting solutions using Kubernetes, Vertex AI (GCP) or Amazon Bedrock (AWS) can also be effective, though they require more hands-on management.
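
A practical detail worth noting: many managed hosts, DeepInfra included, expose OpenAI-compatible endpoints, so application code can switch between a commercial API and a hosted open-source model with little more than a base URL change. A minimal sketch, where the base URL and model name are assumptions (check the provider's docs for exact values):

```python
# Sketch: calling a managed open-source model host through an
# OpenAI-compatible endpoint, keeping application code provider-agnostic.
# Base URL and model name are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # provider's OpenAI-compatible endpoint
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # hosted open-source model
    messages=[{"role": "user", "content": "Summarize the benefits of self-hosting LLMs."}],
)
print(response.choices[0].message.content)
```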

Orchestration and workflow management

As the number and complexity of AI workflows increase, orchestration tools become essential. Platforms like Prefect, Airflow, or Dagster can help manage complex data pipelines, schedule tasks, and monitor workflow execution.

Airflow

Airflow is a mature, battle-tested job control and orchestration platform. It can be complex to set up and manage out of the box, but it offers extensive capabilities for building and managing sophisticated parallelized workflows, making it well suited to orchestrating large-scale workflows.
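
For a sense of what this looks like in practice, here is a minimal DAG sketch using Airflow's TaskFlow API with dynamic task mapping to fan work out in parallel. The task names, schedule, and stubbed logic are hypothetical:

```python
# Minimal Airflow DAG using the TaskFlow API. Task names, schedule,
# and the fan-out pattern are illustrative, not from a real pipeline.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def content_pipeline():
    @task
    def extract() -> list[str]:
        # Pull raw record IDs from a source system (stubbed here).
        return ["doc-1", "doc-2", "doc-3"]

    @task
    def enrich(doc_id: str) -> str:
        # Call an LLM or embedding service per document (stubbed here).
        return f"{doc_id}:enriched"

    @task
    def load(results: list[str]) -> None:
        # Write the enriched records to a database or vector store.
        print(f"loaded {len(results)} records")

    # expand() fans out one enrich task instance per extracted document,
    # so the enrichment step runs in parallel.
    load(enrich.expand(doc_id=extract()))

content_pipeline()
```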

To speed up setup and simplify management, I have used services such as GCP Composer and Astronomer to host and manage Airflow instances. With those services, a new Airflow cluster can be set up in a matter of hours, and workflows are typically deployed and running within a day.

Other tools and platforms I work with

  • HuggingFace
  • Pinecone
  • LangChain
  • CrewAI

Contact Me

If you are ready to start a conversation, feel free to reach out via the contact page or directly via LinkedIn.