The missing piece in the AI data stack
Over the past year, it has become clear that enterprises face pervasive gaps in ingesting and transforming data, across a variety of formats, in a manner suited for operation with and by LLMs. The gap exists because these data pipelines differ fundamentally from those in traditional analytics workflows, and it presents an opportunity for early-stage startups to build a critical part of the AI data stack.
AI data pipelines look very different, in terms of their inputs and outputs, from earlier pipelines that transformed data from one structured format to another. There are opportunities in defining best practices for unstructured inputs and vector outputs around lineage, drift, versioning, and more.
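To make the lineage/drift/versioning point concrete, here is a minimal sketch of what a vector output might carry alongside the embedding itself. The `embed_stub` function and all field names are hypothetical stand-ins, not any particular product's schema; a real pipeline would call an actual embedding model.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VectorRecord:
    """A vector output annotated with pipeline lineage metadata."""
    vector: list[float]
    source_id: str          # where the raw document came from (lineage)
    content_hash: str       # changes when the input changes (drift detection)
    embed_model: str        # which model produced the vector
    pipeline_version: str   # version of the transform code (reproducibility)


def embed_stub(text: str) -> list[float]:
    # Hypothetical embedder: returns a toy 2-d vector derived from the text.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def ingest(text: str, source_id: str) -> VectorRecord:
    # Every record keeps enough metadata to trace, re-run, or invalidate it.
    return VectorRecord(
        vector=embed_stub(text),
        source_id=source_id,
        content_hash=hashlib.sha256(text.encode()).hexdigest(),
        embed_model="stub-embedder-v0",   # assumption: placeholder model name
        pipeline_version="2024.1",        # assumption: placeholder version
    )
```

Because the content hash and model/pipeline versions travel with each vector, a store can detect when an input has drifted or when vectors were produced by a stale pipeline and need re-indexing.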
The multimodal capabilities of generative AI models directly expand which modalities can be included in a single pipeline and indexed together in a data store. We made an investment earlier this year in Objective following similar reasoning: LLMs enable newfound capabilities in search indexing across modalities (image, text, video, audio, etc.).
Using LLMs for both indexing and generation within data stores creates a number of challenges and opportunities. Composable transforms built from varying prompts, chain-of-thought reasoning, and a range of models unlock new capabilities, but they also require strong, extensible tooling to produce and evaluate the best outputs in an ETL pipeline.
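The composability idea can be sketched with a simple combinator that chains transforms so each stage's output feeds the next. The `summarize` and `normalize` steps below are hypothetical stand-ins for prompt-backed LLM calls; the point is the shape of the tooling, where individual stages can be swapped, reordered, or evaluated independently.

```python
from typing import Callable

# A transform takes text in and returns text out, so stages chain freely.
Transform = Callable[[str], str]


def compose(*steps: Transform) -> Transform:
    """Chain transforms left to right into a single pipeline stage."""
    def pipeline(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return pipeline


def normalize(text: str) -> str:
    # Stand-in for a cleanup step (could itself be prompt-driven).
    return text.strip().lower()


def summarize(text: str) -> str:
    # Stand-in for a summarization prompt: keep only the first sentence.
    return text.split(".")[0] + "."


# Stages compose into a pipeline; swapping a prompt or model means
# swapping one Transform without touching the rest.
clean_summary = compose(normalize, summarize)
```

Because every stage shares one interface, an evaluation harness can run the same input through variant pipelines (different prompts, different models) and score the outputs side by side.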
We believe that new offerings will emerge, both from cloud service providers and early-stage startups, to bridge this gap and enable any data to be operable/transformable for use with LLMs.