{"id":2406789,"date":"2023-12-01T19:58:00","date_gmt":"2023-12-02T00:58:00","guid":{"rendered":"https:\/\/platoaistream.net\/plato-data\/building-a-rag-pipeline-for-semi-structured-data-with-langchain\/"},"modified":"2023-12-01T19:58:00","modified_gmt":"2023-12-02T00:58:00","slug":"building-a-rag-pipeline-for-semi-structured-data-with-langchain","status":"publish","type":"station","link":"https:\/\/platoaistream.net\/plato-data\/building-a-rag-pipeline-for-semi-structured-data-with-langchain\/","title":{"rendered":"Building A RAG Pipeline for Semi-structured Data with Langchain"},"content":{"rendered":"

Introduction<\/h2>\n

Retrieval Augmented Generation (RAG) has been around for a while. Many tools have been built around the concept, including vector stores, retrieval frameworks, and LLMs, making it convenient to work with custom documents. Working with long, dense texts has never been easier. Conventional RAG<\/a> works well with unstructured, text-heavy files such as DOC and PDF documents. However, it does not handle semi-structured data well, such as tables embedded in PDFs.<\/p>\n

When working with semi-structured data, two concerns usually arise.<\/p>\n