Ingestion and RAG over Millions of Documents with Gemini 2.0 Flash

By 来学习一下

2025-06-21

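In the era of large models, RAG (retrieval-augmented generation) pipelines that can efficiently process huge document collections are becoming a must-have for enterprises. In practice, high latency, high cost, and low throughput stop most teams. Google's recently released Gemini 2.0 Flash changes the economics: PDF-to-text conversion, parallel ingestion, and fast question answering are now within reach.

A traditional pipeline first converts each PDF page to an image and sends it for OCR, just to turn the raw text into usable HTML or Markdown. Next you carefully detect and rebuild every table, split the content into chunks for semantic retrieval, and finally insert them all into a vector database. The total cost is very high.

Google's Gemini 2.0 Flash can simplify this whole process.

It bundles OCR and chunking into a single step at a fraction of the cost. This article explores exactly that possibility. I will show how Gemini 2.0 Flash converts PDFs into chunked, Markdown-ready text in one pass, sparing you the redundant multi-step pipeline. We then store that data in a scalable vector database for fast vector search.

This guide covers how to:

Use Gemini 2.0 Flash to convert PDF pages directly into chunked text;
Store the chunks in a vector database for fast search;
Tie everything together in a RAG workflow.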

(Figure: model pricing at the time of writing.)

If you do not need bounding boxes from the original PDF, this approach is far simpler and far cheaper than older OCR pipelines.

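The Traditional PDF Ingestion Problem

Why is PDF ingestion so hard?

Complex layouts: multi-column text, footnotes, sidebars, images, or scanned forms.
Table extraction: traditional OCR tools often flatten tables into jumbled text.
High cost: using GPT-4o or other large LLMs gets expensive quickly, especially when you are processing millions of pages.
Multiple tools: you might run Tesseract for OCR, a layout model for table detection, a separate chunking strategy for RAG, and so on.

Many teams end up with a huge pipeline that is both brittle and expensive. The new approach is: "Just show the PDF page as an image to a multimodal LLM, prompt it to chunk, and watch the magic happen."

This is where Gemini 2.0 Flash comes in.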

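Why Gemini 2.0 Flash?

Cost: ~6,000 pages per dollar (using batch calls and minimal output tokens). That is easily 5-30x cheaper than many other solutions (GPT-4, specialized OCR vendors, and so on).

Accuracy: fidelity on standard text is surprisingly good. Most errors are minor structural differences, especially in tables.

The biggest missing piece is bounding-box data. If you need pixel-perfect overlays mapped back onto the PDF, Gemini's bounding-box generation is still far from accurate. But if your main concern is text-based retrieval or summarization, it is cheaper, faster, and easier.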
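To make the cost figure concrete, here is a small back-of-the-envelope sketch. It takes the ~6,000 pages per dollar figure quoted above at face value; the function name `estimate_cost_usd` and the page counts are illustrative assumptions, and real costs depend on image size, output length, and current pricing.

```python
def estimate_cost_usd(num_pages: int, pages_per_dollar: float = 6000.0) -> float:
    """Estimate ingestion cost from a pages-per-dollar throughput figure.

    The default of 6000 is the approximate figure quoted in this article,
    not an official price.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    if pages_per_dollar <= 0:
        raise ValueError("pages_per_dollar must be positive")
    return num_pages / pages_per_dollar


# Hypothetical corpus sizes, purely for illustration.
for pages in (10_000, 1_000_000, 5_000_000):
    print(f"{pages:>9,} pages -> ~${estimate_cost_usd(pages):,.2f}")
```

Under that assumption, one million pages comes out at roughly $167, which is why the per-page economics matter far more than any single call.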

End-to-End Architecture

Step-by-Step Code

1) Install dependencies and create a basic table

!apt-get update
!apt-get install -y poppler-utils
!pip install -q google-generativeai kdbai-client sentence-transformers pdf2image

import os
import kdbai_client as kdbai
from sentence_transformers import SentenceTransformer

# Start a session with the KDB.AI server
session = kdbai.Session(endpoint="http://localhost:8082")
db = session.database("default")
print("Connected to KDB.AI:", db)

You can sign up for the vector database. A free KDB.AI server is available here: https://trykdb.kx.com/kdbai/signup/

2) Create the vector table

# Define the KDB.AI table schema
VECTOR_DIM = 384  # we'll use all-MiniLM-L6-v2 for embeddings

schema = [
    {"name": "id", "type": "str"},
    {"name": "text", "type": "str"},
    {"name": "vectors", "type": "float32s"},
]

# Build a simple flat index using L2 distance
index = [
    {
        "name": "flat_index",
        "type": "flat",
        "column": "vectors",
        "params": {"dims": VECTOR_DIM, "metric": "L2"},
    }
]

table_name = "pdf_chunks"

# Drop any existing table with the same name so the notebook can be re-run
try:
    db.table(table_name).drop()
except kdbai.KDBAIException:
    pass

table = db.create_table(table_name, schema=schema, indexes=index)
print(f"Table '{table_name}' created.")
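For intuition about what a flat index with an L2 metric computes, the following is a minimal pure-Python sketch of brute-force nearest-neighbor search. It is a conceptual illustration only and makes no claim about how KDB.AI implements its index; the helper names `l2_distance` and `flat_search` and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, rows, n=3):
    """Compare the query against every stored vector and keep the n closest.

    rows is a list of (id, vector) pairs; the result is a list of
    (id, distance) pairs sorted from closest to farthest.
    """
    scored = [(row_id, l2_distance(query, vec)) for row_id, vec in rows]
    scored.sort(key=lambda item: item[1])
    return scored[:n]


# Toy 2-dimensional "embeddings"; real ones here would have 384 dimensions.
rows = [
    ("page_1_chunk_0", [0.0, 0.0]),
    ("page_1_chunk_1", [1.0, 1.0]),
    ("page_2_chunk_0", [3.0, 4.0]),
]
print(flat_search([0.9, 1.2], rows, n=2))
```

A flat index scans every row, so it is exact but its query time grows linearly with the number of chunks. That is a reasonable starting point for a single paper; for millions of pages an approximate index type is usually worth evaluating.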

3) Convert PDF pages to images

# Convert PDF to images
import requests
from pdf2image import convert_from_bytes
import base64
import io

pdf_url = "https://arxiv.org/pdf/2404.08865"  # example PDF
resp = requests.get(pdf_url)
resp.raise_for_status()  # fail early if the download did not succeed
pdf_data = resp.content

pages = convert_from_bytes(pdf_data)
print(f"Converted {len(pages)} PDF pages to images.")

# We'll encode the images as base64 for easy sending to Gemini
images_b64 = {}
for i, page in enumerate(pages, start=1):
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")
    image_data = buffer.getvalue()
    b64_str = base64.b64encode(image_data).decode("utf-8")
    images_b64[i] = b64_str

4) Call Gemini 2.0 Flash for OCR + chunking

# Configure Gemini & define chunking prompt
import google.generativeai as genai

GOOGLE_API_KEY = "YOUR_GOOGLE_API_KEY"
genai.configure(api_key=GOOGLE_API_KEY)

model = genai.GenerativeModel(model_name="gemini-2.0-flash")
print("Gemini model loaded:", model)

CHUNKING_PROMPT = """\
OCR the following page into Markdown. Tables should be formatted as HTML.
Do not surround your output with triple backticks.

Chunk the document into sections of roughly 250-1000 words.
Surround each chunk with <chunk> and </chunk> tags.
Preserve as much content as possible, including headings, tables, etc.
"""

5) Process each page with a single prompt

# OCR + chunking function
import re

def process_page(page_num, image_b64):
    # Build the message payload: the page image followed by the prompt
    payload = [
        {"inline_data": {"data": image_b64, "mime_type": "image/png"}},
        {"text": CHUNKING_PROMPT},
    ]

    try:
        resp = model.generate_content(payload)
        text_out = resp.text
    except Exception as e:
        print(f"Error processing page {page_num}: {e}")
        return []

    # Parse <chunk> blocks
    chunks = re.findall(r"<chunk>(.*?)</chunk>", text_out, re.DOTALL)
    if not chunks:
        # Fallback if the model doesn't produce chunk tags
        chunks = text_out.split("\n\n")

    results = []
    for idx, chunk_txt in enumerate(chunks):
        # Store ID and chunk text
        results.append({
            "id": f"page_{page_num}_chunk_{idx}",
            "text": chunk_txt.strip(),
        })
    return results

all_chunks = []
for i, b64_str in images_b64.items():
    page_chunks = process_page(i, b64_str)
    all_chunks.extend(page_chunks)

print(f"Total extracted chunks: {len(all_chunks)}")
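The parsing logic can be exercised without calling the model. The sketch below isolates it in a hypothetical helper, `parse_chunks`, and feeds it two made-up response strings: one with `<chunk>` tags and one without. As a small variation on the code above, it also skips chunks that are empty after stripping, which the blank-line fallback can otherwise produce.

```python
import re

CHUNK_RE = re.compile(r"<chunk>(.*?)</chunk>", re.DOTALL)


def parse_chunks(page_num, text_out):
    """Split model output into chunk records, falling back to blank-line splits."""
    chunks = CHUNK_RE.findall(text_out)
    if not chunks:
        # The model ignored the tagging instruction: split on blank lines instead.
        chunks = text_out.split("\n\n")

    results = []
    for chunk_txt in chunks:
        cleaned = chunk_txt.strip()
        if cleaned:  # drop empty pieces so they are never embedded
            results.append({
                "id": f"page_{page_num}_chunk_{len(results)}",
                "text": cleaned,
            })
    return results


# Made-up responses standing in for real model output.
tagged = "<chunk># Title\n\nIntro text.</chunk>\n<chunk>Second\nsection.</chunk>"
untagged = "First paragraph.\n\n\n\nSecond paragraph."

print(parse_chunks(1, tagged))
print(parse_chunks(2, untagged))
```

The non-greedy `(.*?)` together with `re.DOTALL` is what lets one chunk span several lines without swallowing the chunk that follows it.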

6) Embed the chunks and store them in the vector database

# Embedding & Insertion
import pandas as pd

embed_model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_texts = [ch["text"] for ch in all_chunks]
embeddings = embed_model.encode(chunk_texts)
embeddings = embeddings.astype("float32")

# One row per chunk: ID, text, and its embedding vector
row_list = []
for idx, ch_data in enumerate(all_chunks):
    row_list.append({
        "id": ch_data["id"],
        "text": ch_data["text"],
        "vectors": embeddings[idx].tolist(),
    })

df = pd.DataFrame(row_list)
table.insert(df)
print(f"Inserted {len(df)} chunks into '{table_name}'.")
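Building one DataFrame for the whole corpus works for a single paper, but with millions of pages it may be more practical to embed and insert in fixed-size slices so memory stays bounded. The following is a minimal sketch of that slicing; the helper name `batched` and the batch size are assumptions, and the comment marks where the `table.insert` call from the step above would go.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in rows; in the pipeline these would be the chunk records with vectors.
rows = [{"id": f"page_1_chunk_{i}", "text": f"chunk {i}"} for i in range(7)]

for batch in batched(rows, batch_size=3):
    # In the real pipeline this is where each slice would be inserted,
    # e.g. table.insert(pd.DataFrame(batch)).
    print(len(batch), [row["id"] for row in batch])
```

A suitable batch size depends on row width and server limits, so treat the value here as a placeholder to tune rather than a recommendation.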

7) Query and build the RAG flow

Similarity search

# Vector query for RAG
user_query = "How does this paper handle multi-column text?"
qvec = embed_model.encode(user_query).astype("float32")

# Retrieve the 3 nearest chunks from the flat index
search_results = table.search(vectors={"flat_index": [qvec]}, n=3)
retrieved_chunks = search_results[0]["text"].tolist()

context_for_llm = "\n\n".join(retrieved_chunks)
print("Retrieved chunks:\n", context_for_llm)

8) Final generation

# RAG generation
final_prompt = f"""Use the following context to answer the question:

Context:
{context_for_llm}

Question: {user_query}

Answer:
"""

resp = model.generate_content(final_prompt)
print("\n=== Gemini's final answer ===")
print(resp.text)
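When retrieved chunks are long, it can help to cap how much context goes into the prompt. The sketch below wraps the same prompt template in a hypothetical `build_rag_prompt` function with a character budget. The budget value is an arbitrary assumption, and characters are only a rough proxy for tokens.

```python
def build_rag_prompt(question, chunks, max_context_chars=8000):
    """Assemble the RAG prompt from chunks, most relevant first, within a budget.

    The first chunk is always included, even if it alone exceeds the budget,
    so the model never receives an empty context.
    """
    selected = []
    used = 0
    for chunk in chunks:
        cost = len(chunk) + 2  # +2 for the blank-line separator
        if selected and used + cost > max_context_chars:
            break
        selected.append(chunk)
        used += cost

    context = "\n\n".join(selected)
    return (
        "Use the following context to answer the question:\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:\n"
    )


# Made-up chunks standing in for real search results.
demo_chunks = ["a" * 10, "b" * 10, "c" * 10]
print(build_rag_prompt("What is covered?", demo_chunks, max_context_chars=25))
```

With a budget of 25 characters only the first two demo chunks fit, so the third is dropped. Because search results arrive ordered by distance, the least relevant chunks are the ones that get cut.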

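Final Thoughts

User feedback: real users have already replaced specialized OCR vendors with Gemini for PDF ingestion, saving both time and cost.
When bounding boxes matter: if you must track the exact position of every chunk on the PDF, you will need a hybrid approach.
Scalability: processing millions of pages? Make sure you batch your calls and limit tokens. That is how you reach the ~6,000 pages per dollar sweet spot. Single-page calls or large outputs cost more.
Simplicity: you can skip half a dozen microservices or GPU pipelines. For many teams, that alone is a huge relief.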
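On the scalability point, the page loop in step 5 runs one request at a time. The sketch below shows one way to run page processing concurrently with simple retries. Note that this is client-side threading, not the batch calls mentioned above, and it does not by itself change per-page pricing. The names `with_retries`, `process_pages_parallel`, and `fake_process_page` are hypothetical; the fake function stands in for a real model call so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Wrap fn so failures are retried with exponential backoff."""
    def wrapper(*args):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapper


def process_pages_parallel(pages, process_fn, max_workers=4):
    """Run process_fn(page_num, payload) for every page using a thread pool.

    pages maps page number to payload; the result maps page number to the
    value returned by process_fn, in the same order as the input.
    """
    safe_fn = with_retries(process_fn)
    page_nums = list(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda n: safe_fn(n, pages[n]), page_nums))
    return dict(zip(page_nums, results))


# Stand-in for a real OCR call: page 2 fails once, then succeeds on retry.
call_counts = {}

def fake_process_page(page_num, payload):
    call_counts[page_num] = call_counts.get(page_num, 0) + 1
    if page_num == 2 and call_counts[page_num] == 1:
        raise RuntimeError("simulated transient failure")
    return [f"page_{page_num}_chunk_0"]


out = process_pages_parallel({1: "img1", 2: "img2", 3: "img3"}, fake_process_page)
print(out)
```

Threads suit this workload because each call mostly waits on the network. The worker count should stay within whatever rate limits apply to your API key.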

This article was originally published by @来学习一下 on 人人都是产品经理. Reproduction without the author's permission is prohibited.