
Extracting text from unstructured documents is a classic developer headache. For decades, traditional Optical Character Recognition (OCR) systems have struggled with complex layouts, often turning multi-column PDFs, embedded images, and nested tables into an unreadable mess of plain text.
Today, the multimodal capabilities of large language models (LLMs) finally make reliable document understanding possible.
LlamaParse bridges the gap between traditional OCR and vision-language agentic parsing. It delivers state-of-the-art text extraction across PDFs, presentations, and images.
In this post, you will learn how to use Gemini to power LlamaParse, extract high-quality text and tables from unstructured documents, and build an intelligent personal finance assistant. As a reminder, Gemini models may make mistakes and should not be relied upon for professional advice.
Why LlamaParse?
In many cases, LLMs can already perform this task effectively. However, when working with large document collections or highly variable formats, consistency and reliability become harder to guarantee.
Dedicated tools like LlamaParse complement LLM capabilities by introducing preprocessing steps and customizable parsing instructions, which help structure complex elements such as large tables or dense text. In general parsing benchmarks, this approach has shown around a 13–15% improvement compared to processing raw documents directly.
The use case: parsing brokerage statements
Brokerage statements are a stress test for document parsing. They combine dense financial jargon, complex nested tables, and layouts that vary from institution to institution.
To help users understand their financial situation, you need a workflow that not only parses the file, but explicitly extracts the tables and explains the data through an LLM.
Because of these advanced reasoning and multimodal requirements, Gemini 3.1 Pro is the perfect fit as the underlying model. It combines a massive context window with native spatial layout comprehension.
The workflow operates in four stages:
- Ingest: You submit a PDF to the LlamaParse engine.
- Route: The engine parses the document and emits a ParsingDoneEvent.
- Extract: This event triggers two parallel tasks, text extraction and table extraction, that run concurrently to minimize latency.
- Synthesize: Once both extractions complete, Gemini generates a human-readable summary.
This two-model architecture is a deliberate design choice: Gemini 3.1 Pro handles the hard layout-comprehension during parsing, while Gemini 3 Flash handles the final summarization — optimizing for both accuracy and cost.
You can find the complete code for this tutorial in the LlamaParse x Gemini demo GitHub repository.
Setting up the environment
First, install the necessary Python packages for LlamaCloud, LlamaIndex workflows, and the Google GenAI SDK.
```shell
# with pip
pip install llama-cloud-services llama-index-workflows pandas google-genai

# with uv
uv add llama-cloud-services llama-index-workflows pandas google-genai
```
Next, export your API keys as environment variables. Get a Gemini API key from AI Studio, and a LlamaCloud API key from the console. Security Note: Never hardcode your API keys in your application source code.
```shell
export LLAMA_CLOUD_API_KEY="your_llama_cloud_key"
export GEMINI_API_KEY="your_google_api_key"
```
Step 1: Create and use the parser
The first step in your workflow is parsing. You create a LlamaParse client backed by Gemini 3.1 Pro and define it in resources.py so you can inject it into your workflow as a resource:
```python
def get_llama_parse() -> LlamaParse:
    return LlamaParse(
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
        parse_mode="parse_page_with_agent",
        model="gemini-3.1-pro",
        result_type=ResultType.MD,
    )
```
The parse_page_with_agent mode applies a layer of agentic iteration guided by Gemini to correct and format OCR results based on visual context.
In workflow.py, define the events, state, and the parsing step:
```python
class BrokerageStatementWorkflow(Workflow):
    @step
    async def parse_file(
        self,
        ev: FileEvent,
        ctx: Context[WorkflowState],
        parser: Annotated[LlamaParse, Resource(get_llama_parse)],
    ) -> ParsingDoneEvent | OutputEvent:
        result = cast(ParsingJobResult, await parser.aparse(file_path=ev.input_file))
        async with ctx.store.edit_state() as state:
            state.parsing_job_result = result
        return ParsingDoneEvent()
```
Notice that you do not process parsing results immediately. Instead, you store them in the global WorkflowState so they are available for the extraction steps that follow.
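The WorkflowState model itself can be a simple Pydantic class with optional slots for each intermediate result. A hypothetical sketch (any field beyond parsing_job_result is illustrative; the real model is in the repo's workflow.py):

```python
from typing import Any, Optional

from pydantic import BaseModel


class WorkflowState(BaseModel):
    # Filled in by parse_file; read later by the extraction steps.
    parsing_job_result: Optional[Any] = None
    # Illustrative slots for the downstream extraction results.
    extracted_text: Optional[str] = None
    extracted_tables: Optional[str] = None
```

Because every field defaults to None, the state can be created empty at workflow start and populated incrementally as each step completes.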
Step 2: Extract the text and tables
To provide the LLM with the context required to explain the financial statement, you need to extract the full markdown text and the tabular data. Add the extraction steps to your BrokerageStatementWorkflow class (see the full implementation in workflow.py):
```python
@step
async def extract_text(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState]
) -> TextExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.
    ...

@step
async def extract_tables(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState], ...
) -> TablesExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.
    ...
```
Because both steps listen for the same ParsingDoneEvent, LlamaIndex Workflows automatically executes them in parallel. This means your text and table extractions run concurrently — cutting overall pipeline latency and making the architecture naturally scalable as you add more extraction tasks.
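At its core, the table-extraction step needs to pull tabular blocks out of the parsed markdown. Here is one minimal, stdlib-only sketch of that idea, assuming the parser emits GitHub-style pipe tables (the repo's implementation is more robust and also loads the tables into pandas):

```python
def extract_markdown_tables(md: str) -> list[str]:
    """Return each contiguous run of pipe-table lines found in markdown text."""
    tables: list[str] = []
    current: list[str] = []
    for line in md.splitlines():
        if line.lstrip().startswith("|"):
            # Still inside a table block; keep accumulating rows.
            current.append(line)
        elif current:
            # A non-table line ends the current block.
            tables.append("\n".join(current))
            current = []
    if current:
        tables.append("\n".join(current))
    return tables
```

Isolating the tables this way means the summarization prompt can present them to the model as distinct, clearly delimited context rather than as text buried mid-document.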
Step 3: Generate the summary
With the data extracted, you can prompt Gemini 3 Flash to generate a summary in accessible, non-technical language.
Configure the LLM client and prompt template in resources.py. Here, you use Gemini 3 Flash for the final summarization, as it offers low latency and cost efficiency for text aggregation tasks.
The final synthesis step uses ctx.collect_events to wait for both extractions to complete before calling the Gemini API.
```python
@step
async def ask_llm(
    self,
    ev: TablesExtractionDoneEvent | TextExtractionDoneEvent,
    ctx: Context[WorkflowState],
    llm: Annotated[GenAIClient, Resource(get_llm)],
    template: Annotated[Template, Resource(get_prompt_template)],
) -> OutputEvent | None:
    if ctx.collect_events(ev, [TablesExtractionDoneEvent, TextExtractionDoneEvent]) is None:
        return None
    # Full prompt and LLM call available in repo.
    ...
```
Running the workflow
To tie it all together, the main.py entry point creates and runs the workflow:
```python
wf = BrokerageStatementWorkflow(timeout=600)
result = await wf.run(start_event=FileEvent(input_file=input_file))
```
To test the workflow, download a sample statement from the LlamaIndex datasets:
```shell
curl -L > brokerage_statement.pdf
```
```shell
# Using pip
python3 main.py brokerage_statement.pdf

# Using uv
uv run run-workflow brokerage_statement.pdf
```
You now have a fully functional personal finance assistant running in your terminal, capable of analyzing complex financial PDFs.
Next steps
AI pipelines are only as good as the data you feed them. By combining Gemini 3.1 Pro’s multimodal reasoning with LlamaParse’s agentic ingestion, you ensure your applications have the full, structured context they need — not just flattened text.
When you base your architecture on event-driven statefulness, like the parallel extractions demonstrated here, you build systems that are fast, scalable, and resilient. Double-check outputs before relying on them.
Ready to implement this in production? Explore LlamaParse and the Gemini API documentation to experiment with multimodal generation, and dive into the full code in the GitHub repository.
