DeepSeek OCR: Guide
DeepSeek-OCR offers a novel approach to processing long texts. Instead of recognizing text directly, the system compresses the visual information of document pages so that downstream Large Language Models (LLMs) can process them more efficiently. This article explores the functionality, installation, and practical implications of this model.
Introduction to DeepSeek-OCR
DeepSeek-OCR visually compresses text content. Document pages are treated as images, condensed into a few vision tokens, and then reconstructed into text or Markdown. The team reports a seven- to twenty-fold reduction in tokens and, at moderate compression, up to about 97 percent precision, depending on the compression level. Official code, scripts, and a vLLM integration are available.
DeepSeek-OCR is not a classic Tesseract replacement. It is a vision-language system consisting of two parts: an encoder (DeepEncoder) generates compact vision tokens, and an approximately 3-billion-parameter MoE decoder reconstructs text or Markdown from them. The goal is not so much pure character recognition as context compression for downstream LLM workflows. The Model Card describes validated environments (Python 3.12.9, CUDA 11.8, Torch 2.6.0, Flash-Attention 2.7.3) and shows prompts such as "<image>\n<|grounding|>Convert the document to markdown.".
Installation and Usage
Using DeepSeek-OCR requires specific prerequisites and precise installation.
Check prerequisites
An NVIDIA GPU with a current driver, CUDA 11.8, and Python 3.12.9 are required. The tested package versions include, among others, torch==2.6.0, transformers==4.46.3, tokenizers==0.20.3, and flash-attn==2.7.3. The GitHub README notes the same stack; vLLM support is official.
Get the source code
The source code is cloned with git clone https://github.com/deepseek-ai/DeepSeek-OCR.git. Then change into the newly created folder.
Create environment
A Conda environment is created and activated with conda create -n deepseek-ocr python=3.12.9 -y; conda activate deepseek-ocr.
Install packages (Transformers path)
The necessary packages are installed using the following commands:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.46.3 tokenizers==0.20.3 einops addict easydict
pip install flash-attn==2.7.3 --no-build-isolation
Details and tested combinations can be found in the Model Card.
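A quick sanity check of the environment (a minimal sketch, not part of the official docs) confirms that the pinned versions and the GPU stack are actually in use:
import torch, transformers, flash_attn
# Expect torch 2.6.0 built for CUDA 11.8, transformers 4.46.3, flash-attn 2.7.3, and a visible GPU.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(transformers.__version__, flash_attn.__version__)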
Infer first image (Transformers)
To infer an image using the Transformers library, proceed as follows in Python:
import torch
from transformers import AutoModel, AutoTokenizer
# ...
model = AutoModel.from_pretrained('deepseek-ai/DeepSeek-OCR', _attn_implementation='flash_attention_2', trust_remote_code=True).eval().cuda().to(torch.bfloat16)
An example prompt is "<image>\n<|grounding|>Convert the document to markdown.". After setting the prompt and the image path, model.infer(...) is called. The complete snippet is available in the Model Card.
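The infer call itself then looks roughly like this (a sketch following the Model Card example; the file paths are placeholders and the keyword arguments may change between releases):
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-OCR', trust_remote_code=True)
# Prompt for Markdown conversion; "<image>\nFree OCR." would return plain text instead.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'page_1.jpg'   # placeholder input image
output_path = 'output/'     # placeholder output directory
# Writes the reconstructed Markdown/text to output_path; the settings correspond to the
# dynamic "Gundam" mode (base_size=1024, image_size=640, crop_mode=True).
result = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                     output_path=output_path, base_size=1024, image_size=640,
                     crop_mode=True, save_results=True)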
vLLM Serving for Throughput (optional, officially supported)
vLLM can be used for higher throughput:
uv venv; source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Then an LLM(model="deepseek-ai/DeepSeek-OCR") instance is created in Python with vLLM, images are passed as PIL images, and output is generated with SamplingParams. Code examples can be found in the README and the Model Card. The repository also contains ready-made scripts for image and PDF inference under vLLM; as a guideline, the README cites "~2500 tokens/s" on an A100-40G.
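As a rough sketch of vLLM's generic multimodal API (the exact options and prompt handling for DeepSeek-OCR are defined by the repo's vLLM scripts, so treat this as an assumption and prefer those scripts in production):
from vllm import LLM, SamplingParams
from PIL import Image

# Load the model via vLLM; trust_remote_code may be unnecessary with upstream support.
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)

# Document page as a PIL image, same prompt as in the Transformers path.
image = Image.open("page_1.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."
sampling = SamplingParams(temperature=0.0, max_tokens=4096)

# vLLM's multimodal interface: the image is passed alongside the prompt.
outputs = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": image}}], sampling)
print(outputs[0].outputs[0].text)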
Select Prompts and Modes
The prompt "<image>\n<|grounding|>Convert the document to markdown." is used for documents. For pure OCR without layout, use "<image>\nFree OCR.".
Supported image sizes include "Tiny/Small/Base/Large" as well as a dynamic "Gundam" mode. Details can be found in the README and the Model Card.
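For orientation, the modes map roughly to the following infer() settings (an assumption pieced together from the README and Model Card; verify the values against the current documentation):
# Assumed mapping of the named modes to infer() parameters (verify against the README).
MODES = {
    'Tiny':   dict(base_size=512,  image_size=512,  crop_mode=False),
    'Small':  dict(base_size=640,  image_size=640,  crop_mode=False),
    'Base':   dict(base_size=1024, image_size=1024, crop_mode=False),
    'Large':  dict(base_size=1280, image_size=1280, crop_mode=False),
    'Gundam': dict(base_size=1024, image_size=640,  crop_mode=True),   # dynamic tiling
}
result = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                     output_path=output_path, save_results=True, **MODES['Base'])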
Process PDFs
PDFs can be processed with the scripts included in the repo, which show where input and output paths are set.
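Independent of the repo scripts, the general pattern is to rasterize each page and run per-page inference. A minimal sketch (assuming pdf2image and poppler are installed, and reusing model, tokenizer, prompt, and output_path from the Transformers example; the repo's own vLLM PDF script is the faster path):
from pdf2image import convert_from_path   # assumption: pdf2image + poppler installed

# Rasterize each page of a (placeholder) PDF and run inference per image.
pages = convert_from_path('report.pdf', dpi=200)
for i, page in enumerate(pages):
    image_file = f'page_{i:03d}.png'
    page.save(image_file)
    model.infer(tokenizer, prompt=prompt, image_file=image_file,
                output_path=f'{output_path}/page_{i:03d}/', base_size=1024,
                image_size=640, crop_mode=True, save_results=True)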
Check Result
The output is in Markdown or text. Tables and figures can be reproduced as structured text. Quality and speed depend on compression level, resolution, and GPU.
Troubleshooting
If building flash-attn fails, the --no-build-isolation option shown above helps; further hints can be found in the GitHub Discussions.
Chronology and Status
The initial release took place on October 20, 2025 in the Repo; vLLM support has also been integrated upstream into vLLM since October 23, 2025. The Paper was submitted to arXiv on October 21, 2025. Media outlets classify the approach as "vision-text compression".
Source: YouTube
Analysis and Evaluation
DeepSeek-OCR aims to reduce the cost and latency in LLM workflows by visually compressing long contexts.
Motives, Context, Interests
The approach is motivated by the high cost of long contexts. Compressing pages as images into a few vision tokens significantly reduces the token budget for downstream models. The official vLLM integration targets high throughput in production pipelines. Tech media emphasize the potential cost and latency gains but warn of hardware- and data-dependent variance.

Source: pxz.ai
DeepSeek OCR uses context compression to significantly increase efficiency compared to traditional vision LLMs and reduce token costs.
Fact Check: Evidence vs. Claims
Substantiated
The architecture (DeepEncoder + 3B MoE decoder), the reported precision values for <10x and 20x compression, and the objective of "context compression" are confirmed in the Paper. Installation steps, scripts, and example prompts can be found in the README and in the Model Card; vLLM support is documented there.
Unclear
Generic “X times faster” statements without identical hardware or data context are not transferable. Real throughput depends heavily on GPU, resolution, prompt, and batch size.
False/Misleading
DeepSeek-OCR is not "just a faster OCR". The core purpose is visual compression for LLM workflows. For pure, simple text recognition, classic OCR (e.g., Tesseract) may still be useful.

Source: freedeepseekocr.com
The DeepSeek-OCR demo interface allows easy uploading of documents and selecting different model sizes for processing.
Reactions & Counterpositions
Tech reports highlight the 7–20x token saving. Skeptical voices ask about robustness across layouts and languages, as well as quality loss under strong compression. Developers document setups and hurdles on specific hardware. Community posts report very fast PDF-to-Markdown processing under vLLM, but these are anecdotal.
Impacts & What it means for you
Practical benefit: Anyone bringing long PDFs, tables, forms, or reports into LLM pipelines can use DeepSeek-OCR to reduce costs and latency, provided the reconstruction remains precise enough. For fast serving, the vLLM path is worthwhile; for minimal setups, Transformers inference is sufficient. For simple, "clean" scans without layout demands, Tesseract may be more efficient.
Tips for classification: Primary sources first (Paper, README, Model Card), then your own measurements on the hardware; compare variants of prompt, resolution, and compression level.
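A hypothetical comparison loop (reusing model, tokenizer, and the MODES mapping sketched earlier; the timings only become meaningful together with a manual quality check of the outputs):
import time

# Time the same page under both prompts and two modes; review output quality by hand.
for prompt_variant in ["<image>\n<|grounding|>Convert the document to markdown.",
                       "<image>\nFree OCR."]:
    for mode_name in ['Base', 'Gundam']:
        start = time.perf_counter()
        model.infer(tokenizer, prompt=prompt_variant, image_file='sample_page.png',
                    output_path=f'bench/{mode_name}/', save_results=True,
                    **MODES[mode_name])
        print(mode_name, prompt_variant[:24], f"{time.perf_counter() - start:.1f}s")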
Source: YouTube
Open Questions
How stable are the trade-offs across languages, handwriting, scans, and fine table structures? Independent benchmarks and replication studies are still pending. How is official CPU/MPS support developing beyond community workarounds? Discussions exist, but without hard guarantees. How robust is PDF throughput under real production loads and away from A100 hardware? The README mentions examples, but no generally valid SLA values.

Source: chattools.cn
Detailed diagrams illustrate the impressive compression and performance metrics of DeepSeek OCR, underlining its efficiency.
Summary and Recommendations
To use DeepSeek-OCR effectively, the environment should be set up exactly as described in the Model Card or in the README. Start with the Transformers example and switch to vLLM for higher throughput. Adjust prompts and modes to the respective documents and weigh quality against the compression level. For pure, simple OCR cases, classic OCR remains a lean option; for long, complex documents, visual context compression plays to its strengths.