DeepSeek OCR: Guide
DeepSeek-OCR offers a novel approach to processing long texts. Instead of recognizing text directly, the system compresses the visual information of document pages so that downstream Large Language Models (LLMs) can process them more efficiently. This article explores the functionality, installation, and practical implications of this model.
Introduction to DeepSeek-OCR
DeepSeek-OCR visually compresses text content. Document pages are treated as images, condensed into a few vision tokens, and then reconstructed into text or Markdown. The team reports a seven- to twenty-fold reduction in tokens and, at moderate compression, up to about 97 percent precision, depending on the compression level. Official code, scripts, and a vLLM integration are available.
DeepSeek-OCR is not a classic Tesseract replacement. It is a vision-language system consisting of two parts: an encoder (DeepEncoder) generates compact vision tokens, and an approximately 3-billion-parameter MoE decoder reconstructs text or Markdown from them. The goal is not so much pure character recognition as context compression for downstream LLM workflows. The Model Card describes validated environments (Python 3.12.9, CUDA 11.8, Torch 2.6.0, Flash-Attention 2.7.3) and shows prompts such as "<image>\n<|grounding|>Convert the document to markdown.".
Installation and Usage
Using DeepSeek-OCR requires specific prerequisites and precise installation.
Check prerequisites
An NVIDIA GPU with a current driver, CUDA 11.8, and Python 3.12.9 are required. The tested package versions include, among others, torch==2.6.0, transformers==4.46.3, tokenizers==0.20.3, and flash-attn==2.7.3. The GitHub README notes the same stack; vLLM support is official.
Get the source code
The source code is cloned with git clone https://github.com/deepseek-ai/DeepSeek-OCR.git. Then change into the newly created folder.
Create environment
A Conda environment is created and activated with conda create -n deepseek-ocr python=3.12.9 -y; conda activate deepseek-ocr.
Install packages (Transformers path)
The necessary packages are installed using the following commands:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.46.3 tokenizers==0.20.3 einops addict easydict
pip install flash-attn==2.7.3 --no-build-isolation
Details and tested combinations can be found in the Model Card.
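A quick sanity check of the environment (a minimal sketch, not part of the official docs) confirms that the pinned versions and the GPU stack are actually in use:
import torch, transformers, flash_attn
# Expect torch 2.6.0 built for CUDA 11.8, transformers 4.46.3, flash-attn 2.7.3, and a visible GPU.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(transformers.__version__, flash_attn.__version__)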
Infer first image (Transformers)
To infer an image using the Transformers library, proceed as follows in Python:
import torch
from transformers import AutoModel, AutoTokenizer
# ...
model = AutoModel.from_pretrained('deepseek-ai/DeepSeek-OCR', _attn_implementation='flash_attention_2', trust_remote_code=True).eval().cuda().to(torch.bfloat16)
An example prompt is "<image>\n<|grounding|>Convert the document to markdown.". After setting the prompt and the image path, model.infer(...) is called. The complete snippet is available in the Model Card.
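The infer call itself then looks roughly like this (a sketch following the Model Card example; the file paths are placeholders and the keyword arguments may change between releases):
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-OCR', trust_remote_code=True)
# Prompt for Markdown conversion; "<image>\nFree OCR." would return plain text instead.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'page_1.jpg'   # placeholder input image
output_path = 'output/'     # placeholder output directory
# Writes the reconstructed Markdown/text to output_path; the settings correspond to the
# dynamic "Gundam" mode (base_size=1024, image_size=640, crop_mode=True).
result = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                     output_path=output_path, base_size=1024, image_size=640,
                     crop_mode=True, save_results=True)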
vLLM Serving for Throughput (optional, officially supported)
vLLM can be used for higher throughput:
uv venv; source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Then an LLM(model="deepseek-ai/DeepSeek-OCR") instance is created in Python with vLLM, images are passed as PIL images, and output is generated with SamplingParams. Code examples can be found in the README and the Model Card. The repository also contains ready-made scripts for image and PDF inference under vLLM; as a guideline, the README cites "~2500 tokens/s" on an A100-40G.
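As a rough sketch of vLLM's generic multimodal API (the exact options and prompt handling for DeepSeek-OCR are defined by the repo's vLLM scripts, so treat this as an assumption and prefer those scripts in production):
from vllm import LLM, SamplingParams
from PIL import Image

# Load the model via vLLM; trust_remote_code may be unnecessary with upstream support.
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)

# Document page as a PIL image, same prompt as in the Transformers path.
image = Image.open("page_1.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."
sampling = SamplingParams(temperature=0.0, max_tokens=4096)

# vLLM's multimodal interface: the image is passed alongside the prompt.
outputs = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": image}}], sampling)
print(outputs[0].outputs[0].text)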
Select Prompts and Modes
The prompt "<image>\n<|grounding|>Convert the document to markdown." is used for documents. For pure OCR without layout, use "<image>\nFree OCR.".
Supported image sizes include "Tiny/Small/Base/Large" as well as a dynamic "Gundam" mode. Details can be found in the README and the Model Card.
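For orientation, the modes map roughly to the following infer() settings (an assumption pieced together from the README and Model Card; verify the values against the current documentation):
# Assumed mapping of the named modes to infer() parameters (verify against the README).
MODES = {
    'Tiny':   dict(base_size=512,  image_size=512,  crop_mode=False),
    'Small':  dict(base_size=640,  image_size=640,  crop_mode=False),
    'Base':   dict(base_size=1024, image_size=1024, crop_mode=False),
    'Large':  dict(base_size=1280, image_size=1280, crop_mode=False),
    'Gundam': dict(base_size=1024, image_size=640,  crop_mode=True),   # dynamic tiling
}
result = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                     output_path=output_path, save_results=True, **MODES['Base'])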
Process PDFs
PDFs can be processed with the scripts included in the repo, which show where input and output paths are set.
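Independent of the repo scripts, the general pattern is to rasterize each page and run per-page inference. A minimal sketch (assuming pdf2image and poppler are installed, and reusing model, tokenizer, prompt, and output_path from the Transformers example; the repo's own vLLM PDF script is the faster path):
from pdf2image import convert_from_path   # assumption: pdf2image + poppler installed

# Rasterize each page of a (placeholder) PDF and run inference per image.
pages = convert_from_path('report.pdf', dpi=200)
for i, page in enumerate(pages):
    image_file = f'page_{i:03d}.png'
    page.save(image_file)
    model.infer(tokenizer, prompt=prompt, image_file=image_file,
                output_path=f'{output_path}/page_{i:03d}/', base_size=1024,
                image_size=640, crop_mode=True, save_results=True)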
Check Result
The output is in Markdown or text. Tables and figures can be reproduced as structured text. Quality and speed depend on compression level, resolution, and GPU.
Troubleshooting
If building flash-attn fails, the --no-build-isolation option shown above helps; further hints can be found in the GitHub Discussions.
Chronology and Status
The initial release took place on October 20, 2025 in the Repo; vLLM support has also been integrated upstream into vLLM since October 23, 2025. The Paper was submitted to arXiv on October 21, 2025. Media outlets classify the approach as "vision-text compression".
Source: YouTube
Analysis and Evaluation
DeepSeek-OCR aims to reduce the cost and latency in LLM workflows by visually compressing long contexts.
Motives, Context, Interests
The approach is motivated by the high cost of long contexts. Compressing pages as images into a few vision tokens significantly reduces the token budget for downstream models. The official vLLM integration targets high throughput in production pipelines. Tech media emphasize the potential cost and latency gains but warn of hardware- and data-dependent variance.

Source: pxz.ai
DeepSeek OCR uses context compression to significantly increase efficiency compared to traditional vision LLMs and reduce token costs.
Fact Check: Evidence vs. Claims
Substantiated
The architecture (DeepEncoder + 3B MoE decoder), the reported precision values for <10x and 20x compression, and the objective of "context compression" are confirmed in the Paper. Installation steps, scripts, and example prompts can be found in the README and in the Model Card; vLLM support is documented there.
Unclear
Generic “X times faster” statements without identical hardware or data context are not transferable. Real throughput depends heavily on GPU, resolution, prompt, and batch size.
False/Misleading
DeepSeek-OCR is not "just a faster OCR". The core purpose is visual compression for LLM workflows. For pure, simple text recognition, classic OCR (e.g., Tesseract) may still be useful.

Source: freedeepseekocr.com
The DeepSeek-OCR demo interface allows easy uploading of documents and selecting different model sizes for processing.
Reactions & Counterpositions
Tech reports highlight the 7–20x token saving. Skeptical voices ask about robustness across layouts and languages, as well as quality loss under strong compression. Developers document setups and hurdles on specific hardware. Community posts report very fast PDF-to-Markdown processing under vLLM, but these are anecdotal.
Impacts & What it means for you
Practical benefit: Anyone bringing long PDFs, tables, forms, or reports into LLM pipelines can use DeepSeek-OCR to reduce costs and latency, provided the reconstruction remains precise enough. For fast serving, the vLLM path is worthwhile; for minimal setups, Transformers inference is sufficient. For simple, "clean" scans without layout demands, Tesseract may be more efficient.
Tips for classification: Primary sources first (Paper, README, Model Card), then your own measurements on the hardware; compare variants of prompt, resolution, and compression level.
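A hypothetical comparison loop (reusing model, tokenizer, and the MODES mapping sketched earlier; the timings only become meaningful together with a manual quality check of the outputs):
import time

# Time the same page under both prompts and two modes; review output quality by hand.
for prompt_variant in ["<image>\n<|grounding|>Convert the document to markdown.",
                       "<image>\nFree OCR."]:
    for mode_name in ['Base', 'Gundam']:
        start = time.perf_counter()
        model.infer(tokenizer, prompt=prompt_variant, image_file='sample_page.png',
                    output_path=f'bench/{mode_name}/', save_results=True,
                    **MODES[mode_name])
        print(mode_name, prompt_variant[:24], f"{time.perf_counter() - start:.1f}s")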
Source: YouTube
Open Questions
How stable are the trade-offs across languages, handwriting, scans, and fine table structures? Independent benchmarks and replication studies are still pending. How is official CPU/MPS support developing beyond community workarounds? Discussions exist, but without hard guarantees. How robust is PDF throughput under real production loads and away from A100 hardware? The README mentions examples, but no generally valid SLA values.

Source: chattools.cn
Detailed diagrams illustrate the impressive compression and performance metrics of DeepSeek OCR, underlining its efficiency.
Summary and Recommendations
To use DeepSeek-OCR effectively, the environment should be set up exactly as described in the Model Card or in the README. Start with the Transformers example and switch to vLLM for higher throughput. Adjust prompts and modes to the respective documents and weigh quality against the compression level. For pure, simple OCR cases, classic OCR remains a lean option; for long, complex documents, visual context compression plays to its strengths.