1. Convert scanned PDFs to text before upload. If your PDF is a scan (you cannot select text when you open it in a PDF reader), run OCR before uploading. Many platforms attempt OCR automatically, but the quality of their results varies. Running OCR yourself with a tool you can verify, and inspecting the output for obvious errors before upload, gives you confidence in the text the model will receive. For Arabic, French, and multilingual documents, choose an OCR tool with explicit support for those languages.
2. Remove or suppress running headers and footers. A 200-page report where "CONFIDENTIAL — Q3 2024 — ACME CORPORATION" appears at the top and bottom of every page adds thousands of tokens of noise to the context. This repetitive content is not harmful in small amounts, but in long documents it consumes context space that could hold meaningful content and fragments the natural flow of sentences across page boundaries. Some PDF editors allow headers and footers to be suppressed; alternatively, a text extraction step that filters repeating lines is effective.
3. Convert to Markdown before upload. Instead of uploading the PDF directly, convert it to Markdown and upload the Markdown text (paste it into the conversation, or save it as a .txt or .md file). Markdown preserves structural signals — headings, bullet points, bold text, tables — that help the model understand document hierarchy. A heading in Markdown tells the model "this is a major topic boundary"; the same text in a raw PDF extraction is indistinguishable from body text. This single step consistently improves summarization, extraction, and question-answering quality on complex documents.
4. Split very long documents by section. For documents beyond the platform's effective context limit, splitting into logical sections and processing each section separately produces better results than uploading the full document and hoping the model handles truncation gracefully. Ask a question about the introduction in one conversation, about the methodology in another. The overhead is modest and the quality improvement is significant for documents where detail in later sections matters.
5. Write a clear system prompt or preamble. Before the model begins processing, tell it what the document is, what you want from it, and what format the output should take. "This is a legal contract. Identify all clauses that limit liability, and list them as a numbered list with the relevant clause number" is dramatically more useful than "summarize this." The model's ability to extract specific information depends as much on the clarity of the instruction as on the quality of the document.
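
Steps 2 and 4 can be automated with a short script. The following is a minimal sketch, not a definitive implementation. It assumes you already have the document's text extracted page by page (a list of strings) and that section boundaries are marked with Markdown `#` headings. The function names `strip_repeating_lines` and `split_markdown_by_heading`, and the `min_fraction` and `level` parameters, are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def _key(line):
    # Normalize digits so "Page 3 of 200" and "Page 4 of 200" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeating_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least `min_fraction` of pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # A set counts each distinct line at most once per page.
        counts.update({_key(line) for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if _key(line) not in repeated)
        for page in pages
    ]


def split_markdown_by_heading(text, level=2):
    """Split Markdown into (title, body) pairs at '#' headings of depth <= `level`."""
    pattern = re.compile(r"^(#{1,%d})\s+(.*)$" % level)
    sections = []
    title, body = "preamble", []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            # Close the previous section, skipping an empty preamble.
            if title != "preamble" or any(l.strip() for l in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    sections.append((title, "\n".join(body).strip()))
    return sections
```

A typical use is `sections = split_markdown_by_heading("\n".join(strip_repeating_lines(pages)))`, after which each section can be sent in its own conversation with the preamble from step 5. The sketch does not recognize headings inside fenced code blocks, and the digit normalization could in principle match legitimate lines that differ only in numbers, so inspect what was removed before relying on the output.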