The Document to Text step lets you extract the text content from a document file. We use deterministic AI models to parse content from a variety of file types, including PDFs, Word documents, and more.

Options

Name | Type | Description
File | Document File | The document you want to extract text from.

Outputs

Name | Type | Description
File Contents | Plain Text | The text extracted from the document.

Tips

  • If the document has structured elements, such as tables or multiple side-by-side elements, the extracted text may be parsed incorrectly. We try our best to remediate these issues, but we recommend cleaning up the output with a Generate Text step if this happens consistently.

Supported File Types

OCR supports only the following MIME types:

  • application/pdf
  • image/gif
  • image/tiff
  • image/jpeg
  • image/png
  • image/bmp
  • image/webp

Excel spreadsheets and Word documents are parsed using third-party libraries. Any file whose MIME type is not in the list above is converted to plain text.
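
If you want to predict how a file will be handled before passing it to this step, the rules above can be sketched as a small lookup. This is only an illustration, not the step's actual implementation: the function name `classify_mime_type`, the route labels, and the specific Word and Excel MIME types are assumptions chosen for the example. Only the OCR list comes from this page.

```python
# MIME types this page lists as supported for OCR.
OCR_MIME_TYPES = {
    "application/pdf",
    "image/gif",
    "image/tiff",
    "image/jpeg",
    "image/png",
    "image/bmp",
    "image/webp",
}

# Assumption: the standard MIME types for Word and Excel files.
# The page does not say which ones the step actually recognizes.
OFFICE_MIME_TYPES = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def classify_mime_type(mime_type: str) -> str:
    """Return a hypothetical route label: 'ocr', 'library', or 'plaintext'."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    normalized = mime_type.split(";", 1)[0].strip().lower()
    if normalized in OCR_MIME_TYPES:
        return "ocr"
    if normalized in OFFICE_MIME_TYPES:
        return "library"
    # Everything else falls back to plain-text conversion.
    return "plaintext"


if __name__ == "__main__":
    for example in ("application/pdf", "application/msword", "text/csv"):
        print(example, "->", classify_mime_type(example))
```

A check like this can help decide whether to add a Generate Text cleanup step after extraction, since OCR output from scanned tables is the most likely to need it.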