- init
- extractText
- writeDebugImages
- clear
- terminate
- exportData
- download
- SortedInputFiles
- importFiles
- recognize
Initialize the program and optionally pre-load resources.
- params (Object?)
  - params.pdf (boolean): Load PDF renderer. (optional, default false)
  - params.ocr (boolean): Load OCR engine. (optional, default false)
  - params.font (boolean): Load built-in fonts. (optional, default false)

The PDF renderer and OCR engine are loaded automatically when needed, so the only reason to set pdf or ocr to true is to pre-load them.
Extracts text from image and PDF files with a single function call.
By default, existing text content is extracted for text-native PDF files; otherwise text is extracted using OCR.
To control how text from PDF files is handled, set the options in the opt.usePDFText object.
For more control, use init, importFiles, recognize, and exportData separately.
- files
- langs (Array<string>) (optional, default ['eng'])
- outputFormat (optional, default 'txt')
- options (optional, default {})
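As a rough sketch of the single-call path, the helper below forwards to extractText with the documented defaults written out. The `api` parameter is an assumption: it stands for whatever object exposes the functions in this reference and is passed in rather than imported. `extractEnglishText` is a hypothetical name.

```javascript
// Hypothetical helper. `api` is assumed to be the object that exposes the
// functions documented in this reference (passed in rather than imported).
async function extractEnglishText(api, files) {
  // Arguments in documented order: files, langs, outputFormat, options.
  // The last three are the documented defaults, spelled out for clarity.
  return api.extractText(files, ['eng'], 'txt', {});
}
```

Because the last three arguments are optional, calling extractText with only the files argument should behave the same way under the documented defaults.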
Clears all document-specific data.
Terminates the program and releases resources.
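One way to combine init, clear, and terminate is sketched below: initialize once, clear document-specific data between batches, and terminate at the end. This is an illustration rather than a prescribed pattern. `api` is assumed to be the object exposing the documented functions, and `processBatches` and `handleBatch` are hypothetical names.

```javascript
// Hypothetical helper. `api` is assumed to expose the documented functions.
// `handleBatch` is a caller-supplied async function that does the real work.
async function processBatches(api, batches, handleBatch) {
  // Pre-load the OCR engine once (optional; it would otherwise load on demand).
  await api.init({ ocr: true });
  try {
    for (const files of batches) {
      await handleBatch(api, files);
      // Drop document-specific data before moving to the next batch.
      await api.clear();
    }
  } finally {
    // Release resources even if a batch throws.
    await api.terminate();
  }
}
```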
Export active OCR data to specified format.
- format ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text") (optional, default 'txt')
- minPage (number): First page to export. (optional, default 0)
- maxPage (number): Last page to export (inclusive). -1 exports through the last page. (optional, default -1)
Returns Promise<(string | ArrayBuffer)>
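The sketch below shows one way to call exportData with a page range and to handle the documented return type of string or ArrayBuffer. `api` is assumed to be the object exposing the documented functions, and `exportPages` is a hypothetical wrapper, not part of the library.

```javascript
// Hypothetical wrapper. `api` is assumed to expose the documented exportData.
async function exportPages(api, format, minPage = 0, maxPage = -1) {
  // Defaults mirror the documented ones: start at page 0, -1 means "through the last page".
  const data = await api.exportData(format, minPage, maxPage);
  // The documented return type is string | ArrayBuffer; report which one arrived.
  return { data, kind: typeof data === 'string' ? 'text' : 'binary' };
}
```

For instance, `exportPages(api, 'hocr', 0, 2)` would request the first three pages, assuming page numbers are zero-based as the default of 0 suggests.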
Runs exportData and saves the result as a download (browser) or local file (Node.js).
- format ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text")
- fileName (string)
- minPage (number): First page to export. (optional, default 0)
- maxPage (number): Last page to export (inclusive). -1 exports through the last page. (optional, default -1)
An object with this shape can be used to provide input to the importFiles function,
without needing that function to figure out the file types.
This is required when using ArrayBuffer inputs.
Type: Object
- pdfFiles (Array<File> | Array<string> | Array<ArrayBuffer>)?
- imageFiles (Array<File> | Array<string> | Array<ArrayBuffer>)?
- ocrFiles (Array<File> | Array<string> | Array<ArrayBuffer>)?
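Since ArrayBuffer inputs carry no file extension, the caller has to state each buffer's type. The sketch below groups labeled buffers into an object with the SortedInputFiles shape. The `{ kind, data }` entry format and the `toSortedInputFiles` name are assumptions made for illustration; only the three property names come from this reference.

```javascript
// Hypothetical helper: groups caller-labeled inputs into the SortedInputFiles shape.
// Each entry is assumed to look like { kind: 'pdf' | 'image' | 'ocr', data }.
function toSortedInputFiles(entries) {
  const keyByKind = new Map([
    ['pdf', 'pdfFiles'],
    ['image', 'imageFiles'],
    ['ocr', 'ocrFiles'],
  ]);
  const sorted = {};
  for (const { kind, data } of entries) {
    const key = keyByKind.get(kind);
    if (!key) throw new Error(`Unknown input kind: ${kind}`);
    // All three properties are optional, so a key is only created when needed.
    if (!sorted[key]) sorted[key] = [];
    sorted[key].push(data);
  }
  return sorted;
}
```

The resulting object can then be passed to importFiles in place of a plain array.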
Import files for processing.
An object with pdfFiles, imageFiles, and ocrFiles arrays can be provided to import multiple types of files.
Alternatively, for File objects (browser) and file paths (Node.js), a single array can be provided, which is sorted based on extension.
- files (Array<File> | FileList | Array<string> | SortedInputFiles)
Recognize all pages in active document.
Files for recognition should already be imported using importFiles before calling this function.
The results of recognition can be exported by calling exportData after this function.
- options (Object) (optional, default {})
  - options.mode ("speed" | "quality"): Recognition mode. (optional, default 'quality')
  - options.langs (Array<string>): Language(s) in document. (optional, default ['eng'])
  - options.modeAdv ("lstm" | "legacy" | "combined"): Alternative method of setting recognition mode. (optional, default 'combined')
  - options.combineMode ("conf" | "data" | "none"): Method of combining OCR results. Used if OCR data already exists. (optional, default 'data')
  - options.vanillaMode (boolean): Whether to use the vanilla Tesseract.js model. (optional, default false)
  - options.config (Object<string, string>): Config params to pass to Tesseract.js. (optional, default {})
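The step-by-step flow (init, importFiles, recognize, then export) can be sketched roughly as follows. `api` is assumed to be the object exposing the documented functions, `recognizeToText` is a hypothetical name, and the choice of 'speed' mode is only an example value.

```javascript
// Hypothetical helper. `api` is assumed to expose the documented functions.
async function recognizeToText(api, files, langs = ['eng']) {
  // Pre-load the OCR engine; init's params are optional.
  await api.init({ ocr: true });
  // Files must be imported before recognize is called.
  await api.importFiles(files);
  // 'speed' trades accuracy for time; the documented default is 'quality'.
  await api.recognize({ mode: 'speed', langs });
  // Export the recognized text; other documented formats include 'pdf' and 'hocr'.
  return api.exportData('txt');
}
```

This sketch leaves cleanup to the caller, who would call terminate once no further documents need processing.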