Version: 0.94

operators.pdf.parser2.PDFToRichDocParser2

class operators.pdf.parser2.PDFToRichDocParser2(field, parser_version=0)

Operator that parses a PDF into Snorkel’s RichDoc representation.

This operator parses a PDF into a Snorkel-Flow-native representation of the document, which includes a richer text representation and spatial information, with the original formatting preserved. The RichDoc representation enables tools that work in depth with PDF-formatted data.

The output includes: a stripped raw text representation of the rich doc (rich_doc_text), a serialized RichDoc that corresponds to rich_doc_text (rich_doc_pkl), a serialized list of RichDoc objects, one per page (page_docs), and the character offsets at which the text of each page starts (page_char_starts).

By default, this parser does not stop on parsing errors. Documents that fail to parse are skipped, and their errors are logged and surfaced to the user.

Parameters

Name           | Type          | Default | Info
field          | str           |         | The name of the dataframe column that contains the PDF URLs.
parser_version | Optional[int] | 0       | The version of the parser used to parse the PDF. If not provided, the latest parser version is used.

Returns

  • rich_doc_text – A stripped raw text representation of the rich doc

  • rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text

  • page_docs – A serialized list of RichDoc objects, one per page

  • page_char_starts – The character offsets at which the text of each page starts
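
To illustrate how per-page character offsets such as page_char_starts can be used, here is a minimal, self-contained sketch that maps a character offset in the full text back to the page it falls on. It does not call the Snorkel Flow SDK. The helper names (compute_page_char_starts, page_for_offset), the sample pages, and the assumption that page texts are concatenated with no separator are all hypothetical, and the operator's actual output may be laid out differently.

```python
from bisect import bisect_right
from typing import List


def compute_page_char_starts(page_texts: List[str]) -> List[int]:
    """Return the offset at which each page's text begins in the joined text.

    Assumption (not confirmed by the docs): pages are concatenated
    with no separator between them.
    """
    starts = []
    offset = 0
    for text in page_texts:
        starts.append(offset)
        offset += len(text)
    return starts


def page_for_offset(page_char_starts: List[int], char_offset: int) -> int:
    """Return the 0-based index of the page that contains char_offset."""
    if char_offset < 0:
        raise ValueError("char_offset must be non-negative")
    # bisect_right counts the page starts that are <= char_offset;
    # subtracting 1 gives the index of the last such page.
    return bisect_right(page_char_starts, char_offset) - 1


# Hypothetical three-page document standing in for parsed PDF text.
pages = ["Invoice 001\n", "Line items\n", "Total due\n"]
full_text = "".join(pages)
starts = compute_page_char_starts(pages)

# Locate a span in the full text and find which page it came from.
hit = full_text.index("Total")
page_index = page_for_offset(starts, hit)
```

A binary search keeps the lookup fast even for long documents, which matters when many spans in rich_doc_text need to be traced back to their source page.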