Version: 0.94

operators.pdf.parser2.PDFToRichDocParser2

class operators.pdf.parser2.PDFToRichDocParser2(field, parser_version=0)

Operator that parses a PDF into Snorkel’s RichDoc representation.

This operator parses a PDF into a Snorkel Flow-native representation of the document. The representation includes a richer text representation, spatial information, and other details, with the original formatting preserved. The RichDoc representation enables in-depth tooling for PDF-formatted data.

The output includes: a stripped raw text representation of the rich doc (rich_doc_text), a serialized RichDoc that corresponds to rich_doc_text (rich_doc_pkl), a serialized list of RichDoc objects, one per page (page_docs), and character offsets of text starting on each page (page_char_starts).

By default, this parser ignores parsing errors: documents that fail to parse are skipped. PDFs with parsing errors are logged, and the errors are raised to the user.

Parameters

Name	Type	Default	Info
field	`str`		The name of the dataframe column that contains PDF URLs.
parser_version	`Optional[int]`	`0`	The version of the parser used to parse the PDF. If not provided, the latest parser version is used.

Returns

rich_doc_text – A stripped raw text representation of the rich doc
rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text
page_docs – A serialized list of RichDoc objects, one per page
page_char_starts – Character offsets of the text starting on each page
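
The relationship between rich_doc_text and page_char_starts can be illustrated with a minimal sketch. It assumes that page_char_starts is a sorted list of integer offsets into rich_doc_text and that the first entry is 0; the reference above does not confirm either detail. The helper names (`split_text_by_page`, `page_of_offset`) and the sample values are hypothetical and are not part of the operator's API.

```python
from bisect import bisect_right
from typing import List


def split_text_by_page(rich_doc_text: str, page_char_starts: List[int]) -> List[str]:
    """Slice the document text into one string per page.

    Assumes page_char_starts is sorted and indexes into rich_doc_text.
    """
    # Each page ends where the next one starts; the last page ends at the text end.
    page_ends = page_char_starts[1:] + [len(rich_doc_text)]
    return [rich_doc_text[start:end] for start, end in zip(page_char_starts, page_ends)]


def page_of_offset(page_char_starts: List[int], offset: int) -> int:
    """Return the 0-based index of the page containing the character at `offset`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # bisect_right counts the page starts that are <= offset.
    return max(bisect_right(page_char_starts, offset) - 1, 0)


# Hypothetical sample values standing in for the operator's output columns.
sample_text = "Page one text. Page two text. Page three."
sample_starts = [0, 15, 30]

pages = split_text_by_page(sample_text, sample_starts)
for index, page_text in enumerate(pages):
    print(index, repr(page_text))
```

A lookup such as `page_of_offset(sample_starts, 16)` maps a character position, for example the start of an extracted span, back to the page it came from.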
