operators.pdf.parser2.PDFToRichDocParser2
- class operators.pdf.parser2.PDFToRichDocParser2(field, parser_version=0)
Operator that parses a PDF into Snorkel’s RichDoc representation.
This operator parses a PDF into a Snorkel Flow-native representation of the document, including a richer text representation, spatial information, and more, with the original formatting preserved. The RichDoc representation enables tools that work in depth with PDF-formatted data.
The output includes: a stripped raw text representation of the rich doc (rich_doc_text), a serialized RichDoc that corresponds to rich_doc_text (rich_doc_pkl), a serialized list of RichDoc objects, one per page (page_docs), and character offsets of text starting on each page (page_char_starts).
By default, this parser ignores parsing errors and skips the affected documents. PDFs that fail to parse are logged, and the errors are reported to the user.
Parameters
field (str) – The name of the dataframe column that contains the PDF URLs.
parser_version (Optional[int], default 0) – The version of the parser used to parse the PDF. If not provided, the latest parser version is used.
Returns
rich_doc_text – A stripped raw text representation of the rich doc
rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text
page_docs – A serialized list of RichDoc objects, one per page
page_char_starts – Character offsets of text starting on each page
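As an illustration of how page_char_starts relates to rich_doc_text, the following minimal sketch maps a character offset in the text back to a page. It assumes that page_char_starts is a sorted list of integer offsets into rich_doc_text beginning at 0, which is not confirmed by this page. The helper names page_for_offset and page_text and the sample values are hypothetical and are not part of the Snorkel Flow API.

```python
from bisect import bisect_right


def page_for_offset(page_char_starts, offset):
    """Return the 0-based index of the page containing the character at `offset`.

    Assumes `page_char_starts` is sorted ascending and starts at 0.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # bisect_right counts the page starts that are <= offset;
    # subtracting 1 turns that count into a page index.
    return bisect_right(page_char_starts, offset) - 1


def page_text(rich_doc_text, page_char_starts, page):
    """Slice the text of one page out of the full document text."""
    start = page_char_starts[page]
    # The last page runs to the end of the text.
    if page + 1 < len(page_char_starts):
        end = page_char_starts[page + 1]
    else:
        end = len(rich_doc_text)
    return rich_doc_text[start:end]


# Hypothetical values for illustration only, not real parser output.
rich_doc_text = "Page one text. Page two text. Page three."
page_char_starts = [0, 15, 30]

# Find which page a matched span starts on.
start = rich_doc_text.index("two")
page = page_for_offset(page_char_starts, start)
```

A binary search keeps the lookup cheap for long documents, which can matter when many span offsets need to be mapped back to pages.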