Class PDFMarkedContentExtractor


  • public class PDFMarkedContentExtractor
    extends PDFStreamEngine
    This is a stream engine that extracts the marked content of a PDF.
    Version:
    $Revision$
    Author:
    koch
    • Field Detail

      • outputEncoding

        protected java.lang.String outputEncoding
        The encoding that the text will be written in, or null.
    • Constructor Detail

      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor()
                                  throws java.io.IOException
        Instantiates a new PDFMarkedContentExtractor object. This object loads its properties from PDFMarkedContentExtractor.properties and does not apply any encoding-specific conversion to the output text.
        Throws:
        java.io.IOException - If there is an error loading the properties.
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.util.Properties props)
                                  throws java.io.IOException
        Instantiates a new PDFMarkedContentExtractor object, loading all of the operator mappings from the properties object that is passed in. Does not apply any encoding-specific conversion to the output text.
        Parameters:
        props - The properties containing the mapping of operators to PDFOperator classes.
        Throws:
        java.io.IOException - If there is an error reading the properties.
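        As a rough illustration of the props parameter, the sketch below builds a java.util.Properties object that maps PDF operators to handler class names. BMC, BDC and EMC are the marked-content operators defined by the PDF specification; the handler class names and the class name OperatorMappingSketch are assumptions made for this example, so check the PDFMarkedContentExtractor.properties file shipped with your PDFBox version for the real entries.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class OperatorMappingSketch {

    // Builds operator-to-class mappings in the key=value form of a .properties file.
    // ASSUMPTION: the handler class names are illustrative and may not match
    // the classes in your PDFBox release.
    static Properties buildOperatorMappings() {
        String text =
              "BMC=org.apache.pdfbox.util.operator.BeginMarkedContentSequence\n"
            + "BDC=org.apache.pdfbox.util.operator.BeginMarkedContentSequenceWithProperties\n"
            + "EMC=org.apache.pdfbox.util.operator.EndMarkedContentSequence\n";
        Properties props = new Properties();
        try {
            // Properties.load(Reader) parses the same syntax as a .properties file.
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildOperatorMappings();
        for (String operator : props.stringPropertyNames()) {
            System.out.println(operator + " -> " + props.getProperty(operator));
        }
        // With PDFBox on the classpath, props could then be handed to the
        // constructor documented above: new PDFMarkedContentExtractor(props)
    }
}
```

        Operators that have no entry in the properties object would presumably have no handler registered, so a custom mapping should cover every operator the extraction relies on.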
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.lang.String encoding)
                                  throws java.io.IOException
        Instantiates a new PDFMarkedContentExtractor object. This object loads its properties from PDFMarkedContentExtractor.properties and applies encoding-specific conversions to the output text.
        Parameters:
        encoding - The encoding that the output will be written in.
        Throws:
        java.io.IOException - If there is an error reading the properties.
    • Method Detail

      • beginMarkedContentSequence

        public void beginMarkedContentSequence(COSName tag,
                                               COSDictionary properties)
      • endMarkedContentSequence

        public void endMarkedContentSequence()
      • xobject

        public void xobject(PDXObject xobject)
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Overrides:
        processTextPosition in class PDFStreamEngine
        Parameters:
        text - The text to process.
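        The handling of overlapping text is not spelled out here. One plausible approach, sketched below, is to drop a glyph when the same text has already been seen at almost the same position (as happens when a PDF simulates bold by drawing a character twice). Glyph, OverlapFilterSketch and the fixed tolerance are hypothetical stand-ins for illustration, not PDFBox types or the library's actual logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapFilterSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox type.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Glyphs kept so far, grouped by their text for quick duplicate lookup.
    private final Map<String, List<Glyph>> seenByText = new HashMap<>();
    // Glyphs kept so far, in the order they were processed.
    private final List<Glyph> accepted = new ArrayList<>();
    // ASSUMPTION: a fixed tolerance; a real engine might derive it from glyph width.
    private final float tolerance;

    OverlapFilterSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    // Returns true if the glyph was kept, false if it was dropped as an overlap.
    boolean process(Glyph glyph) {
        List<Glyph> sameText = seenByText.computeIfAbsent(glyph.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            boolean closeInX = Math.abs(other.x - glyph.x) < tolerance;
            boolean closeInY = Math.abs(other.y - glyph.y) < tolerance;
            if (closeInX && closeInY) {
                return false; // same text at nearly the same spot: treat as overlap
            }
        }
        sameText.add(glyph);
        accepted.add(glyph);
        return true;
    }

    List<Glyph> getAccepted() {
        return accepted;
    }

    public static void main(String[] args) {
        OverlapFilterSketch filter = new OverlapFilterSketch(0.5f);
        filter.process(new Glyph("A", 10.0f, 20.0f)); // kept
        filter.process(new Glyph("A", 10.1f, 20.0f)); // dropped: overlaps the first "A"
        filter.process(new Glyph("A", 30.0f, 20.0f)); // kept: same text, different spot
        System.out.println(filter.getAccepted().size());
    }
}
```

        Grouping by text keeps each lookup limited to glyphs that could actually be duplicates, instead of comparing against every character on the page.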
      • getMarkedContents

        public java.util.List<PDMarkedContent> getMarkedContents()