Package org.apache.poi.hwpf.extractor
Class WordExtractor
java.lang.Object
org.apache.poi.extractor.POITextExtractor
org.apache.poi.extractor.POIOLE2TextExtractor
org.apache.poi.hwpf.extractor.WordExtractor
- All Implemented Interfaces:
Closeable, AutoCloseable
Class to extract the text from a Word Document.
You should use either getParagraphText() or getText() unless you have a
strong reason otherwise.
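Because getParagraphText() returns one String per paragraph, callers often want to tidy the array before using it. The following is a minimal sketch of that post-processing, not part of POI: ParagraphCleaner and cleanParagraphs are hypothetical names, and it assumes entries may carry trailing line terminators or be blank. The WordExtractor call appears only in a comment so the block compiles without POI on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class ParagraphCleaner {
    private ParagraphCleaner() {}

    // Trims surrounding whitespace (including line terminators) from each
    // paragraph and drops entries that are empty after trimming.
    static List<String> cleanParagraphs(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                cleaned.add(trimmed);
            }
        }
        return cleaned;
    }
}

// With POI available, the input array would come from something like:
//   try (WordExtractor extractor = new WordExtractor(inputStream)) {
//       List<String> paragraphs =
//           ParagraphCleaner.cleanParagraphs(extractor.getParagraphText());
//   }
```

The try-with-resources form in the comment relies on WordExtractor implementing Closeable, as listed above.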
- Author:
- Nick Burch
-
Field Summary
Fields inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
document
-
Constructor Summary
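Constructors
Constructor - Description
WordExtractor(InputStream is) - Create a new Word Extractor
WordExtractor(POIFSFileSystem fs) - Create a new Word Extractor
WordExtractor(DirectoryNode dir)
WordExtractor(HWPFDocument doc) - Create a new Word Extractor
-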
Method Summary
Modifier and Type, Method - Description
String[] getCommentsText()
String[] getEndnoteText()
String getFooterText() - Deprecated. 3.8 beta 4
String[] getFootnoteText()
String getHeaderText() - Deprecated. 3.8 beta 4
String[] getMainTextboxText()
String[] getParagraphText() - Get the text from the word file, as an array with one String per paragraph
protected static String[] getParagraphText(Range r)
String getText() - Grab the text, based on the WordToTextConverter.
String getTextFromPieces() - Grab the text out of the text pieces.
static void main(String[] args) - Command line extractor, so people will stop moaning that they can't just run this.
static String stripFields(String text) - Removes any fields (eg macros, page markers etc) from the string.
Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation
Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem
-
Constructor Details
-
WordExtractor
Create a new Word Extractor
- Parameters:
is - InputStream containing the word file
- Throws:
IOException
-
WordExtractor
Create a new Word Extractor
- Parameters:
fs - POIFSFileSystem containing the word file
- Throws:
IOException
-
WordExtractor
- Throws:
IOException
-
WordExtractor
Create a new Word Extractor
- Parameters:
doc - The HWPFDocument to extract from
-
-
Method Details
-
main
Command line extractor, so people will stop moaning that they can't just run this.
- Throws:
IOException
-
getParagraphText
Get the text from the word file, as an array with one String per paragraph
-
getFootnoteText
-
getMainTextboxText
-
getEndnoteText
-
getCommentsText
-
getParagraphText
-
getHeaderText
Deprecated. 3.8 beta 4
Grab the text from the headers
-
getTextFromPieces
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. Fast too.
-
getText
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
- Specified by:
getText in class POITextExtractor
- Returns:
- All the text from the document
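The trade-off described above (getText() is cleaner but slower, while getTextFromPieces() still works when the text piece -> paragraph mapping is broken) suggests a try-then-fall-back pattern. This is a minimal sketch of that pattern, not something POI provides: TextFallback and extractWithFallback are hypothetical names, and it assumes a broken mapping shows up as a runtime exception from the primary call, which this page does not state.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Apache POI.
final class TextFallback {
    private TextFallback() {}

    // Returns the primary result, or the fallback result if the primary
    // supplier throws a RuntimeException (assumed failure mode).
    static String extractWithFallback(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }
}

// With POI available, the suppliers could be method references:
//   String text = TextFallback.extractWithFallback(
//       extractor::getText, extractor::getTextFromPieces);
```

Text obtained through the fallback may still contain the stray characters that getTextFromPieces() is documented to leave in, so callers may want to clean it up afterwards.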
-
stripFields
Removes any fields (eg macros, page markers etc) from the string.
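To illustrate what removing fields from a string can involve, here is a simplified sketch. It is not POI's implementation. It assumes field boundaries appear in the text as the control characters 0x13 (field begin), 0x14 (separator between field code and visible result) and 0x15 (field end), and it ignores nested fields. FieldStripSketch and stripFieldsSketch are hypothetical names.

```java
// Hypothetical illustration, not the Apache POI implementation.
final class FieldStripSketch {
    // Assumed marker characters for field boundaries in the extracted text.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    private FieldStripSketch() {}

    // Drops the marker characters and the field code between FIELD_BEGIN and
    // FIELD_SEPARATOR (or FIELD_END when there is no separator), keeping any
    // visible field result. Nested fields are not handled.
    static String stripFieldsSketch(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inFieldCode = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inFieldCode = true;
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inFieldCode = false;
            } else if (!inFieldCode) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For real documents, the library's own stripFields(String text) is the method to call. The sketch only shows the general idea of skipping field codes while keeping their displayed result.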
-