In this article we will be discussing about ways and techniques to read word documents in Java using Apache POI library. The word document may contain images, tables or plain text. Apart from this a standard word file has header and footers too. Here in the following examples we will be parsing a word document by reading its different paragraph, runs, images, tables along with headers and footers. We will also take a look into identifying different styles associated with the paragraphs such as font-size, font-family, font-color etc.
Maven Dependencies
Following is the poi maven depedency required to read word documents. For latest artifacts visit here
pom.xml<dependencies> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-ooxml</artifactId> <version>3.16</version> </dependency> </dependencies>
Reading Complete Text from Word Document
The class XWPFDocument
has many methods defined to read and extract .docx
file contents. getText()
can be used to read all the texts in a .docx word document. Following is an example.
public class TextReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc); System.out.println(extractor.getText()); } catch(Exception ex) { ex.printStackTrace(); } } }
Reading Headers and Foooters of Word Document
Apache POI provides inbuilt methods to read headers and footers of a word document. Following is an example that reads and prints header and footer of a word document. The example .docx file is available in the source which can be downloaded at the end of thos article.
HeaderFooter.javapublic class HeaderFooterReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc); XWPFHeader header = policy.getDefaultHeader(); if (header != null) { System.out.println(header.getText()); } XWPFFooter footer = policy.getDefaultFooter(); if (footer != null) { System.out.println(footer.getText()); } } catch (Exception ex) { ex.printStackTrace(); } } }Output
Other Interesting Posts Java 8 Lambda Expression Java 8 Stream Operations Java 8 Datetime Conversions Random Password Generator in Java
Read Each Paragraph of a Word Document
Among the many methods defined in XWPFDocument
class, we can use getParagraphs()
to read a .docx word document paragraph wise.This method returns a list of all the paragraphs(XWPFParagraph) of a word document. Again the XWPFParagraph has many utils method defined to extract information related to any paragraph such as text alignment, style associated with the paragrpahs.
To have more control over the text reading of a word document,each paragraph is again divided into multiple runs. Run defines a region of text with a common set of properties.Following is an example to read paragraphs from a .docx word document.
ParagraphReader.javapublic class ParagraphReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); ListparagraphList = xdoc.getParagraphs(); for (XWPFParagraph paragraph : paragraphList) { System.out.println(paragraph.getText()); System.out.println(paragraph.getAlignment()); System.out.print(paragraph.getRuns().size()); System.out.println(paragraph.getStyle()); // Returns numbering format for this paragraph, eg bullet or lowerLetter. System.out.println(paragraph.getNumFmt()); System.out.println(paragraph.getAlignment()); System.out.println(paragraph.isWordWrapped()); System.out.println("********************************************************************"); } } catch (Exception ex) { ex.printStackTrace(); } } }
Reading Tables from Word Document
Following is an example to read tables present in a word document. It will print all the text rows wise.
TableReader.javapublic class TableReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); IteratorbodyElementIterator = xdoc.getBodyElementsIterator(); while (bodyElementIterator.hasNext()) { IBodyElement element = bodyElementIterator.next(); if ("TABLE".equalsIgnoreCase(element.getElementType().name())) { List tableList = element.getBody().getTables(); for (XWPFTable table : tableList) { System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows()); for (int i = 0; i < table.getRows().size(); i++) { for (int j = 0; j < table.getRow(i).getTableCells().size(); j++) { System.out.println(table.getRow(i).getCell(j).getText()); } } } } } } catch (Exception ex) { ex.printStackTrace(); } } }
Reading Styles from Word Document
Styles are associated with runs of a paragraph. There are many methods available in the XWPFRun
class to identify the styles associated with the text.There are methods to identify boldness, highlighted words, capitalized words etc.
public class StyleReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); ListparagraphList = xdoc.getParagraphs(); for (XWPFParagraph paragraph : paragraphList) { for (XWPFRun rn : paragraph.getRuns()) { System.out.println(rn.isBold()); System.out.println(rn.isHighlighted()); System.out.println(rn.isCapitalized()); System.out.println(rn.getFontSize()); } System.out.println("********************************************************************"); } } catch (Exception ex) { ex.printStackTrace(); } } }
Reading Image from Word Document
Following is an example to read image files from a word document.
public class ImageReader { public static void main(String[] args) { try { FileInputStream fis = new FileInputStream("test.docx"); XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)); Listpic = xdoc.getAllPictures(); if (!pic.isEmpty()) { System.out.print(pic.get(0).getPictureType()); System.out.print(pic.get(0).getData()); } } catch (Exception ex) { ex.printStackTrace(); } } }
Conclusion
I hope this article served you that you were looking for. If you have anything that you want to add or share then please share it below in the comment section.