This tutorial covers setting up Stanford NLP in the Eclipse IDE with Maven. We will create an example that tokenizes raw text, using Maven to build the project and define the dependencies related to Stanford NLP. Apart from setting up Stanford NLP in Eclipse, we will also look at how DocumentPreprocessor and PTBTokenizer can be used to tokenize raw text.
What is Stanford Tokenizer
The Stanford Tokenizer divides text into a sequence of tokens, which roughly correspond to "words". Stanford also provides PTBTokenizer, a tokenizer for formal English modeled on Penn Treebank conventions.
We will create an example using both tokenizers to tokenize raw text.
Project Structure
Maven Dependencies
pom.xml
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.5.0</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.5.0</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>
Implementing StandfordTokenizer Using DocumentPreprocessor
StandfordTokenizer.java
package com.devglan;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;

import java.util.List;

public class StandfordTokenizer {

    public DocumentPreprocessor tokenize(String fileName) {
        // DocumentPreprocessor reads the file and splits it into sentences,
        // each of which is a List of HasWord tokens
        DocumentPreprocessor dp = new DocumentPreprocessor(fileName);
        for (List<HasWord> sentence : dp) {
            System.out.println(sentence);
        }
        return dp;
    }
}
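DocumentPreprocessor also accepts a Reader, so the same approach works on an in-memory string without a file. The following is a minimal sketch, not part of the original example; the class name and sample sentence are illustrative assumptions.

```java
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class InMemoryTokenizerSketch {

    public static void main(String[] args) {
        // Two sentences of raw text, supplied via a StringReader instead of a file
        String text = "Stanford NLP is great. It tokenizes raw text.";
        DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
        List<String> sentences = new ArrayList<>();
        for (List<HasWord> sentence : dp) {
            sentences.add(sentence.toString());
        }
        // The two-sentence input should yield two sentence token lists
        System.out.println(sentences.size());
    }
}
```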
Implementing PTBTokenizerExample Using Stanford PTBTokenizer
PTBTokenizerExample.java
package com.devglan;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class PTBTokenizerExample {

    public Set<CoreLabel> tokenize(String fileName) throws FileNotFoundException {
        Set<CoreLabel> labels = new HashSet<>();
        // The third argument is an options string; an empty string uses the defaults
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader(fileName),
                new CoreLabelTokenFactory(), "");
        while (ptbt.hasNext()) {
            CoreLabel label = ptbt.next();
            System.out.println(label);
            labels.add(label);
        }
        return labels;
    }
}
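Like DocumentPreprocessor, PTBTokenizer works with any Reader, so raw strings can be tokenized directly. The sketch below is an illustrative assumption, not part of the original example; it shows how punctuation is split into separate tokens.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.StringReader;

public class PTBStringSketch {

    public static void main(String[] args) {
        // Tokenize an in-memory string; the empty options string uses the defaults
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
                new StringReader("Hello, world!"),
                new CoreLabelTokenFactory(), "");
        int count = 0;
        while (ptbt.hasNext()) {
            // PTB conventions split "Hello, world!" into: Hello , world !
            System.out.println(ptbt.next().word());
            count++;
        }
        System.out.println("tokens: " + count);
    }
}
```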
Testing the Application
Following are some test cases for the Stanford tokenizer.
TokenizerTest.java
package com.devglan;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.DocumentPreprocessor;
import org.junit.Assert;
import org.junit.Test;

import java.io.IOException;
import java.util.Set;

public class TokenizerTest {

    @Test
    public void SentenceDetectorTest() throws IOException {
        StandfordTokenizer tokenizer = new StandfordTokenizer();
        DocumentPreprocessor dp = tokenizer.tokenize("standford.txt");
        Assert.assertTrue(dp != null);
    }

    @Test
    public void SentencePosDetectorTest() throws IOException {
        PTBTokenizerExample tokenizer = new PTBTokenizerExample();
        Set<CoreLabel> labels = tokenizer.tokenize("C:/D/workspaces/standfordsetupdemo/src/main/resources/standford.txt");
        Assert.assertTrue(labels != null && labels.size() > 0);
    }
}
Output
Conclusion
I hope this article gave you what you were looking for. If you have anything to add or share, please do so in the comment section below.