Regex pattern to match a sentence

A sentence is constructed according to the following rules.

1. It starts with a capital letter, preceded by a space that follows the end of the previous sentence. The space is not required before the first sentence.

 - This implies a word boundary (\b) at the start of the sentence.

2. A sentence is made up of one or more words. The words can be accompanied by non-word (\W) characters such as @ and ^, e.g. He bought two eggs @USD 2.

 - Some of the non-word characters are mainly used as punctuation marks (\p{Punct}). They include !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

3. The words and non-word characters are separated by whitespace (\p{Space}), e.g. at home, a, this building.
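The character classes named in rules 1 to 3 can be tried out in isolation. The following is a minimal sketch, not part of the original program; the class name CharacterClassDemo and its two helper methods are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class CharacterClassDemo {

    // Returns true when the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns true when the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("\\w", "a"));         // true: a letter is a word character
        System.out.println(matches("\\W", "@"));         // true: @ is a non-word character
        System.out.println(matches("\\p{Punct}", "@"));  // true: @ is punctuation
        System.out.println(matches("\\p{Space}", " "));  // true: a space is whitespace

        // \b is zero-width: it matches the position between a non-word and a word character.
        System.out.println(found("\\bMary", "said Mary"));  // true
        System.out.println(found("\\bMary", "RoseMary"));   // false

        // Splitting the example from rule 2 on whitespace yields its six tokens.
        String[] tokens = "He bought two eggs @USD 2.".split("\\p{Space}");
        System.out.println(tokens.length);  // 6
    }
}
```

Note that in a Java string literal every regex backslash is doubled, so \w is written as "\\w".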

4. TIP. When processing sentences that have words enclosed in characters like “” and ‘’, Java IDEs like NetBeans will format the “” and ‘’ as special-looking (curly) quotation marks, since String variables are initialized by enclosing their value between straight double quotes (""). This formatting helps differentiate between the outer enclosing quotes and the quotes inside the sentence.

e.g. Copy and paste the string below into a String variable declared in the NetBeans IDE.

Mary’s mother said to her, “get in the house before it starts raining”.

See the image below for the applied formatting.

You will notice that the quotation marks are formatted so that they can sit inside the string. This formatted form of the quotation marks needs to be handled in our regex pattern. This is particularly helpful when copying and pasting text for processing from a website, a Word document or a Notepad document. Alternatively, when the text uses straight double quotes ("), you will need to escape each one with a single backslash, i.e. \" (e.g. Mary's mother said to her, \"get in the house before it starts raining\".)

The " also needs to be included in our regex pattern. For single quotes, e.g. Mary's in our sample string, there is no need for escaping since the String variable is enclosed within double quotes, i.e. a String enclosed within "" can contain single quotes. However, when you copy and paste a string containing ‘’ or ’, the characters keep their special formatting, and the specially formatted character needs to be handled in our regex.
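To see why the specially formatted quotation marks matter, it helps to treat them as characters in their own right. The sketch below assumes the formatted marks are the Unicode characters U+201C, U+201D and U+2019, and writes them as \u escapes so the source file stays plain ASCII. The class name CurlyQuoteDemo and the two simplified patterns are hypothetical and only illustrate the point; they are not the final sentence regex.

```java
import java.util.regex.Pattern;

public class CurlyQuoteDemo {

    // Character class that knows only the straight quotes ' and ".
    static final Pattern ASCII_ONLY = Pattern.compile("[\\w ,'\"]+\\.");

    // Same class extended with the curly quotes U+201C, U+201D and U+2019.
    static final Pattern WITH_CURLY = Pattern.compile("[\\w ,'\"\u201C\u201D\u2019]+\\.");

    public static void main(String[] args) {
        // The sample sentence as it arrives when pasted with curly quotes.
        String copied = "Mary\u2019s mother said to her, "
                + "\u201Cget in the house before it starts raining\u201D.";

        System.out.println(ASCII_ONLY.matcher(copied).matches());  // false
        System.out.println(WITH_CURLY.matcher(copied).matches());  // true

        // The straight and curly double quotes have different code points.
        System.out.println((int) '"' + " vs " + (int) '\u201C');   // 34 vs 8220
    }
}
```

The first pattern fails on the very first curly apostrophe, which is why the curly characters have to appear in the character class alongside the straight ones.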

We now have two flavors of the regex, as shown below.

Regex 1.

Pattern pattern = Pattern.compile("\\b[\\[\\]\\w, '\":;$^@#%(){}“”’-]+[.?!]");

This regex states that a sentence consists of one or more space-separated words. The words can also contain special symbols such as @, # and %. A sentence ends with ?, ! or .
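As a quick check of Regex 1, the sketch below runs it over a short two-sentence string built from the example in rule 2. The class name Regex1Demo, the helper findSentences and the second sentence of the test string are hypothetical additions. The curly quotes in the pattern are written as \u201C, \u201D and \u2019, which denote the same characters as in the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex1Demo {

    // Regex 1, with the curly quotes written as Unicode escapes.
    static final Pattern REGEX_1 = Pattern.compile(
            "\\b[\\[\\]\\w, '\":;$^@#%(){}\u201C\u201D\u2019-]+[.?!]");

    // Collects every substring that Regex 1 recognizes as a sentence.
    static List<String> findSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = REGEX_1.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : findSentences("He bought two eggs @USD 2. Did she pay?")) {
            System.out.println(sentence);
        }
        // Prints:
        // He bought two eggs @USD 2.
        // Did she pay?
    }
}
```

Because each match starts at a word boundary (\b), the space between the two sentences is not part of either match.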

 

Regex 2.

Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");

 

Regex 2 is an enhancement of Regex 1. It uses the POSIX character classes \p{Space} to indicate whitespace and \p{Punct} to indicate punctuation. Punctuation is one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

However, we exclude !, ? and . from the character class (through the intersection &&[^.?!]) since we want them to appear only at the end of a sentence. This means our sentences can be statements, questions or exclamations.
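The practical difference between the two flavors shows up with a symbol that Regex 1 does not list, such as &. The following sketch compares both patterns on one hypothetical test sentence; the class name RegexComparisonDemo and the helper firstMatch are illustrative names, and the curly quotes are again written as Unicode escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexComparisonDemo {

    // Regex 1: an explicit list of allowed symbols (no & in the list).
    static final Pattern REGEX_1 = Pattern.compile(
            "\\b[\\[\\]\\w, '\":;$^@#%(){}\u201C\u201D\u2019-]+[.?!]");

    // Regex 2: all punctuation is allowed, minus the sentence terminators . ? !
    static final Pattern REGEX_2 = Pattern.compile(
            "\\b[\\w\\p{Space}\u201C\u201D\u2019\\p{Punct}&&[^.?!]]+[.?!]");

    // Returns the first sentence the pattern finds, or null when there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "Tom & Jerry is fun.";
        System.out.println(firstMatch(REGEX_1, text));  // Jerry is fun.
        System.out.println(firstMatch(REGEX_2, text));  // Tom & Jerry is fun.
    }
}
```

Regex 1 cannot cross the & and so only picks up the tail of the sentence, while Regex 2 accepts it through \p{Punct} and returns the whole sentence.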

5. With the above rules and our regex patterns, we can now match a sentence or sentences in a paragraph. A simple use case for a program that counts the sentences in a paragraph is an essay writing competition in which, given a topic, the contestants must write an essay of no more than x sentences, with x being, say, 20.

The best essay with <= 20 sentences wins the competition.

Below is a sample Java program that accomplishes this task.

 

package devsought;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexToMatchSentence {

    public static void main(String... args) {

        // To test each of the scenarios discussed, comment out the active str variable and uncomment one of the others below.

        String str = "Lorem Ipsum is simply dummy text of the (printing) and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";

        // String str = "In academic writing, readers expect each paragraph to have a sentence or two that captures its main point. They’re often called “topic sentences,” though many writing instructors prefer to call them “key sentences”. There are at least two downsides of the phrase “topic sentence”. First, it makes it seem like the paramount job of that sentence is simply to announce the topic of the paragraph. Second, it makes it seem like the topic sentence must always be a single grammatical sentence. Calling it a “key sentence” reminds us that it expresses the central idea of the paragraph. And sometimes a question or a two-sentence construction functions as the key.";

        // String str = "Key sentences in academic writing do two things. First, they establish the main point that the rest of the paragraph supports. Second, they situate each paragraph within the sequence of the argument, a task that requires transitioning from the prior paragraph. Consider these two examples:[2].";          

        // String str = "In academic writing, readers expect each paragraph to have a sentence or two that captures its main point. They’re often called \"topic sentences,\" though many writing instructors prefer to call them “key sentences”. There are at least two downsides of the phrase “topic sentence”. First, it makes it seem like the paramount job of that sentence is simply to announce the topic of the paragraph. Second, it makes it seem like the topic sentence must always be a single grammatical sentence. Calling it a “key sentence” reminds us that it expresses the central idea of the paragraph. And sometimes a question or a two-sentence construction functions as the key.";

        // String str = "Mary’s mother said to her, “get in the house before it starts raining”.";

        // String str="Mary's mother said to her, \"get in the house before it starts raining\".";

        // String str="It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).";

        // Regex 1
        //Pattern pattern = Pattern.compile("\\b[\\[\\]\\w, '\":;$^@#%(){}“”’-]+[.?!]");
        // or use Regex 2 below
        Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");
        Matcher matcher = pattern.matcher(str);

        int count = 0;
        // Each match is one sentence: print it and count it.
        while (matcher.find()) {
            System.out.println(matcher.group());
            count++;
        }

        System.out.println("Paragraph has " + count + " sentences.");
    }
}

The above program outputs the following.

Lorem Ipsum is simply dummy text of the (printing) and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Paragraph has 4 sentences.
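For the essay competition use case, the sentence count still has to be compared against the limit. A minimal sketch of that last step follows, assuming Regex 2 as the sentence pattern; the class name EssayLimitCheck, the methods countSentences and isWithinLimit, and the sample essay are hypothetical and not part of the program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EssayLimitCheck {

    // Regex 2, with the curly quotes written as Unicode escapes.
    static final Pattern SENTENCE = Pattern.compile(
            "\\b[\\w\\p{Space}\u201C\u201D\u2019\\p{Punct}&&[^.?!]]+[.?!]");

    // Counts the sentences the pattern finds in the text.
    static int countSentences(String text) {
        Matcher matcher = SENTENCE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // An essay qualifies when it has no more than maxSentences sentences.
    static boolean isWithinLimit(String essay, int maxSentences) {
        return countSentences(essay) <= maxSentences;
    }

    public static void main(String[] args) {
        String essay = "Reading is fun. Do you agree? I do!";
        System.out.println(countSentences(essay));     // 3
        System.out.println(isWithinLimit(essay, 20));  // true
        System.out.println(isWithinLimit(essay, 2));   // false
    }
}
```

Keeping the count in its own method makes it easy to swap in Regex 1 or to change the limit without touching the matching loop.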

Sample paragraphs to test this program were obtained from the sources below.

  1. https://courses.lumenlearning.com/waymaker-level3-english/chapter/text-the-perfect-paragraph/
  2. https://www.lipsum.com/

Feel free to use the above code as is, or make adjustments where necessary.

The above code is also available on GitHub.

 

About the Author - John Kyalo Mbindyo (BSc Computer Science) is a Senior Application Developer currently working at NCBA Bank Group, Nairobi, Kenya. He is passionate about making programming tutorials and sharing his knowledge with other software engineers across the globe. You can learn more about him and follow him on GitHub.