Thursday, September 13, 2018

Parsing natural language using Regular Expression patterns and extractors

Bot Libre 7 adds support for using Regular Expressions in patterns, templates, and scripts.

Regular expressions (regex) define a pattern syntax for parsing text. Unlike AIML and Bot Libre patterns, regex patterns are character based, not word based, so they can match specific types of words and word sequences such as numbers, dates, times, and currency.

For example, the following regex matches a number,

/\d+

and this regex would match a date in year-month-day form, such as 2018-09-13,

/^(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])$
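To try these two expressions outside Bot Libre, here is a minimal sketch using Python's `re` module. Python is only a stand-in here and its engine may differ from Bot Libre's in edge cases; the helper names `is_number` and `is_date` are made up for this illustration.

```python
import re

# The leading "/" in the Bot Libre examples only marks the start of a regex,
# so it is omitted here. Python's `re` module stands in for Bot Libre's
# engine, which is an assumption made purely for illustration.
NUMBER = re.compile(r"\d+")
DATE = re.compile(r"^(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])$")


def is_number(word: str) -> bool:
    """Return True when the whole word consists of digits."""
    return NUMBER.fullmatch(word) is not None


def is_date(word: str) -> bool:
    """Return True for year-month-day dates between 1900 and 2099."""
    return DATE.match(word) is not None


if __name__ == "__main__":
    for word in ["42", "forty", "2018-09-13", "2018/13/01"]:
        print(word, is_number(word), is_date(word))
```

Note that the date expression accepts "-", "/", or "." as separators and does not check that both separators are the same, or that the day exists in the given month.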

Bot Libre allows regular expressions to be used in AIML patterns and in Bot Libre response patterns. Bot Libre's scripting language, Self, also allows regex in patterns and provides extractor functions that use regex to extract data from a user's input.

To define a regex in an AIML or Bot Libre pattern, just start the regex with the "/" character.

AIML defines pattern wildcards such as * and ^ which can match multiple words in a phrase, but they will match any word and are not restricted to specific types of words. Bot Libre lets you include regex inside AIML patterns to match specific types of words. Just like with the * wildcard, the word that was matched by the regex can be accessed in the template using the <star/> tag.

<category>
<pattern>my email is /.+\@.+\..+</pattern>
<template>Okay, I will email you at <star/></template>
</category>
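To make the word-level matching concrete, here is a rough Python sketch of the idea. It is a simplified, hypothetical model and not Bot Libre's actual matcher: the function `match_pattern` is invented for this example, it ignores the * and ^ wildcards, and it assumes that a "/" word must match exactly one whole input word.

```python
import re
from typing import List, Optional


def match_pattern(pattern: str, sentence: str) -> Optional[List[str]]:
    """Match a sentence word by word; return the regex-matched words or None.

    Simplified, hypothetical model: words starting with "/" are treated as
    regexes that must match one whole input word; all other words must be
    equal, ignoring case. Wildcards such as * and ^ are not handled.
    """
    pattern_words = pattern.split()
    input_words = sentence.split()
    if len(pattern_words) != len(input_words):
        return None
    stars = []
    for expected, actual in zip(pattern_words, input_words):
        if expected.startswith("/"):
            # Strip the "/" marker and test the remaining regex on this word.
            if re.fullmatch(expected[1:], actual) is None:
                return None
            stars.append(actual)
        elif expected.lower() != actual.lower():
            return None
    return stars


if __name__ == "__main__":
    stars = match_pattern(r"my email is /.+\@.+\..+", "My email is joe@example.com")
    if stars is not None:
        print(f"Okay, I will email you at {stars[0]}")
```

The returned list plays the role of the star values, so the first entry corresponds to what <star/> would output in the template above.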

Normally regex is used to match a specific word, but you can also use regex to match an entire phrase.

For this to work, the entire pattern must be the regex, and the pattern can have no other words. The "()" characters in the regex define groups, and each group becomes a star variable.

<category>
<pattern>/(?i)what\sis\s(.*)</pattern>
<template>I have no idea what <star/> is.</template>
</category>
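The group capture can be tried outside Bot Libre with a short Python sketch. It assumes the regex must match the whole input, which is how this post describes phrase patterns; the function name `respond` is made up for the example.

```python
import re
from typing import Optional

# Same expression as in the category above, without the leading "/" marker.
# (?i) makes the match case-insensitive; (.*) is the group that becomes the star.
WHAT_IS = re.compile(r"(?i)what\sis\s(.*)")


def respond(sentence: str) -> Optional[str]:
    """Return the templated reply, or None when the phrase does not match."""
    match = WHAT_IS.fullmatch(sentence)
    if match is None:
        return None
    return f"I have no idea what {match.group(1)} is."


if __name__ == "__main__":
    print(respond("What is a regular expression"))
```

With more than one "()" group in the expression, `match.group(2)`, `match.group(3)`, and so on would correspond to the additional star values.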

Patterns and regex can also be used in Bot Libre response lists, similar to AIML.

Pattern("my email is /.+\@.+\..+")
Template("Okay, I will email you at {star}")

Pattern("/(?i)what\sis\s(.*)")
Template("I have no idea what {star} is.")

In a response list template you can also use Self extractor functions.

I am 22 years old
Template("I will remember that you are { var age = sentence.exec("\d+"); speaker.age = age; age } years old.")

Regex can also be used in Self patterns and functions.

state Math {
    pattern "^ /\d+ \* /\d+ ^" template "{star[1].toNumber() * star[2].toNumber()}";
    pattern "^ /\d+ / /\d+ ^" template "{star[1].toNumber() / star[2].toNumber()}";
    pattern "^ /\d+ \+ /\d+ ^" template "{star[1].toNumber() + star[2].toNumber()}";
    pattern "^ /\d+ \- /\d+ ^" template "{star[1].toNumber() - star[2].toNumber()}";
}
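For comparison, the same four arithmetic patterns can be approximated in Python. This is only a sketch of the idea, not how Self evaluates the state: the name `calculate` is hypothetical, the four patterns are folded into one expression, and division by zero returns None here because the post does not say how Self handles it.

```python
import operator
import re
from typing import Optional, Union

# One entry per pattern in the Math state above.
OPERATORS = {
    "*": operator.mul,
    "/": operator.truediv,
    "+": operator.add,
    "-": operator.sub,
}

# Two whole-word numbers separated by an operator word. The lookarounds make
# sure each number is a complete word, like a "/\d+" word in the patterns.
EXPRESSION = re.compile(r"(?<!\S)(\d+) ([*/+-]) (\d+)(?!\S)")


def calculate(sentence: str) -> Optional[Union[int, float]]:
    """Find the first "<number> <operator> <number>" and evaluate it."""
    match = EXPRESSION.search(sentence)
    if match is None:
        return None
    left, symbol, right = int(match.group(1)), match.group(2), int(match.group(3))
    if symbol == "/" and right == 0:
        return None  # Assumption: no answer rather than an error.
    return OPERATORS[symbol](left, right)


if __name__ == "__main__":
    print(calculate("what is 6 * 7 please"))
```

Using `search` rather than a full match mirrors the ^ wildcards on either side of the Self patterns, which allow extra words before and after the expression.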

The following are regex functions in Self:

  • Utils.matches(text, regex) - returns true if the regex matches the text
    Utils.matches("12345", "\d+") == true
  • String.test(text) - returns true if the regex string matches the text
    "\d".test("hello 123") == true
  • String.exec(text) - extracts the subtext matching the regex string from the text
    "\d+".exec("hello 123") == "123"
  • String.match(regex) - returns an array of all values in the text string that match the regex
    var values = "hello 123 456".match("\d+"); values[1] == "456"

Bot Libre also defines several symbols for common regex patterns.
These include:

  • #number
  • #date
  • #email
  • #url

These symbols can be used in place of regex patterns in Self scripts and in AIML and response list patterns.

<category>
<pattern>my email is #email</pattern>
<template>Okay, I will email you at <star/></template>
</category>

state Email {
    pattern "^ #email ^" topic "email" template "Thank you, I will remember your email. { think { speaker.email = Utils.extract(sentence, #email); conversation.topic = null; } }";
    pattern "*" topic "email" template "Please enter a valid email.";
}

For more information on the regex syntax and example patterns see:

https://www.w3schools.com/jsref/jsref_obj_regexp.asp

https://regexr.com/
