Knowledge Bank https://www.youtube.com/c/BhavamCreations: Regular Expressions 6

Specifying HOW MANY Repetitions

We examine ways of specifying HOW MANY repetitions of a given pattern we want to match.

Other aspects of pattern matching will be kept constant throughout the tutorial, namely:

"WHAT" will be a single character?

"WHERE" will be from the first character in the target onward

"WHICH" occurrence will be the first occurrence?

Quantifiers, Greedy, and Non-greedy Matching

To specify how many repetitions of a given character should be matched, Oracle supports quantifiers like '*'. For example, while the 'b' in the regular expression 'abc' matches one and only one repetition of the letter 'b', an expression like 'ab*c' matches zero more repetitions of 'b', as in the targets 'ac', 'abc', or 'abbc'.
In certain situations, a pattern that uses quantifiers could conceivably match two or more substrings within a target. For example, given a target string like 'abcbcbc' the pattern 'a.*c' could match either 'abc', 'abcbc', or 'abcbcbc'. A match to the largest possible result string, 'abcbcbc', is called a "greedy" match. A match to the smallest possible result string, 'abc', is called a "non-greedy" match. POSIX regular expressions only support greedy matching. Fortunately, Oracle supports some PERL influenced extensions to the POSIX specification that support non-greedy matching. To specify a non-greedy match a '?' character is appended to the POSIX quantifier.
The following table summarizes the available quantifiers and indicates whether they are greedy or not.

Quantifier	Meaning	Greediness
?	zero or one occurrence	greedy
??	zero or one occurrence	non-greedy
*	zero or more occurrences	greedy
*?	zero or more occurrences	non-greedy
+	one or more occurrences	greedy
+?	one or more occurrences	non-greedy
{m}	exactly m occurrences	greedy
{m}?	exactly m occurrences	non-greedy
{m,n}	at least m, but not more than n occurrences	greedy
{m,n}?	at least m, but not more than n occurrences	non-greedy
{m,}	m or more occurrences	greedy
{m,}?	m or more occurrences	non-greedy

"There Can Be Only One"
To match one and only one instance of a character, simply use the character itself in the regular expression pattern.

execute set_pattern( 'b' )

execute set_target( 'aaa' )

execute add_target( 'abc' )

execute add_target( 'abbc' )

execute add_target( 'acc' )

execute add_target( 'bbb' )

execute add_target( 'bbbb' )

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
b	aaa	N		0
b	abbc	Y	b	2
b	abc	Y	b	2
b	acc	N		0
b	bbb	Y	b	1
b	bbbb	Y	b	1

Zero or One
To match zero or one repetitions of a character use the '?' quantifier
For example, to match zero or one instances of the letter 'b' use 'b?'.

execute set_pattern( 'b?' )

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
b?	aaa	Y		1
b?	abbc	Y		1
b?	abc	Y		1
b?	acc	Y		1
b?	bbb	Y	b	1
b?	bbbb	Y	b	1

These results may seem strange at first.
For example, one might expect the MATCHED_VALUE for target 'abc' to be 'b', not null. To understand why this is not the case we first have to understand a few underlying principles.
"zero" instances of a character is equivalent to a single empty string
all target strings are considered to have empty strings within them; e.g. 'abc' is treated like this
(empty string) a (empty string) b (empty string) c (empty string)
Oracle displays empty strings as nulls.
Now let's consider the results. The pattern 'b?' is supposed to match zero or one occurrence of 'b', which is equivalent to saying it matches a single empty string or the character 'b'. When Oracle examines the target 'abc' for a match, it scans the equivalent string
(empty string) a (empty string) b (empty string) c (empty string)
and finds that the first empty string is a match for the pattern. It therefore reports a successful match at location 1 and returns the empty string as the result (displayed as '(null)').
The same thing happens with targets 'aaa', 'abbc', and 'acc'. Oracle never finds the 'b' at position 2 of 'abbc' because it stops looking when it finds the empty string at the beginning of the target.
For the target 'bbb', Oracle could either match the first empty string in the target or the 'b' at position 1. Since 'b?' is a greedy pattern and since 'b' is a larger result than an empty string, Oracle returns the letter 'b' as the match result.
For another perspective on empty strings, consider the following example where all empty strings in the target 'abc' are replaced with an 'X' to disclose their locations.
select 'abc' as target , regexp_replace( 'abc', 'd?', 'X' ) as replaced_target from dual;

TARGET REPLACED_TARGET

---------- -----------------------------------

abc XaXbXcX

In the WHERE to Look tutorial in this section we will see how to use the '^' and '$' characters to match the empty strings at the beginning and end, respectively, of a target.
While trivial patterns such as 'b?' are of little use, non-trivial patterns such as 'a.?c' can be very useful and produce clearer results than 'b?'.

execute set_pattern( 'a.?c' )

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
a.?c	aaa	N		0
a.?c	abbc	N		0
a.?c	abc	Y	abc	1
a.?c	acc	Y	acc	1
a.?c	bbb	N		0
a.?c	bbbb	N		0

Non-greedy matches of zero or more occurrences are specified with the '??' quantifier.

execute set_pattern( 'a.??c' )

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
a.??c	aaa	N		0
a.??c	abbc	N		0
a.??c	abc	Y	abc	1
a.??c	acc	Y	ac	1
a.??c	bbb	N		0
a.??c	bbbb	N		0

Zero or More

A zero or more greedy match is specified with the '*' quantifier, its non-greedy counterpart is '*?'.

execute set_pattern( 'a.*c' ) ;

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
a.*c	Aaa	N		0
a.*c	Abbc	Y	abbc	1
a.*c	Abc	Y	abc	1
a.*c	Acc	Y	acc	1
a.*c	Bbb	N		0
a.*c	Bbbb	N		0

execute set_pattern( 'a.*?c' ) ;

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
a.*?c	aaa	N		0
a.*?c	abbc	Y	abbc	1
a.*?c	abc	Y	abc	1
a.*?c	acc	Y	ac	1
a.*?c	bbb	N		0
a.*?c	bbbb	N		0

One or More
A one or more greedy match is specified with the '+' quantifier, its non-greedy counterpart is '+?'.

execute set_pattern( 'b+' ) ;

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
b+	aaa	N		0
b+	abbc	Y	bb	2
b+	abc	Y	b	2
b+	acc	N		0
b+	bbb	Y	bbb	1
b+	bbbb	Y	bbbb	1

execute set_pattern( 'b+?' ) ;

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
b+?	Aaa	N		0
b+?	Abbc	Y	b	2
b+?	Abc	Y	b	2
b+?	Acc	N		0
b+?	Bbb	Y	b	1
b+?	Bbbb	Y	b	1

Enumerated Occurrences
To specify an exact number or a numeric range of occurrences use the '{}' quantifiers.

execute set_pattern( 'b{2}' )

execute add_pattern( 'b{2}?' )

execute add_pattern( 'b{2,3}' )

execute add_pattern( 'b{2,3}?' )

execute add_pattern( 'b{2,}' )

execute add_pattern( 'b{2,}?' )

set pagesize 39

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
b{2}	aaa	N		0
b{2}	abbc	Y	bb	2
b{2}	abc	N		0
b{2}	acc	N		0
b{2}	bbb	Y	bb	1
b{2}	bbbb	Y	bb	1
b{2}?	aaa	N		0
b{2}?	abbc	Y	bb	2
b{2}?	abc	N		0
b{2}?	acc	N		0
b{2}?	bbb	Y	bb	1
b{2}?	bbbb	Y	bb	1
b{2,3}	aaa	N		0
b{2,3}	abbc	Y	bb	2
b{2,3}	abc	N		0
b{2,3}	acc	N		0
b{2,3}	bbb	Y	bbb	1
b{2,3}	bbbb	Y	bbb	1
b{2,3}?	aaa	N		0
b{2,3}?	abbc	Y	bb	2
b{2,3}?	abc	N		0
b{2,3}?	acc	N		0
b{2,3}?	bbb	Y	bb	1
b{2,3}?	bbbb	Y	bb	1
b{2,}	aaa	N		0
b{2,}	abbc	Y	bb	2
b{2,}	abc	N		0
b{2,}	acc	N		0
b{2,}	bbb	Y	bbb	1
b{2,}	bbbb	Y	bbbb	1
b{2,}?	aaa	N		0
b{2,}?	abbc	Y	bb	2
b{2,}?	abc	N		0
b{2,}?	acc	N		0
b{2,}?	bbb	Y	bb	1
b{2,}?	bbbb	Y	bb	1

Note that the non-greedy patterns 'b{2}?', 'b{2,3}?', and 'b{2,}?' are of little practical use since they all yield the same result set as 'b{2}'.
Repetitions of Strings with Two or More Characters
In the previous examples we saw how to apply quantifiers to a single character. To apply them to a string of characters simply enclose the string with '(' and ')' and apply the quantifier after the ')'.

execute set_pattern( '(ab)*' )

execute add_pattern( '(ab)+' )

execute add_pattern( '(ab){2}' )

execute set_target( 'a' )

execute add_target( 'ab' )

execute add_target( 'abab' )

execute add_target( 'ababab' )

select * from test_results ;

PATTERN	TARGET	MATCH	MATCHED_VALUE	POSITION
(ab)*	a	Y		1
(ab)*	ab	Y	ab	1
(ab)*	abab	Y	abab	1
(ab)*	ababab	Y	ababab	1
(ab)+	a	N		0
(ab)+	ab	Y	ab	1
(ab)+	abab	Y	abab	1
(ab)+	ababab	Y	ababab	1
(ab){2}	a	N		0
(ab){2}	ab	N		0
(ab){2}	abab	Y	abab	1
(ab){2}	ababab	Y	abab	1

Specifying WHERE to Look

In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.

Other aspects of pattern matching will be kept constant throughout the tutorial, namely:

"WHAT" will be a single character

"HOW MANY" will be one repetition of the text

"WHICH" occurrence will be the first occurrence.

Regular Expressions versus LIKE
Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.

execute set_target( 'abc' )

execute add_target( '---abc' )

execute add_target( '---abc---' )

select 'abc' as pattern, target from targets where target like 'abc';

PATTERN TARGET

--------------- ----------

abc abc

select '%abc%' as pattern, target from targets where target like '%abc%';

PATTERN TARGET

--------------- ----------

%abc% abc

---abc

---abc---

Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.
select 'abc' as pattern, target from targets where regexp_like( target, 'abc' );

PATTERN TARGET

--------------- ----------

abc abc

---abc

---abc---

select '^abc$' as pattern, target from targets where regexp_like( target, '^abc$' );

PATTERN TARGET

--------------- ----------

^abc$ abc

Matching the Beginning or End of a Line
As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.
(empty string) a (empty string) b (empty string) c (empty string)
A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.
(empty string) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f (empty string)
This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.
By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.
Default Mode:
(^) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f ($)
Multi-line Mode:

(^) a (empty string) b (empty string) c ($) (LF)

(^) d (empty string) e (empty string) f ($)

The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.

execute set_pattern( '^' )

execute add_pattern( '$' )

execute add_pattern( '^.' )

execute add_pattern( '.$' )

execute add_pattern( 'a' )

execute add_pattern( '^a' )

execute add_pattern( '^d' )

execute add_pattern( 'c$' )

execute add_pattern( 'f' )

execute add_pattern( 'f$' )

execute set_target( 'abcdef' )

execute add_target( 'abc' || unistr('\000A') || 'def' )

set pagesize 60

set recsep each

break on pattern

select pattern , target, regexp_substr( target, pattern ) as matched_value ,

regexp_instr( target, pattern ) as position

from patterns , targets

order by sort_key , target desc;

PATTERN	TARGET	MATCHED_VALUE	POSITION
^	abcdef		1
^	abc def		1
$	abcdef		7
$	abc def		8
^.	abcdef	a	1
^.	abc def	a	1
.$	abcdef	f	6
.$	abc def	f	7
a	abcdef	a	1
a	abc def	a	1
^a	abcdef	a	1
^a	abc def	a	1
^d	abcdef		0
^d	abc def		0
c$	abcdef		0
c$	abc def		0
f	abcdef	f	6
f	abc def	f	7
f$	abcdef	f	6
f$	abc def	f	7

Using the same patterns in multi-line mode gives us these results.

Select pattern , target , regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value ,

regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position

from patterns , targets

order by sort_key , target desc;

PATTERN	TARGET	MATCHED_VALUE	POSITION
^	abcdef		1
^	abc def		1
$	abcdef		7
$	abc def		4
^.	abcdef	a	1
^.	abc def	a	1
.$	abcdef	f	6
.$	abc def	c	3
a	abcdef	a	1
a	abc def	a	1
^a	abcdef	a	1
^a	abc def	a	1
^d	abcdef		0
^d	abc def	d	5
c$	abcdef		0
c$	abc def	c	3
f	abcdef	f	6
f	abc def	f	7
f$	abcdef	f	6
f$	abc def	f	7

set recsep wrapped
break on pattern skip 1
For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.
Starting the Search From Somewhere Other Than Position One
By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.
Starting from position 1 (no position parameter is used):

execute set_target( 'abc1--abc2' )

select target , 'abc.' as pattern , regexp_substr( target, 'abc.' ) as matched_value ,

regexp_instr( target, 'abc.' ) as position

from targets;

TARGET PATTERN MATCHED_VALUE POSITION

---------- --------------- --------------- --------

abc1--abc2 abc. abc1 1

Starting from position 5 (position parameter is set to "5"):

Select target , 'abc.' as pattern , regexp_substr( target, 'abc.', 5 ) as matched_value ,

regexp_instr( target, 'abc.', 5 ) as position

from targets;

TARGET PATTERN MATCHED_VALUE POSITION

---------- --------------- --------------- --------

abc1--abc2 abc. abc2 7

Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.
Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.

execute set_target( 'abc' )

execute add_target( 'abc---def' )

execute add_target( 'def---abc' )

select target from targets where regexp_like( target, '^.{4}.*abc' ) ;

TARGET

----------

def---abc

select target

from targets

where regexp_like( substr( target, 5 ), 'abc' ) ;

TARGET

----------

def---abc

Specifying WHICH Occurrence

In this tutorial we examine ways of specifying WHICH occurrence of a pattern we wish to match.

Other aspects of pattern matching will be kept constant throughout the tutorial, namely:

"WHAT" will be a single character

"HOW MANY" will be one repetition of the text

"WHERE" will be from the first character in the target onward

REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR

By default a regular expression always matches the first occurrence of a pattern. To match the second, third, etc. occurrence of a pattern the REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR functions accept an "occurrence" parameter.
This example shows the default behavior (matching the first occurrence) when no occurrence parameter is specified.

execute set_pattern( 'ab.' )

execute set_target( 'ab1-ab2' )

execute add_target( 'ab1-ab2-ab3-ab4' )

execute add_target( 'def-def-def' )

execute add_target( 'def-def-def-def' )

column target format a15

select 'ab.' as pattern , target , regexp_substr( target, 'ab.' ) as matched_value ,

regexp_instr( target, 'ab.' ) as position

from targets;

PATTERN TARGET MATCHED_VALUE POSITION

--------------- --------------- --------------- --------

ab. ab1-ab2 ab1 1

ab1-ab2-ab3-ab4 ab1 1

def-def-def (null) 0

def-def-def-def (null) 0

This example shows how to match the third occurrence of a pattern. The same technique can be applied to any number of occurrences.

select 'ab.' as pattern , target , regexp_substr( target, 'ab.', 1, 3 ) as matched_value ,

regexp_instr( target, 'ab.', 1, 3 ) as position

from targets;

PATTERN TARGET MATCHED_VALUE POSITION

--------------- --------------- --------------- --------

ab. ab1-ab2 (null) 0

ab1-ab2-ab3-ab4 ab3 9

def-def-def (null) 0

def-def-def-def (null) 0

REGEXP_LIKE
Unlike the three REGEXP functions we just examined, REGEXP_LIKE does not accept an occurrence parameter. The following examples show ways of working around this.
This example searches for targets with *exactly* three occurrences of 'def'

select target from targets where regexp_like( target, '(def.*){3}' ) and not regexp_like( target, '(def.*){4}') ;

TARGET

---------------

def-def-def

This example searches for targets with *at least* three occurrences of 'def'
select target from targets where regexp_like( target, '(def.*){3}' ) ;

TARGET

---------------

def-def-def

def-def-def-def

This example searches for targets with at least three occurrences of 'ab.'
select target from targets where regexp_like( target, '(ab..*){3}' ) ;

TARGET

---------------

ab1-ab2-ab3-ab4

Of course, some of the requirements above could also be met without using REGEXP_LIKE by simply using a REGEXP function, as in the following examples.

Targets with at least three occurrences of 'def':
select target from targets where regexp_instr( target, 'def', 1, 3 ) != 0 ;

TARGET

---------------

def-def-def

def-def-def-def

Targets with at least three occurrences of 'ab.'

select target from targets where regexp_instr( target, 'ab.', 1, 3 ) != 0 ;

TARGET

---------------

ab1-ab2-ab3-ab4

Cleanup

Run the code on this page to drop the sample tables, procedures, etc. created in earlier parts of this section. To clear session state changes (e.g. those made by SET, COLUMN, and VARIABLE commands) exit your SQL*Plus session after running these cleanup commands.

drop procedure add_target ;

drop procedure set_target ;

drop procedure add_pattern ;

drop procedure set_pattern ;

drop view test_results ;

drop table targets ;

drop table patterns ;

drop table character_class_map ;

exit

Discover Millions of Products your daily Need

Amazon::::Discover Millions of Products your daily Need

Amway Products in Hyderabad. All India Distributors (Certified Dietitian / Nutrition Advisor )

Amway Products in Kakinada. All India Distributors ( Certified Dietitian / Nutrition Advisor)

Thursday, July 23, 2015

Regular Expressions 6

Specifying HOW MANY Repetitions

Specifying WHERE to Look

Specifying WHICH Occurrence

Cleanup

No comments :

Post a Comment