Friday, November 2, 2012

Regular Expressions 1

Oracle introduced regular expression functions in 10g version.
REGEXP_INSTR, REGEXP_REPLACE and REGEXP_SUBSTR. These functions are similar in functionality that of standard string functions INSTR, REPLACE and SUBSTR respectively but with more powerful capabilities of pattern matching. Using regular expression functions, we can search, validate, remove unwanted text and/or replace characters from the string. It can be used for any character data including CLOB and NCLOB as well.
Introduced in Oracle 11g the REGEXP_COUNT function greatly simplifies counting the number of times a pattern appears inside a string. The following examples demonstrate how to use REGEXP_COUNT with some simple patterns. For versions prior to 11g alternative logic that provides similar functionality is also presented.
REGEXP_COUNT functions returns the number of occurrences of the pattern in the source string. It takes four arguments.
         Source string: It is the source string from which we are going to find the occurrence of the pattern. This can be CLOB or NCLOB characters as well.
         Pattern: Is any string, pattern or text which will be found in the source string. Data type of pattern will be converted to data type of source string, if it is different than the source string.
         Position: It is a non-zero positive value from which search should begin.
         Matching parameter: which controls the behavior of the search? It takes one or more value from.
o   I (ignore case)
o   c – Case Sensitive
o   m – Allows source string as multiple lines
o   n – Allows period.
o   x – Ignore white space characters
Now let us start with the some simple example. In following example, we are counting occurrence of ‘i’ in the source string. We are ignoring case in matching parameter.

Setup Regular Expression Data

Before proceeding however, the following code will set up some test data for us (see Setup if you want to create the SET_TARGET and ADD_TARGET procedures on your own test system).
Run the code on this page in SQL*Plus to create the sample tables, data, etc. used by the examples in this section.
Be sure to read Using SQL Snippets ™ before executing any of these setup steps.
=====================================================================
Setup for pattern matching examples
=====================================================================
create table targets ( target varchar2(15) );
create procedure set_target( p_target varchar2 ) as
begin
  delete from targets ;
  insert into targets values ( p_target ) ;
end;
/

create procedure add_target( p_target varchar2 ) as
begin
  insert into targets values ( p_target );
end;
/
create table patterns ( pattern varchar2(25), sort_key number );
create procedure set_pattern( p_pattern varchar2 ) as
begin
  delete from patterns;
  insert into patterns values ( p_pattern, 1 );
end;
/

create procedure add_pattern( p_pattern varchar2 ) as
begin
  insert into patterns
  values
  ( p_pattern ,
    ( select count(*) + 1 from patterns )
  );
end;
/

create view test_results as
select 
pattern , target,  case  when regexp_like( target, pattern ) then 'Y'  else 'N' end    as match,
  regexp_substr( target, pattern ) as matched_value ,
  regexp_instr( target, pattern ) as position
from   patterns,  targets
order by   patterns.sort_key,  target;

column pattern         format a15
column target          format a10
column match           format a5
column matched_value   format a15
column modifier        format a8
column replaced_target format a15
column position        format 999
column substring       format a15
break on pattern skip 1

set null     '(null)'
set linesize 100
=====================================================================
             Setup for the section on POSIX character classes            
=====================================================================
create table character_class_map
( data    varchar2(2)  ,
  symbol  varchar2(3)  ,
  dec     number       ,
  hex     varchar2(4)  ,
  upper   varchar2(1)  ,
  lower   varchar2(1)  ,
  alpha   varchar2(1)  ,
  digit   varchar2(1)  ,
  xdigit  varchar2(1)  ,
  alnum   varchar2(1)  ,
  blank   varchar2(1)  ,
  space   varchar2(1)  ,
  cntrl   varchar2(1)  ,
  punct   varchar2(1)  ,
  graph   varchar2(1)  ,
  print   varchar2(1)
);

declare
  v_data      varchar2(2) ;
  v_hex_value varchar2(4) ;
  v_symbol    varchar2(3) ;

begin

  for  i  in  0  ..  255
  loop
      v_hex_value := to_char( i, 'FM000X' ) ;
      v_data      := unistr( '\' || v_hex_value  ) ;
      v_symbol :=
        case i
        when 0   then 'NUL'  when 1   then 'SOH'  when 2   then 'STX'
        when 3   then 'ETX'  when 4   then 'EOT'  when 5   then 'ENQ'
        when 6   then 'ACK'  when 7   then 'BEL'  when 8   then 'BS'
        when 9   then 'TAB'  when 10  then 'LF'   when 11  then 'VT'
        when 12  then 'FF'   when 13  then 'CR'   when 14  then 'SO'
        when 15  then 'SI'   when 16  then 'DLE'  when 17  then 'DC1'
        when 18  then 'DC2'  when 19  then 'DC3'  when 20  then 'DC4'
        when 21  then 'NAK'  when 22  then 'SYN'  when 23  then 'ETB'
        when 24  then 'CAN'  when 25  then 'EM'   when 26  then 'SUB'
        when 27  then 'ESC'  when 28  then 'FS'   when 29  then 'GS'
        when 30  then 'RS'   when 31  then 'US'   when 127 then 'DEL'
        when 128 then 'XXX'  when 129 then 'XXX'  when 130 then 'BPH'
        when 131 then 'NBH'  when 132 then 'IND'  when 133 then 'NEL'
        when 134 then 'SSA'  when 135 then 'ESA'  when 136 then 'HTS'
        when 137 then 'HTJ'  when 138 then 'VTS'  when 139 then 'PLD'
        when 140 then 'PLU'  when 141 then 'RI'   when 142 then 'SS2'
        when 143 then 'SS3'  when 144 then 'DCS'  when 145 then 'PU1'
        when 146 then 'PU2'  when 147 then 'STS'  when 148 then 'CCH'
        when 149 then 'MW'   when 150 then 'SPA'  when 151 then 'EPA'
        when 152 then 'SOS'  when 153 then 'XXX'  when 154 then 'SCI'
        when 155 then 'CSI'  when 156 then 'ST'   when 157 then 'OSC'
        when 158 then 'PM'   when 159 then 'APC'  when 173 then 'SHY'
        else v_data
        end
      ;
insert into character_class_map (data , symbol , dec , hex )
values  ( v_data , v_symbol , i , v_hex_value );

  end loop;
end;
/

update character_class_map set upper = 'Y' where regexp_like( data, '[[:upper:]]' );
update character_class_map set lower = 'Y' where regexp_like( data, '[[:lower:]]' );
update character_class_map set alpha = 'Y' where regexp_like( data, '[[:alpha:]]' );
update character_class_map set digit = 'Y' where regexp_like( data, '[[:digit:]]' );
update character_class_map set xdigit= 'Y' where regexp_like( data, '[[:xdigit:]]');
update character_class_map set alnum = 'Y' where regexp_like( data, '[[:alnum:]]' );
update character_class_map set blank = 'Y' where regexp_like( data, '[[:blank:]]' );
update character_class_map set space = 'Y' where regexp_like( data, '[[:space:]]' );
update character_class_map set cntrl = 'Y' where regexp_like( data, '[[:cntrl:]]' );
update character_class_map set punct = 'Y' where regexp_like( data, '[[:punct:]]' );
update character_class_map set graph = 'Y' where regexp_like( data, '[[:graph:]]' );
update character_class_map set print = 'Y' where regexp_like( data, '[[:print:]]' );
commit;

column symbol format  a6
column dec    format  999
column hex    format  a4
column upper  format  a5
column lower  format  a5
column alpha  format  a5
column digit  format  a5
column xdigit format  a6
column alnum  format  a5
column blank  format  a5
column space  format  a5
column cntrl  format  a5
column punct  format  a5
column graph  format  a5
column print  format  a5

execute set_target( 'aaa'        )
execute add_target( 'a b c'      )
execute add_target( 'b  b  b  b' )
execute add_target( '  bbb xx  ' )
execute add_target( 'wdef def w' )
execute add_target( 'defxdefdef' )
execute add_target( null         )
set null "(null)"
column target        heading "TARGET|STRING"
column pattern_count heading "PATTERN|COUNT"

REGEXP_COUNT

SQL> SELECT REGEXP_COUNT(‘Decipher Information System’,’i’,1,’i’) FROM DUAL;
REGEXP_COUNT(‘DECIPHERINFORMATIONSYSTEM’,’I’,1,’I’)
—————————————————
3

There are 3 incidents of the ‘i’ in the string. If we change the match parameter to ‘c’, occurrence will be only 2.
SQL> SELECT REGEXP_COUNT('Decipher Information System','i',1,'c') FROM DUAL;
REGEXP_COUNT(‘DECIPHERINFORMATIONSYSTEM’,’I’,1,’C’)
—————————————————
2

Following is the example, where we are searching for pattern starting from the position 3.
SQL> SELECT REGEXP_COUNT('Cat in the hat is on the bat.','at',4) FROM DUAL;
REGEXP_COUNT(‘CATINTHEHATISONTHEBAT.’,’AT’,4)
———————————————
2

This function really helps to identify number of occurrences of the specific pattern in the source string. It reduces the long PL/SQL code to identify number of occurrences into one single function call.
We can start counting from a specified position as shown under Select REGEXP_COUNT ('We are now learning the REGEXP_COUNT function of Oracle 11g.more at www.beyondrelational.com', 'r',10) As Result From Dual;
We have started counting from position 10 till the end of the string. That means the operation will happen inside 'learning the REGEXP_COUNT function of Oracle 11g' test and hence 3 matches found. So the count is 3.
We can even make it case sensitive or insensitive by specifying the Match Option as under
Select  REGEXP_COUNT ('We are now learning the regexp_count function of Oracle 11g.more at www.beyondrelational.com', 'R',10) As CaseSensitiveResult
,REGEXP_COUNT ('We are now learning the regexp_count function of Oracle 11g.more at www.beyondrelational.com', 'R',10,'i') As CaseInSensitiveResult
From Dual;
We can even find occurrences of multiple characters. We can identify digits using "\d" or "[0-9]" and the "{4}" operator signifies 4 of them, so using "\d{4}" or "[0-9]{4}" with the REGEXP_COUNT function seems to be a valid option.
WITH T1 AS
(SELECT '1234' DATA FROM DUAL
UNION ALL
SELECT '1234 1234' FROM DUAL
UNION ALL
SELECT '1234 1234 1234' FROM DUAL
)SELECT REGEXP_COUNT(data, '[0-9]{4}') AS pattern_count_1,
REGEXP_COUNT(data, '\d{4}') AS pattern_count_2 FROM T1
Count String Pattern Occurrences (Oracle 11g)
To count the occurrences of the letter "b" in the target strings we can use a query like this.
Select   target ,   REGEXP_COUNT( TARGET, 'b' ) as pattern_count  from   targets ;
TARGET
PATTERN_COUNT
  bbb xx 
3
a b c
1
aaa
0
b  b  b  b
4
defxdefdef
0
wdef def w
0
 (null)    
 (null)    

This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
Select   target ,   REGEXP_COUNT( TARGET, '[acf]' ) as pattern_count  from  targets ;
TARGET
PATTERN_COUNT
aaa
3
a b c
2
b  b  b  b
0
  bbb xx 
0
wdef def w
2
defxdefdef
3
 (null)    
 (null)    

To calculate the number of times a string like "def" occurs we use a query like this.
Select   target ,   REGEXP_COUNT( TARGET, 'def' ) as pattern_count from  targets; 
TARGET
PATTERN_COUNT
aaa
0
a b c
0
b  b  b  b
0
  bbb xx 
0
wdef def w
2
defxdefdef
3
 (null)    
 (null)    
To find the number of words in a string we can use a query like this (for our purposes we define a word as any series of characters that are not a space).

Select   target ,  REGEXP_COUNT( TARGET, '[^ ]+' ) as pattern_count from   targets; 
TARGET
PATTERN_COUNT
aaa
1
a b c
3
b  b  b  b
4
  bbb xx 
2
wdef def w
3
defxdefdef
1
 (null)    
 (null)    
To ensure PATTERN_COUNT always returns a non-null value use NVL.
Select   target ,  NVL( regexp_count( target, '[^ ]+' ), 0 ) as pattern_count from  targets; 
TARGET
PATTERN_COUNT
aaa
1
a b c
3
b  b  b  b
4
  bbb xx 
2
wdef def w
3
defxdefdef
1
  (null)    
0
While Oracle 10g supports most of the regular expression features described in this section it unfortunately does not support REGEXP_COUNT. However, we can use the LENGTH and REGEXP_REPLACE functions to produce similar results.
To count the occurrences of the letter "b" in the target strings we can use a query like this (the column INTERMEDIATE_STRING is included to demonstrate the inner workings of each solution).
column intermediate_string format a12 heading "INTERMEDIATE|STRING"
select   target ,   regexp_replace( target, '[^b]', null )  as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^b]', NULL ) ), 0 ) as pattern_count
From   targets; 
TARGET
INTERMEDIATE_STRING
PATTERN_COUNT
aaa
  (null)    
0
a b c
b
1
b  b  b  b
bbbb
4
  bbb xx 
bbb
3
wdef def w
  (null)    
0
defxdefdef
  (null)    
0
  (null)    
  (null)    
0
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
Select   target ,  regexp_replace( target, '[^acf]', null )                     as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '[^acf]', NULL ) ), 0 ) as pattern_count
from  targets; 

TARGET
INTERMEDIATE_STRING
PATTERN_COUNT
aaa
aaa
3
a b c
ac
2
b  b  b  b
  (null)    
0
  bbb xx 
  (null)    
0
wdef def w
ff
2
defxdefdef
fff
3
  (null)    
  (null)    
0
To calculate the number of times a string like "def" occurs we use a query like this.
select  target ,   regexp_replace( target, '(d)ef|.', '\1' )  as intermediate_string ,
  NVL( LENGTH( REGEXP_REPLACE( TARGET, '(d)ef|.', '\1' ) ), 0 ) as pattern_count
From   targets; 

TARGET
INTERMEDIATE_STRING
PATTERN_COUNT
aaa
(null)               
0
a b c
(null)               
0
b  b  b  b
(null)               
0
  bbb xx 
(null)               
0
wdef def w
dd
2
defxdefdef
ddd
3
(null)               
(null)               
0
 (I found this last technique in this OTN Discussion Forum thread by cd after reading this blog post by Volder.)
To find the number of words in a string we can use a query like this.
Select   target ,  regexp_replace( target, ' *[^ ]+ *', 'w' )   as intermediate_string ,
NVL( LENGTH( REGEXP_REPLACE( TARGET, ' *[^ ]+ *', 'w' ) ), 0 ) as pattern_count
From   targets;

TARGET
INTERMEDIATE_STRING
PATTERN_COUNT
aaa
w
1
a b c
www
3
b  b  b  b
wwww
4
  bbb xx 
ww
2
wdef def w
www
3
defxdefdef
w
1
(null)               
 (null)               
0
Count String Pattern Occurrences (Any Oracle Version)
In any Oracle version we can use the LENGTH and REPLACE functions to count patterns, though the logic is slightly more complex than the techniques above.
To count the occurrences of the letter "b" in the target strings we can use a query like this.
column target_length       heading "TARGET|LENGTH"
column intermediate_length heading "INTERMEDIATE|LENGTH"
select  target ,  replace( target, 'b', null )                     as intermediate_string ,
  nvl( length( target ), 0 )                       as target_length ,
  nvl( length( replace( target, 'b', null ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TARGET ), 0 )    - NVL( LENGTH( REPLACE( TARGET, 'b', NULL ) ), 0 )
    as pattern_count
from   targets; 

TARGET
INTERMEDIATE_STRING
TARGET_LENGTH
INTERMEDIATE_LENGTH
PATTERN_COUNT
aaa
aaa
3
3
0
a b c
a  c
5
4
1
b  b  b  b
     
10
6
4
  bbb xx 
   xx 
10
7
3
wdef def w
wdef def w
10
10
0
defxdefdef
defxdefdef
10
10
0
 (null)               
 (null)               
0
0
0
(The INTERMEDIATE_STRING value in the third row is six spaces. In the fourth row it is three spaces followed by "xx" followed by two spaces.)  
This query finds the total occurrences of the letters "a", "c", or "f" in the target strings.
Select target, translate(target, 'xacf', 'x')  as intermediate_string,  nvl(length( target ),0)  as target_length , nvl( length( translate( target, 'xacf', 'x' ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TARGET ), 0 ) - NVL( LENGTH( TRANSLATE( TARGET, 'xacf', 'x' ) ), 0 )
    as pattern_count
from   targets;

TARGET
INTERMEDIATE_STRING
TARGET_LENGTH
INTERMEDIATE_LENGTH
PATTERN_COUNT
aaa
  (null)                
3
0
3
a b c
 b
5
3
2
b  b  b  b
b  b  b  b
10
10
0
  bbb xx 
  bbb xx 
10
10
0
wdef def w
wde de w
10
8
2
defxdefdef
dexdede
10
7
3
  (null)               
  (null)               
0
0
0
To calculate the number of times a string like "def" occurs we use a query like this.
column intermediate_string_1 format a12 heading "INTERMEDIATE|STRING 1"
column intermediate_string_2 format a12 heading "INTERMEDIATE|STRING 2"

select   target, replace( target, 'def', 'd' )  as intermediate_string_1 ,  replace (target,'def', null) as intermediate_string_2,  nvl(length(replace(target,'def','d' )),0) as length_1 ,  nvl( length( replace( target, 'def', null ) ), 0 ) as length_2 , NVL( LENGTH( REPLACE( TARGET, 'def', 'd' ) ), 0 ) - NVL( LENGTH( REPLACE( TARGET, 'def', NULL ) ), 0 )   as pattern_count  
from   targets;
TARGET
INTERMEDIATE_STRING_1
INTERMEDIATE_STRING_2
LENGTH_1
LENGTH_2
PATTERN_COUNT
aaa
aaa
aaa
3
3
0
a b c
a b c
a b c
5
5
0
b  b  b  b
b  b  b  b
b  b  b  b
10
10
0
  bbb xx 
  bbb xx 
  bbb xx 
10
10
0
wdef def w
wd d w
w  w
6
4
2
defxdefdef
dxdd
x
4
1
3
   (null)               
   (null)               
   (null)               
0
0
0
To find the number of words in a string we can use a query like this. Note the query will not work for target strings containing more than one space between words.
column trimmed_length_plus_1 heading "TRIMMED LENGTH|PLUS 1"
select   target ,  trimmed ,  replace( trimmed, ' ', null )                     as intermediate_string ,
  nvl( length( trimmed ) + 1, 0 )                   as trimmed_length_plus_1 ,
  nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
  NVL( LENGTH( TRIMMED ) + 1, 0 )  - NVL( LENGTH( REPLACE( TRIMMED, ' ', NULL ) ), 0 )    as pattern_count
from  ( select target, ltrim( rtrim(target) ) as trimmed from targets )
where  nvl( ltrim( rtrim( target ) ), 'x' ) not like '%  %' ;

TARGET
TRIMMED
INTERMEDIATE_STRING
TRIMMED_LENGTH_PLUS_1
INTERMEDIATE_LENGTH
PATTERN_COUNT
aaa
aaa
aaa
4
3
1
a b c
a b c
abc
6
3
3
  bbb xx 
bbb xx
bbbxx
7
5
2
wdef def w
wdef def w
wdefdefw
11
8
3
defxdefdef
defxdefdef
defxdefdef
11
10
1
 (null)               
 (null)               
 (null)               
0
0
0
In Oracle 8i or later the LTRM and RTRIM functions can be replaced with a single TRIM call.

select  target ,  trimmed ,  replace( trimmed, ' ', null )                     as intermediate_string ,
  nvl( length( trimmed ) + 1, 0 )                   as trimmed_length_plus_1 ,
  nvl( length( replace( trimmed, ' ', null ) ), 0 ) as intermediate_length ,
  nvl( length( trimmed ) + 1, 0 )  - nvl( length( replace( trimmed, ' ', null ) ), 0 )     as pattern_count
from  ( select target, TRIM( target ) as trimmed from targets )
Where   nvl( TRIM( target ), 'x' ) not like '%  %' ;
 TARGET
TRIMMED
INTERMEDIATE_STRING
TRIMMED_LENGTH_PLUS_1
INTERMEDIATE_LENGTH
PATTERN_COUNT
aaa
aaa
aaa
4
3
1
a b c
a b c
abc
6
3
3
  bbb xx 
bbb xx
bbbxx
7
5
2
wdef def w
wdef def w
wdefdefw
11
8
3
defxdefdef
defxdefdef
defxdefdef
11
10
1
  (null)               
  (null)               
  (null)               
0
0
0

No comments :

Post a Comment