Monday, October 8, 2012

How To Tokenize Or Parse A String In PLSQL

Tokenize or parse string PLSQLIf you have worked with SQL and PL/SQL for a while, you have probably been in a situation where you have a string that is delimited with a character, and you want the chunks of the string by themselves… In this tutorial we are going to look at a few ways to do this.

Introduction

In my previous post ( Quick Tip: Slicing A Long PLSQL String Into Smaller Pieces ), I kind of touched something similar, but there was no logic in this code that handled any delimiters, etc. We just brutally cut the string into pieces.
So, in this tutorial we are actually going to show how to pick out the pieces more carefully. Throughout the examples we are going to use a collection, SQL, the REGEXPR_SUBSTR function, SUBSTR, INSTR, and a PL/SQL Collection.
Ok, let’s get down to some business…

Example 1: Using SQL and REGEXP_SUBSTR

In this example we are only going to look at how to tokenize ONE single string, since that is the purpose of this tutorial. If you would like to see an example using REGEXPR_SUBSTR handling multiple records, please take a look at Alex Nuijten’s post found here.
For this example, here is the code:
1SQL> set serveroutput on
2SQL>
3SQL> DECLARE
4  2     CURSOR cur IS
5  3          WITH qry AS (
6  4                          SELECT 'Paris#London#Rome#Oslo#Amsterdam#New York' city_string
7  5                          FROM   dual
8  6          )
9  7          SELECT regexp_substr (city_string, '[^#]+', 1, ROWNUM) city
10  8          FROM   qry
11  9          CONNECT BY LEVEL <= LENGTH(regexp_replace (city_string, '[^#]+')) + 1
12 10          ;
13 11
14 12  BEGIN
15 13     FOR rec IN cur LOOP
16 14          dbms_output.put_line('City:' || rec.city);
17 15      END LOOP;
18 16  END;
19 17  /
20
21City:Paris
22City:London
23City:Rome
24City:Oslo
25City:Amsterdam
26City:New York
27
28PL/SQL procedure successfully completed
Ok, so then we’re done with the SQL part. Let’s jump over to PL/SQL code to do some cool stuff.

Example 2: Using a PL/SQL Collection

In this example we are going to tokenize our string and put it into a PL/SQL collection for later use.
1SQL> DECLARE
2  2
3  3     TYPE typTokenTab IS TABLE OF VARCHAR(20) INDEX BY BINARY_INTEGER;
4  4     vglCityString   VARCHAR2(200);
5  5
6  6     tglTokenTab     typTokenTab;
7  7
8  8     --  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9  9     --  Function: creTokenList
10 10      --
11 11      -- This function takes a string with "tokens" delimited by pDelimiter
12 12      -- and put each "token" into a separate record in a PL/SQL collection. The
13 13      -- PL/SQL collection is returned back to the caller of the function.
14 14     --  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15 15     FUNCTION creTokenList(pLine IN VARCHAR2, pDelimiter IN VARCHAR2) RETURN typTokenTab IS
16 16         sLine       VARCHAR2(2000);
17 17         nPos        INTEGER;
18 18         nPosOld     INTEGER;
19 19          nIndex     INTEGER;
20 20          nLength        INTEGER;
21 21          nCnt       INTEGER;
22 22          sToken     VARCHAR2(200);
23 23          tTokenTab  typTokenTab;
24 24      BEGIN
25 25         sLine := pLine;
26 26         IF (SUBSTR(sLine, LENGTH(sLine), 1) <> '|') THEN
27 27             sLine := sLine || '|';
28 28         END IF;
29 29
30 30         nPos := 0;
31 31          sToken := '';
32 32          nLength := LENGTH(sLine);
33 33          nCnt := 0;
34 34
35 35          FOR nIndex IN 1..nLength LOOP
36 36             IF ((SUBSTR(sLine, nIndex, 1) = pDelimiter) OR (nIndex = nLength)) THEN
37 37                 nPosOld := nPos;
38 38                 nPos := nIndex;
39 39                 nCnt := nCnt + 1;
40 40                 sToken := SUBSTR(sLine, nPosOld + 1, nPos - nPosOld - 1);
41 41
42 42                 tTokenTab(nCnt) := sToken;
43 43              END IF;
44 44
45 45          END LOOP;
46 46
47 47          RETURN tTokenTab;
48 48      END creTokenList;
49 49
50 50  BEGIN
51 51     vglCityString := 'Paris#London#Rome#Oslo#Amsterdam#New York';
52 52
53 53     tglTokenTab := creTokenList( vglCityString, '#');
54 54
55 55      FOR indx IN tglTokenTab.FIRST..tglTokenTab.LAST LOOP
56 56          dbms_output.put_line('City:' || tglTokenTab(indx));
57 57      END LOOP;
58 58
59 59     tglTokenTab.DELETE;
60 60  END;
61 61  /
62
63City:Paris
64City:London
65City:Rome
66City:Oslo
67City:Amsterdam
68City:New York
69
70PL/SQL procedure successfully completed
Remember that  when we are working with PL/SQL Collections, we need to clean up our mess. What I mean with this is that when we are playing with memory it is like cluttering up a child’s room with toys. We need to put the toys back into the box and into the closet again so that we can use the floor space for something else later. Hmm…to be technically correct we would need to put the toys in the box and give the box to the Salvation Army…but I don’t think we would have very happy children then.

Example # 3: Using INSTR and SUBSTR to get a token at a certain position

So, let us say that you have a customer notes table with a field that always follows the same structure for how the customer service reps enter the notes. Let’s say it has the following format:
;;#
That would i.e. give us something similar to this in our table:
2/5/2012;23412;Customer called reg. invoice#2/6/2012;23412;Contacted Lisa in the collection department.#2/6/2012;23412;Compliance department have decided that customer have sold products earlier on eBay. Customer’s status set to ON-HOLD. Waiting for further instruction
If we break down the notes field we will see that there are actually three different parts in the note:
  • 2/5/2012;23412;Customer called reg. invoice
  • 2/6/2012;23412;Contacted Lisa in the collection department.
  • 2/6/2012;23412;Compliance department have decided that customer have sold products earlier on eBay. Customer’s status set to ON-HOLD. Waiting for further instruction
Our task is to create a function that can get any of the three parts, AND get the individual elements of a single note.
Here is a sample of how this can be done:
1SQL> set serveroutput on
2SQL>
3SQL> DECLARE
4  2     vglNotesString  VARCHAR2(2000);
5  3
6  4     FUNCTION getToken(
7  5                         pString IN VARCHAR2
8  6                          ,pPosition IN INTEGER
9  7                          ,pDelimiter IN VARCHAR2 DEFAULT ';'
10  8                          ,pNullVal IN VARCHAR2 DEFAULT 'n/a'
11  9      ) RETURN VARCHAR2 IS
12 10         sLine       VARCHAR2(2000);
13 11         iStartPos   INTEGER;
14 12         iEndPos     INTEGER;
15 13          iLength        INTEGER;
16 14          vToken     VARCHAR2(200);
17 15         bContinue   BOOLEAN := TRUE;
18 16      BEGIN
19 17          -- Get start position
20 18          IF ( pPosition = 1) THEN
21 19              iStartPos := 1;
22 20          ELSIF ( pPosition < 0) THEN
23 21              iStartPos := INSTR(pString, pDelimiter, -1, ABS(pPosition)) + 1;
24 22         ELSE
25 23              iStartPos := INSTR(pString, pDelimiter, 1, pPosition - 1) + 1;
26 24             IF (iStartPos = 1) THEN
27 25                  bContinue := FALSE;
28 26              END IF;
29 27          END IF;
30 28
31 29         IF (bContinue) THEN
32 30             -- Get the end position, and then the length of the token
33 31             IF (pPosition > 0) THEN
34 32                  iEndPos := INSTR(pString, pDelimiter, 1, pPosition );
35 33                  IF (iEndPos > 0) THEN
36 34                      iLength := iEndPos - iStartPos;
37 35                  ELSIF (iEndPos = 0) THEN
38 36                      iLength := LENGTH(pString) - iStartPos + 1;
39 37                  END IF;
40 38             ELSE
41 39                  IF (pPosition = -1) THEN
42 40                     iEndPos := LENGTH(pString);
43 41                  ELSE
44 42                     iEndPos := INSTR(pString, pDelimiter, -1, ABS(pPosition) - 1) - 1;
45 43                 END IF;
46 44                 iLength := iEndPos - iStartPos + 1;
47 45              END IF;
48 46
49 47              vToken := NVL(SUBSTR(pString, iStartPos, iLength), pNullVal);
50 48         ELSE
51 49              vToken := pNullVal;
52 50          END IF;
53 51
54 52          RETURN TO_CHAR(vToken);
55 53      END getToken;
56 54
57 55  BEGIN
58 56     vglNotesString := '2/5/2012;23412;Customer called reg. invoice#2/6/2012;23412;Contacted Lisa in the collection ';
59 57      vglNotesString := vglNotesString || 'department.#2/6/2012;23412;Compliance department have decided that customer ';
60 58      vglNotesString := vglNotesString || 'have sold products earlier on eBay. Customer''s status set to ON-HOLD. ';
61 59      vglNotesString := vglNotesString || 'Waiting for further instruction';
62 60
63 61
64 62     dbms_output.put_line('Positive tokenizing:');
65 63     dbms_output.put_line('Note 1:[' || getToken(vglNotesString, 1, '#', 'n/a') || ']');
66 64     dbms_output.put_line('Note 2:[' || getToken(vglNotesString, 2, '#', 'n/a') || ']');
67 65     dbms_output.put_line('Note 3:[' || getToken(vglNotesString, 3, '#', 'n/a') || ']');
68 66     dbms_output.put_line('Note 4:[' || getToken(vglNotesString, 4, '#', 'n/a') || ']');
69 67     dbms_output.put_line(' -- ');
70 68     dbms_output.put_line('Negative tokenizing:');
71 69     dbms_output.put_line('Note -1:[' || getToken(vglNotesString, -1, '#', 'n/a') || ']');
72 70     dbms_output.put_line('Note -2:[' || getToken(vglNotesString, -2, '#', 'n/a') || ']');
73 71     dbms_output.put_line('Note -3:[' || getToken(vglNotesString, -3, '#', 'n/a') || ']');
74 72     dbms_output.put_line('Note -4:[' || getToken(vglNotesString, -4, '#', 'n/a') || ']');
75 73     dbms_output.put_line(' -- ');
76 74     dbms_output.put_line('Positive tokenizing of a tokenized string:');
77 75     dbms_output.put_line('Note 1:[' || getToken(getToken(vglNotesString, 1, '#', 'n/a'), 1, ';', 'n/a') || ']');
78 76     dbms_output.put_line('Note 2:[' || getToken(getToken(vglNotesString, 1, '#', 'n/a'), 2, ';', 'n/a') || ']');
79 77     dbms_output.put_line('Note 3:[' || getToken(getToken(vglNotesString, 1, '#', 'n/a'), 3, ';', 'n/a') || ']');
80 78  END;
81 79  /
82
83Positive tokenizing:
84Note 1:[2/5/2012;23412;Customer called reg. invoice]
85Note 2:[2/6/2012;23412;Contacted Lisa in the collection department.]
86Note 3:[2/6/2012;23412;Compliance department have decided that customer have sold products earlier on eBay. Customer's status set to ON-HOLD. Waiting for further instruction]
87Note 4:[n/a]
88 --
89Negative tokenizing:
90Note -1:[2/6/2012;23412;Compliance department have decided that customer have sold products earlier on eBay. Customer's status set to ON-HOLD. Waiting for further instruction]
91Note -2:[2/6/2012;23412;Contacted Lisa in the collection department.]
92Note -3:[2/5/2012;23412;Customer called reg. invoice]
93Note -4:[n/a]
94 --
95Positive tokenizing of a tokenized string:
96Note 1:[2/5/2012]
97Note 2:[23412]
98Note 3:[Customer called reg. invoice]
99
100PL/SQL procedure successfully completed

As you can see, I left the third parameter as a VARCHAR2, rather than using a CHAR. The reason for this is that sometimes our delimiter can actually consist of more than one character, like i.e. “##
In addition I have made the function so that it can handle both positive and negative string positioning. This will be valuable if you for some reason wanted to tokenize / parse a string where the token positions are numbered from the end of the string, going towards the beginning.
Both the INSTR and the SUBSTR functions can handle both positive and negative positioning.

In Conclusion

As you can see, there are many ways to tokenize or parse a string in PL/SQL. In this little tutorial we have touched a few of them. No matter what method you choose, an advice would be to choose your delimiter wisely. Usually I use a character that most likely would not show up in a normal string, like: #, @, $, ^, etc.
One thing to consider as well is what you want your function to do if there is nothing returned. Would you i.e. just return the text “n/a“, or something?

No comments :

Post a Comment