Friday, August 26, 2016

Dead Lock - 2

What is a database deadlock? Provide an example and explanation of a deadlock in a database.



In a database, a deadlock is a situation that occurs when two or more database sessions each have some data locked, and each session requests a lock on data that another session has already locked. Because the sessions are waiting for each other, neither can make any progress. This scenario, where nothing happens because sessions are waiting indefinitely for each other, is known as deadlock.
If you are confused, some examples should help clarify what goes on during deadlock. You should probably read our explanation of database locks before proceeding, since that will help your understanding as well.

Database deadlock example

Suppose we have two database sessions called A and B. Let’s say that session A requests and holds a lock on some data – call that data Y. Then session B takes a lock on some other data that we will call Z. Now, let’s say that session A needs a lock on data Z in order to run another SQL statement, but that lock is currently held by session B. And let’s say that session B needs a lock on data Y, but that lock is currently held by session A. This means that session A is waiting for session B’s lock and session B is waiting for session A’s lock. And this is what deadlock is all about!
Let’s go through a more detailed (and less abstract) example of deadlock so that you can get a more specific idea of how deadlock can arise.

Database deadlock example in banking



Let’s use an example of two database users working at a bank – let’s call those database users X and Y. Let’s say that user X works in the customer service department and has to update the database for two of the bank’s customers, because one customer (call him customer A) incorrectly received $5,000 in his account when it should have gone to another customer (call him customer B) – so user X has to debit customer A’s account by $5,000 and also credit customer B’s account $5,000.
Note that the crediting of customer B and debiting of customer A will be run as a single transaction – this is important for the discussion that follows.
Now, let’s also say that the other database user – Y – works in the IT department and has to go through the customers table and update the zip code of all customers who currently have a zip code of 94520, because that zip code has now been changed to 94521. So, the SQL for this would simply have a WHERE clause that would limit the update to customers with a zip code of 94520.
Also, both customers A and B currently have zip codes of 94520, which means that their information will be updated by database user Y.
Here is a breakdown of the events in our fictitious example that lead to deadlock:
  • 1. Database user X in the customer service department selects customer A’s data and updates A’s bank balance to debit/decrease it by $5,000. However, what’s important here is that no COMMIT is issued yet, because database user X still has to update customer B’s balance to increase/credit it by $5,000 – and those 2 separate SQL statements will run as a single SQL transaction. Most importantly, this means that database user X still holds a lock on the row for customer A, because his transaction is not fully committed yet (he still has to update customer B). The lock on the row for customer A will stay until the transaction is committed.
  • 2. Database user Y then has to run his SQL to update the zip codes for customers with zip codes of 94520. The SQL updates customer B’s zip code. But because user Y’s statement runs as a single transaction, that transaction has not committed yet either – not all of the customers have had their zip codes changed. So, this means that database user Y holds a lock on the row for customer B.
  • 3. Now, Database user X still has to run the SQL statement that will update customer B’s balance to increase it by $5,000. But, now the problem is that database user Y has a lock on the row for customer B. This means that the request to update customer B’s balance must wait for user Y to release the lock on customer B. So, database user X is waiting for user Y to release a lock on customer B.
  • 4. Now, the SQL statement being run by user Y tries to update the zip code for customer A. But, this update can not happen because user X holds a lock on customer A’s row. So, user Y is waiting for a lock to be released by user X.
  • 5. Now you can see that we have user X waiting for user Y to release a lock and user Y waiting for user X to release a lock. This is the situation of deadlock, since neither user can make any progress, and nothing happens because they are both waiting for each other. So, in theory, these two database sessions will be stalled forever – a SQL sketch of this interleaving follows below. But read on to see how some DBMS’s deal with this situation.
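To make the sequence concrete, here is a minimal SQL sketch of the interleaving above. The table and column names (customers, balance, zip_code, customer_id) are made up for illustration, and the step comments match the numbered list:

-- Session X (customer service), one transaction, no COMMIT yet:
UPDATE customers SET balance = balance - 5000
WHERE customer_id = 'A';   -- step 1: X now holds a lock on customer A's row

-- Session Y (IT), one transaction:
UPDATE customers SET zip_code = '94521'
WHERE zip_code = '94520';  -- step 2: Y locks customer B's row as it goes,
                           -- step 4: then blocks waiting on customer A's row

-- Session X, same transaction, second statement:
UPDATE customers SET balance = balance + 5000
WHERE customer_id = 'B';   -- step 3: blocks waiting on customer B's row

-- Step 5: X waits on Y and Y waits on X – deadlock. Neither transaction
-- can reach its COMMIT until the DBMS steps in and aborts one of them.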

Database deadlock prevention

So now you have seen an example of deadlock. The question is: how do DBMS’s deal with it? Well, very few modern DBMS’s actually try to prevent or avoid deadlocks, because there’s a lot of overhead required to do so. A DBMS that tries to prevent deadlocks essentially has to predict what a database user will do next – each lock request is inspected to see if it has the potential to cause contention, and if it does, the lock is not allowed to be placed.

Database deadlock detection

Instead of deadlock prevention, the more popular approach to dealing with database deadlocks is deadlock detection. What is deadlock detection? Rather than stopping deadlocks before they happen, the DBMS lets them occur, detects them, and then aborts one of the requests that caused the deadlock so that the others can proceed.

How does deadlock detection work?

There are two common approaches to deadlock detection:
  • 1. Whenever a session is waiting for a lock to be released, it is in what’s known as a “lock wait” state. One way deadlock detection is implemented is to simply cap the lock wait period at a preset limit (like 5 seconds). So, if a session waits more than 5 seconds for a lock to free up, then that session will be terminated.
  • 2. The RDBMS can regularly inspect all the locks currently in place to see if there are any two sessions that have locked each other out and are in a state of deadlock.
In either of the deadlock detection methods, one of the requests will have to be terminated to stop the deadlock. This also means that any transaction changes made before the terminated request will have to be rolled back so that the other request can make progress and finish.
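As a concrete illustration of the first approach, many RDBMS’s expose the lock wait limit as a setting you can tune yourself. For example, in MySQL’s InnoDB engine (just one vendor’s version of this – the parameter name is specific to MySQL) you could do:

-- Give up on any row lock that cannot be acquired within 5 seconds
SET SESSION innodb_lock_wait_timeout = 5;

-- A statement in this session that waits longer than 5 seconds for a lock
-- now fails with a lock wait timeout error, and the application can decide
-- whether to retry the statement or roll back the transaction.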

Dead Lock - 1

What is a database lock in the context of SQL? Provide an example and explanation.

A database lock is used to “lock” some data in a database so that only one database user/session may update that particular data. So, database locks exist to prevent two or more database users from updating the same exact piece of data at the same exact time. When data is locked, no other database session can update that data until the lock is released (which unlocks the data and allows other database users to update it). Locks are usually released by either a ROLLBACK or COMMIT SQL statement.

What happens when another session tries to update the locked data?



Suppose database session A tries to update some data that is already locked by database session B. What happens to session A? Well, session A will actually be placed in what’s called a lock wait state, and session A will be stopped from making further progress with any SQL transaction that it’s performing. Another way of saying this is that session A will be “stalled” until session B releases the lock on that data.
If a session ends up waiting too long for some locked data, then some databases, like DB2 from IBM, will actually time out after a certain amount of time and return an error instead of waiting and then updating the data as requested. But some databases, like Oracle, may handle the situation differently – Oracle can actually leave a session in a lock wait state for an indefinite amount of time. So, there are a lot of differences between different database vendors in terms of how they choose to deal with locks and other sessions waiting for locks to be released.
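One practical consequence of Oracle’s indefinite lock waits is that applications sometimes ask for a lock conditionally instead of waiting. For example (the accounts table here is hypothetical), Oracle’s FOR UPDATE clause supports NOWAIT and WAIT options:

-- Fail immediately (ORA-00054) if another session holds the lock
SELECT balance FROM accounts WHERE account_id = 123 FOR UPDATE NOWAIT;

-- Or wait at most 5 seconds for the lock before giving up
SELECT balance FROM accounts WHERE account_id = 123 FOR UPDATE WAIT 5;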

Database locking techniques

Database locks can actually be placed at different levels – also known as lock granularity – within the database.

Here is a list of the usual lock levels and types supported, and more information on what each technique means:

Database level locking


With database level locks, the entire database is locked – which means that only one database session can apply any updates to the database. This type of lock is not often used, because it obviously prevents all users except one from updating anything in the database. But, this lock can be useful when some major support update is necessary – like upgrading the database to a new version of the software. Oracle actually has an exclusive mode, which is used to allow just one user session to use the database – this is basically a database lock.
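One way to get a similar effect in Oracle today is restricted mode. It is not literally a lock on every row, but while it is enabled only sessions with the RESTRICTED SESSION privilege (typically DBAs) can connect, which effectively locks ordinary users out of the whole database:

-- Keep ordinary users out while performing major maintenance
ALTER SYSTEM ENABLE RESTRICTED SESSION;

-- ... run the upgrade or support work here ...

ALTER SYSTEM DISABLE RESTRICTED SESSION;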

File level locking

With file level locking, an entire database file is locked. What exactly is a file in a database? Well, a file can hold a wide variety of data – inside a file there could be an entire table, part of a table, or even parts of different tables. Because of the variety of data stored inside a file, this type of lock level is less favored.

Table level locking


A table level lock is pretty straightforward – it means that an entire table is locked as a whole. This lock level comes in handy when making a change that affects an entire table, like updating all the rows in a table, or modifying the table to add or remove columns. In Oracle, this is known as a DDL lock, because it’s used with DDL (Data Definition Language) statements like CREATE, ALTER, and DROP – basically statements that modify the entire table in some way or another.
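A table level lock can also be taken explicitly. In Oracle, for example (the employees table here is hypothetical):

-- Lock the whole table until this transaction commits or rolls back
LOCK TABLE employees IN EXCLUSIVE MODE;

-- Other sessions can still query employees, but they cannot insert,
-- update, or delete rows until this transaction ends.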

Page or block level locking

Block, or page, level locking occurs when a block or page that is part of a database file is locked. If you are not already familiar with pages and blocks, you can read more about them here: Pages versus blocks.
Because the data that can be stored in blocks/pages can be wide and varied, page/block locking is less favored in databases today.

Column level locking


A column level lock just means that some columns within a given row in a given table are locked. This form of locking is not commonly used because it requires a lot of resources to enable and release locks at this level. Also, there is very little support for column level locking in most database vendors.

Row level locking

A row level lock applies to a single row in a table. This is also the most commonly used lock level, and practically all major database vendors support row level locks.

Are locks automatically used by databases?



When data is either deleted or updated, locks are always used – even if a database user doesn’t write his/her SQL to explicitly say that a lock must be used. Many of the RDBMS’s out there today also support the “FOR UPDATE OF” clause combined with a normal SELECT statement. The FOR UPDATE OF clause basically says that the database user intends to update some data – although the user is not actually required to make changes to that data. And, because the intent to update the data is declared, a lock is placed on that data as well.
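Here is what that looks like in practice, assuming a hypothetical accounts table. Even though nothing has been changed yet, the selected row is locked as soon as the SELECT runs:

-- Declare the intent to update the balance column; the selected row
-- stays locked until the transaction ends with COMMIT or ROLLBACK
SELECT account_id, balance
FROM accounts
WHERE account_id = 123
FOR UPDATE OF balance;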

Example of database locking

As a simple example of when locking would be used by a database, suppose we have the following SQL:
UPDATE some_table SET some_field = 'some_value'
WHERE some_column = 'XYZ';
The SQL statement above will lock the row or rows which have a value of “XYZ” for the column named “some_column”. The locking of the row(s) happens behind the scenes as part of the RDBMS software, and it prevents other database user sessions from updating the same row(s) at the same exact time as well.

Can data be read when a lock is in place?

It depends on the lock. Some locks are read-exclusive, which means that other sessions in the database can not even read the locked data. Other locks allow the locked data to be read by other sessions, but not updated – so whether a read is possible depends on the type of lock in place.

What is the point of database locking?

If it’s not already clear to you, the reason we have database locks is to prevent the potential loss of data that could happen if updates are applied concurrently – or at the same exact time. If two different database users are allowed to update the same data at the same exact time, then the results could be potentially confusing and disastrous. But if that same data were locked, then that issue would not arise, since only one user could update the locked data at a time.
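The classic problem that locking prevents here is the “lost update”. As a sketch (again with a hypothetical accounts table), suppose two sessions both read a balance of $100 and then write back values computed from that stale read:

-- Session A reads the balance:
SELECT balance FROM accounts WHERE account_id = 123;  -- sees 100
-- Session B reads the same balance:
SELECT balance FROM accounts WHERE account_id = 123;  -- also sees 100

-- Session A deposits $50 based on what it read:
UPDATE accounts SET balance = 150 WHERE account_id = 123;
-- Session B withdraws $30 based on what *it* read:
UPDATE accounts SET balance = 70 WHERE account_id = 123;

-- Session A's deposit is silently overwritten: the balance should be 120.
-- Locking the row at read time (e.g. SELECT ... FOR UPDATE) forces the two
-- transactions to take turns, so neither update is lost.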

What is lock contention?

One problem that occurs with having locks is that locks can cause what’s known as contention. This means that because there are locks on the data, sessions that exist at the same time (concurrent sessions) are essentially competing for the right to apply updates on the same data, because that data may be locked by any given session. In the best case, lock contention means that some user processes run slower because a session is waiting for a lock. In the worst case, having sessions compete for locks can make sessions stall for an indefinite period of time.
When sessions do stall for an indefinite period of time, that is known as deadlock, which you can read more about here: Database deadlock.

Lock Escalation

You should also read about the concept of Lock Escalation, which is a built in feature of many RDBMS’s today.

Tuesday, May 10, 2016

Comparing Common Queries Between Test and Production

The developers complained that their test database was so much slower than production that they could not use it to really test whether their batch processes would run fast enough when migrated to production. They did not give me any particular queries to check. Instead they said that the system was generally too slow. So, I went through a process to find SQL statements that they had run in test and that normally run in production and compare their run times. I thought that I would document the process that I went through here.
First I found the top 100 queries by elapsed time on both the test and production databases using this query:
column FORCE_MATCHING_SIGNATURE format 99999999999999999999
select FORCE_MATCHING_SIGNATURE from
(select
FORCE_MATCHING_SIGNATURE,
sum(ELAPSED_TIME_DELTA) total_elapsed
from DBA_HIST_SQLSTAT
where
FORCE_MATCHING_SIGNATURE is not null and
FORCE_MATCHING_SIGNATURE <>0
group by FORCE_MATCHING_SIGNATURE
order by total_elapsed desc)
where rownum < 101;
The output looked like this:
FORCE_MATCHING_SIGNATURE
------------------------
      944718698451269965
     4634961225655610267
    15939251529124125793
    15437049687902878835
     2879196232471320459
    12776764566159396624
    14067042856362022182
...
Then I found the signatures that were in common between the two lists.
create table test_sigs
(FORCE_MATCHING_SIGNATURE number);

insert into test_sigs values (944718698451269965);
insert into test_sigs values (4634961225655610267);
insert into test_sigs values (15939251529124125793);
...
create table prod_sigs
(FORCE_MATCHING_SIGNATURE number);

insert into prod_sigs values (3898230136794347827);
insert into prod_sigs values (944718698451269965);
insert into prod_sigs values (11160330134321800286);
...
select * from test_sigs
intersect
select * from prod_sigs;
This led to 32 values of FORCE_MATCHING_SIGNATURE which represented queries that ran on both test and production, except for the possible difference in constants.
Next I looked at the overall performance of these 32 queries in test and production using this query:
create table common_sigs
(FORCE_MATCHING_SIGNATURE number);

insert into common_sigs values (575231776450247964);
insert into common_sigs values (944718698451269965);
insert into common_sigs values (1037345866341698119);
...

select
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000) "Elapsed Average ms",
sum(CPU_TIME_DELTA)/(sum(executions_delta)*1000) "CPU Average ms",
sum(IOWAIT_DELTA)/(sum(executions_delta)*1000) "IO Average ms",
sum(CLWAIT_DELTA)/(sum(executions_delta)*1000) "Cluster Average ms",
sum(APWAIT_DELTA)/(sum(executions_delta)*1000) "Application Average ms",
sum(CCWAIT_DELTA)/(sum(executions_delta)*1000) "Concurrency Average ms",
sum(BUFFER_GETS_DELTA)/sum(executions_delta) "Average buffer gets",
sum(DISK_READS_DELTA)/sum(executions_delta) "Average disk reads",
sum(ROWS_PROCESSED_DELTA)/sum(executions_delta) "Average rows processed"
from DBA_HIST_SQLSTAT ss,common_sigs cs
where
ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE;
Here is part of the output:
TOTAL_EXECUTIONS Elapsed Average ms CPU Average ms IO Average ms
---------------- ------------------ -------------- -------------
         5595295         366.185529      241.92785    59.8682797
          430763         1273.75822     364.258421    1479.83294
The top line is production and the bottom is test.
This result supported the development team’s assertion that test was slower than production. The 32 queries averaged about 3.5 times longer run times in test than in production. Also, the time spent on I/O was about 25 times worse. I am not sure why the I/O time exceeded the elapsed time on test. I guess it has something to do with how Oracle measures I/O time. But clearly on average these 32 queries are much slower on test and I/O time probably caused most of the run time difference.
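For reference, those ratios come straight from the two output lines above:

1273.75822 / 366.185529 ≈ 3.5   (average elapsed time, test vs. production)
1479.83294 / 59.8682797 ≈ 25    (average I/O time, test vs. production)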
After noticing this big difference between test and production I decided to get these same sorts of performance metrics for each signature to see if certain ones were worse than others. The query looked like this:
select
ss.FORCE_MATCHING_SIGNATURE,
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000),
sum(CPU_TIME_DELTA)/(sum(executions_delta)*1000),
sum(IOWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CLWAIT_DELTA)/(sum(executions_delta)*1000),
sum(APWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CCWAIT_DELTA)/(sum(executions_delta)*1000),
sum(BUFFER_GETS_DELTA)/sum(executions_delta),
sum(DISK_READS_DELTA)/sum(executions_delta),
sum(ROWS_PROCESSED_DELTA)/sum(executions_delta)
from DBA_HIST_SQLSTAT ss,common_sigs cs
where ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE
group by
ss.FORCE_MATCHING_SIGNATURE
having
sum(executions_delta) > 0
order by
ss.FORCE_MATCHING_SIGNATURE;
I put together the outputs from running this query on test and production and lined the result up like this:
FORCE_MATCHING_SIGNATURE    PROD Average ms    TEST Average ms
------------------------ ------------------ ------------------
      575231776450247964         20268.6719         16659.4585
      944718698451269965         727534.558          3456111.6 *
     1037345866341698119         6640.87641         8859.53518
     1080231657361448615         3611.37698         4823.62857
     2879196232471320459         95723.5569         739287.601 *
     2895012443099075884         687272.949         724081.946
     3371400666194280661         1532797.66         761762.181
     4156520416999188213         109238.997         213658.722
     4634693999459450255          4923.8897         4720.16455
     5447362809447709021         2875.37308          2659.5754
     5698160695928381586         17139.6304         16559.1932
     6260911340920427003         290069.674         421058.874 *
     7412302135920006997         20039.0452         18951.6357
     7723300319489155163         18045.9756         19573.4784
     9153380962342466451         1661586.53         1530076.01
     9196714121881881832         5.48003488         5.13169472
     9347242065129163091         4360835.92         4581093.93
    11140980711532357629         3042320.88         5048356.99
    11160330134321800286         6868746.78         6160556.38
    12212345436143033196          5189.7972         5031.30811
    12776764566159396624         139150.231         614207.784  *
    12936428121692179551         3563.64537         3436.59365
    13637202277555795727          7360.0632         6410.02772
    14067042856362022182         859.732015         771.041714
    14256464986207527479         51.4042938         48.9237251
    14707568089762185958         627.586095          414.14762
    15001584593434987669         1287629.02         1122151.35
    15437049687902878835         96014.9782         996974.876  *
    16425440090840528197         48013.8912         50799.6184
    16778386062441486289         29459.0089         26845.8327
    17620933630628481201         51199.0511         111785.525  *
    18410003796880256802         581563.611         602866.609
I put an asterisk (*) beside the six queries that were much worse on test than production. I decided to focus on these six to get to the bottom of the reason for the difference. Note that many of the 32 queries ran about the same on test as on prod, so it really isn’t the case that everything was slow on test.
Now that I had identified the 6 queries, I wanted to look at what they were spending their time on, including both CPU and wait events. I used the following ASH query to get a profile of the time spent by these queries on both databases:
select
case SESSION_STATE
when 'WAITING' then event
else SESSION_STATE
end TIME_CATEGORY,
(count(*)*10) seconds -- DBA_HIST keeps one ASH sample per 10 seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where
FORCE_MATCHING_SIGNATURE in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201')
group by SESSION_STATE,EVENT
order by seconds desc;
The profile looked like this in test:
TIME_CATEGORY            SECONDS
------------------------ -------
db file parallel read     207450
ON CPU                    141010
db file sequential read    62990
direct path read           36980
direct path read temp      29240
direct path write temp     23110
The profile looked like this in production:
TIME_CATEGORY            SECONDS
------------------------ -------
ON CPU                    433260
PX qref latch              64200
db file parallel read      35730
db file sequential read    14360
direct path read           12750
direct path write temp     12000
So, I/O waits dominate the time on test but not production. Since db file parallel read and db file sequential read were the top I/O waits for these 6 queries, I used ASH to see which of the 6 spent the most time on these waits.
db file parallel read:
select
sql_id,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where
FORCE_MATCHING_SIGNATURE in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201') and
event='db file parallel read'
group by sql_id
order by seconds desc;

SQL_ID           SECONDS
------------- ----------
ak2wk2sjwnd34     159020
95b6t1sp7y40y      37030
brkfcwv1mqsas      11370
7rdc79drfp28a         30
db file sequential read:
select
sql_id,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where
FORCE_MATCHING_SIGNATURE in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201') and
event='db file sequential read'
group by sql_id
order by seconds desc;

SQL_ID           SECONDS
------------- ----------
95b6t1sp7y40y      26840
ak2wk2sjwnd34      22550
6h0km9j5bp69t      13300
brkfcwv1mqsas        170
7rdc79drfp28a        130
Two queries stood out as the top waiters on these two events: 95b6t1sp7y40y and ak2wk2sjwnd34. Then I just ran my normal sqlstat query for both sql_ids on both test and production to find out when they last ran. Here is what the query looks like for ak2wk2sjwnd34:
select ss.sql_id,
ss.plan_hash_value,
sn.END_INTERVAL_TIME,
ss.executions_delta,
ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average ms",
CPU_TIME_DELTA/(executions_delta*1000) "CPU Average ms",
IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
CLWAIT_DELTA/(executions_delta*1000) "Cluster Average ms",
APWAIT_DELTA/(executions_delta*1000) "Application Average ms",
CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average ms",
BUFFER_GETS_DELTA/executions_delta "Average buffer gets",
DISK_READS_DELTA/executions_delta "Average disk reads",
ROWS_PROCESSED_DELTA/executions_delta "Average rows processed"
from DBA_HIST_SQLSTAT ss,DBA_HIST_SNAPSHOT sn
where ss.sql_id = 'ak2wk2sjwnd34'
and ss.snap_id=sn.snap_id
and executions_delta > 0
and ss.INSTANCE_NUMBER=sn.INSTANCE_NUMBER
order by ss.snap_id,ss.sql_id;
I found two time periods where both of these queries were recently run on both test and production and got an AWR report for each time period to compare them.
Here are a couple of pieces of the AWR report for the test database:

[AWR report excerpt for test – screenshots not reproduced]

Here are similar pieces for the production database:

[AWR report excerpt for production – screenshots not reproduced]
What really stood out to me was that the wait events were so different. In production the db file parallel read waits averaged around 1 millisecond and the db file sequential reads averaged under 1 ms. On test they were 26 and 5 milliseconds, respectively. The elapsed times for sql_ids 95b6t1sp7y40y and ak2wk2sjwnd34 were considerably longer in test.
This is as far as my investigation went. I know that the slowdown is most pronounced in these two queries, and I know that their I/O waits correspond to the two wait events. I am still trying to find a way to bring the I/O times down on our test database so that it more closely matches production. But at least I have a narrower focus, with the two top queries and the two wait events.