“TempDB is full” error: solving an interesting SQL Server issue.

Straight to the point: this week I faced an interesting SQL Server issue. A few words about the environment where it occurred:

  • Enterprise system with SQL Server 2005 databases.
  • All changes made by users are logged to history tables by an internal mechanism. It looks much like Change Data Capture, but that feature is only available from SQL Server 2008.
  • Old history data is transferred to a separate server (disk space constraints on the main server); we access it through a SQL linked server.
  • For special reports, history data is written to flat tables on a daily basis. This is done inside stored procedures launched by SQL Server jobs; only records updated since the last execution of the job are processed.
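The incremental part of these jobs can be sketched roughly like this (the bookkeeping table, target table, and column list here are hypothetical, just to illustrate the pattern):

```sql
-- Sketch of the incremental load pattern (names are illustrative, not the real schema):
-- each run processes only history rows changed since the job's last successful run.
declare @LastRunTime datetime

select @LastRunTime = LastRunTime
from dbo.__JobRunLog                     -- hypothetical bookkeeping table
where JobName = 'FlatTableRefresh'

insert into dbo.FlatReportTable (OrderID, VerDateTime /* , ... */)
select h.ID, h.VerDateTime /* , ... */
from dbo.Doc_OrderHead_History h with(nolock)
where h.VerDateTime > @LastRunTime

update dbo.__JobRunLog
set LastRunTime = getdate()
where JobName = 'FlatTableRefresh'
```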

So, the situation was the following: I implemented a new flat table for an object that has about 5-6 million records in its history tables. The first job execution (when it had to process data for the whole period) failed with the error:

Could not allocate space for object ‘dbo.WORKFILE GROUP large record overflow storage:  147184613523456’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

A really bad start, as it also led to ten minutes of server downtime. So I went to the server, reduced the period for which data must be collected inside the procedure (the current year only), and launched the procedure. It ran successfully for about 40 minutes (during working hours) with no impact on users. I caught some missing indexes, made a few other small optimizations, and reduced the procedure’s run time to 20 minutes for a one-year period and 1-2 minutes for a single day.

With a clear conscience I went home, only to receive a night notification that the server was down again. I could not believe it: the procedure had to process only one day of changes (about 4,000 records). In the afternoon it did this in 1-2 minutes, but at night it ran for 30 minutes and failed with the same error: “Could not allocate space for object ‘dbo.WORKFILE GROUP large record overflow storage:  147184613523456’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full”.

It looked like a challenge. First, I ran the stored procedure again: it succeeded, taking 102 seconds and about 3 million reads! Everything was fine. It was a kind of magic. How could 4,000 records use about “147184613523456 bytes” in tempdb? It was clear that turning on autogrow for TempDB would not solve the problem.

Trying to find the root cause of this server behavior, I decided not to launch the stored procedure myself but to launch it through a SQL Server job. And that was it: I could reproduce the issue. I stopped the job after 10 minutes, and its results were discouraging: about 21 million reads!

It was a big step forward: the issue was still mysterious, but it was repeatable. So it was time to log the procedure’s work. The easiest way to do so (from my point of view) is to create a table and insert into it some marks that indicate what is going on inside.

create table dbo.__TableForLogging (
      ID int IDENTITY,         -- identity; order by it to see the right sequence of events
      StatusInfo varchar(100), -- field for any text marks
      EventDateTime datetime   -- when the event occurred; helpful to calculate step durations
)

After each step of the procedure, I added an insert into __TableForLogging, so I was ready for debugging. To verify that the logging worked correctly I executed the procedure manually. It succeeded, and the logging worked perfectly.
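The logging calls themselves are one-liners; between the steps of the procedure it is enough to add something like this (the step descriptions are, of course, illustrative):

```sql
insert into dbo.__TableForLogging (StatusInfo, EventDateTime)
values ('Step 1 done: temp table #p filled', getdate())

-- ... next step of the procedure ...

insert into dbo.__TableForLogging (StatusInfo, EventDateTime)
values ('Step 2 done: journal fields updated', getdate())
```

Afterwards a simple `select * from dbo.__TableForLogging order by ID` shows which step the procedure reached and how long each step took.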

It was time to find the problem, and I did: it was in the following query:

update #p -- temp table in which I prepare data for cursor processing
set NotesJournal = tab1.NotesJournal,
    OrderModificationsJournal = tab1.OrderModificationsJournal
from (
      -- History table in main database
      select ohh.VerDateTime, ohh.ID, ohh.NotesJournal, ohh.OrderModificationsJournal
      from dbo.Doc_OrderHead_History ohh with(nolock)

      union all

      -- History table in history database (linked server)
      select ohh.VerDateTime, ohh.ID, ohh.NotesJournal, ohh.OrderModificationsJournal
      from TMSHistory.TMSHistory.dbo.Doc_OrderHead_History ohh with(nolock)
) tab1
where #p.OrderID = tab1.ID and #p.OrderStatusEndDate = tab1.VerDateTime

At first glance it looked quite ordinary: it was the last of a series of similar updates, the only differences being that it filtered on the field tab1.VerDateTime and that the fields NotesJournal and OrderModificationsJournal were nvarchar(MAX). It was also rather strange that it ran normally when I started the procedure myself but caused terrible load on the server when started from a SQL job.

So it was time for some googling. The majority of the advice was to enlarge tempdb and turn on autogrow. The idea of what to try was found at this link: http://social.msdn.microsoft.com/Forums/en-CA/sqldatabaseengine/thread/5c477375-b266-4632-ad44-486f844c3b15 – to get rid of the union operator.

Why not? So I rewrote the query the following way:

--------------------------------------------------
-- History table in main database
update #p
set NotesJournal = tab1.NotesJournal,
OrderModificationsJournal = tab1.OrderModificationsJournal
from (
      select ohh.VerDateTime, ohh.ID, ohh.NotesJournal, ohh.OrderModificationsJournal
      from dbo.Doc_OrderHead_History ohh with(nolock)
) tab1
where #p.OrderID = tab1.ID and #p.OrderStatusEndDate = tab1.VerDateTime

--------------------------------------------------
-- History table in history database (linked server)
update #p
set NotesJournal = tab1.NotesJournal,
OrderModificationsJournal = tab1.OrderModificationsJournal
from (
      select ohh.VerDateTime, ohh.ID, ohh.NotesJournal, ohh.OrderModificationsJournal
      from TMSHistory.TMSHistory.dbo.Doc_OrderHead_History ohh with(nolock)
) tab1
where #p.OrderID = tab1.ID and #p.OrderStatusEndDate = tab1.VerDateTime

And that was really it: the solution. The job completed in 40 seconds with about 2.5 million reads. I corrected all the other updates, breaking their unions into separate queries, and saved another 10 seconds. So the final result for the whole procedure was 2.4 million reads and 31 seconds of work.

After that, it successfully processed all records for the whole period. It was a real victory; I was satisfied with the results achieved, and the only question left was: why, before the optimization, did it work great when started manually but fail when started from a SQL job?

Any ideas?

P.S.

The root cause of this server behavior was found in the discussion below: SQL jobs are enrolled in transactions to support a rollback in case of failure. I added a begin tran/commit to the manual script and faced the same situation. Thanks to Sebastien Chandonnet.
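So to reproduce the job’s behavior in a manual session, it is enough to wrap the call in an explicit transaction (a sketch; the procedure name here is hypothetical):

```sql
-- The Agent job effectively runs its step inside a transaction,
-- so wrapping the manual call the same way reproduces the tempdb pressure.
begin tran
exec dbo.usp_FillOrderFlatTable   -- hypothetical name of the flat-table procedure
commit tran
```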

P.P.S.

I’ve also made some more optimizations to the query. I created one more temp table and populated it, using two selects, with all the needed information from both servers; the series of updates then works against the new temp table. This reduced the reads to 500 thousand and the duration to 10 seconds.

--------------------------------------------------
create table #p1 (
ID int,
VerDateTime datetime,
TStatusID int,
NotesJournal nvarchar(max),
OrderModificationsJournal nvarchar(max),
VerUserID int
)
insert into #p1 (ID, VerDateTime, TStatusID, NotesJournal, OrderModificationsJournal, VerUserID)
select ohh.ID, ohh.VerDateTime, ohh.TStatusID, ohh.NotesJournal, ohh.OrderModificationsJournal, ohh.VerUserID
from dbo.Doc_OrderHead_History ohh with(nolock)
where ohh.ID in (select p.OrderID from #p p)
insert into #p1 (ID, VerDateTime, TStatusID, NotesJournal, OrderModificationsJournal, VerUserID)
select ohh.ID, ohh.VerDateTime, ohh.TStatusID, ohh.NotesJournal, ohh.OrderModificationsJournal, ohh.VerUserID
from TMSHistory.TMSHistory.dbo.Doc_OrderHead_History ohh with(nolock)
where ohh.ID in (select p.OrderID from #p p)

--------------------------------------------------
-- Single update for all records
update #p
set NotesJournal = tab1.NotesJournal,
OrderModificationsJournal = tab1.OrderModificationsJournal
from (
select ohh.VerDateTime, ohh.ID, ohh.NotesJournal, ohh.OrderModificationsJournal
from #p1 ohh
) tab1
where #p.OrderID = tab1.ID and #p.OrderStatusEndDate = tab1.VerDateTime


3 Responses to “TempDB is full” error: solving interesting SQL Server issue.

  1. My guess: SQL jobs are enrolled in transactions to support a rollback in case of failure. Issue a begin tran/commit in your manual script; you should face the same situation.

    Now a good idea is to move your tempdb to where you have enough space. Easy steps: http://support.microsoft.com/kb/224071

  2. Boris Frolov says:

    Thank you a lot for your comment. I’ll try this tomorrow and let you know the results.

    Concerning TempDB – we have a real issue with free space on production servers. It’s the constraint we will not overcome this year – it’s a long and sad story (((

  3. Boris Frolov says:

    Sebastien, you were right. I’ve corrected the post, adding a description of the root cause. Many thanks for your advice.
