How to get the last not-null value in an ordered column of a huge table? The 2019 Stack Overflow Developer Survey Results Are In Announcing the arrival of Valued Associate #679: Cesar Manara Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)Quickly change NULL column to NOT NULLTSQL change column constraint between null and not nullHow to get the first or last row in ordered result set, depending by column value?Can I use SPARSE somehow on a non-nullable bit column with mostly false values?Columnstore Aggregate Pushdown doesn't work for float/real data typesHow to handle null value number DataWarehouseHow to get 0 in the null value fieldsHOW to work with NULL in a NOT NULL column?How to select the set of last non-NULL values per column over a group?Select last non-null value for given partition - Postgres 10
Is this wall load bearing? Blueprints and photos attached
How does ice melt when immersed in water
Why is the object placed in the middle of the sentence here?
How is simplicity better than precision and clarity in prose?
Empty set is subset of every set? If yes, why that...
How to grep and cut numbes from a file and sum them
Would an alien lifeform be able to achieve space travel if lacking in vision?
Is it ethical to upload a automatically generated paper to a non peer-reviewed site as part of a larger research?
Was credit for the black hole image misattributed?
Hopping to infinity along a string of digits
Is a pteranodon too powerful as a beast companion for a beast master?
Take groceries in checked luggage
Semisimplicity of the category of coherent sheaves?
Can withdrawing asylum be illegal?
What are these Gizmos at Izaña Atmospheric Research Center in Spain?
Match Roman Numerals
Did the UK government pay "millions and millions of dollars" to try to snag Julian Assange?
How did the audience guess the pentatonic scale in Bobby McFerrin's presentation?
How does this infinite series simplify to an integral?
Am I ethically obligated to go into work on an off day if the reason is sudden?
Simulating Exploding Dice
What can I do if neighbor is blocking my solar panels intentionally?
Can a novice safely splice in wire to lengthen 5V charging cable?
How to politely respond to generic emails requesting a PhD/job in my lab? Without wasting too much time
How to get the last not-null value in an ordered column of a huge table?
The 2019 Stack Overflow Developer Survey Results Are In
Announcing the arrival of Valued Associate #679: Cesar Manara
Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)Quickly change NULL column to NOT NULLTSQL change column constraint between null and not nullHow to get the first or last row in ordered result set, depending by column value?Can I use SPARSE somehow on a non-nullable bit column with mostly false values?Columnstore Aggregate Pushdown doesn't work for float/real data typesHow to handle null value number DataWarehouseHow to get 0 in the null value fieldsHOW to work with NULL in a NOT NULL column?How to select the set of last non-NULL values per column over a group?Select last non-null value for given partition - Postgres 10
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
I have the following input:
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL
I expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104
The trivial solution would be join the tables with a <
relation, and then selecting the MAX
value in a GROUP BY
:
WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
sql-server t-sql null window-functions running-totals
add a comment |
I have the following input:
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL
I expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104
The trivial solution would be join the tables with a <
relation, and then selecting the MAX
value in a GROUP BY
:
WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
sql-server t-sql null window-functions running-totals
add a comment |
I have the following input:
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL
I expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104
The trivial solution would be join the tables with a <
relation, and then selecting the MAX
value in a GROUP BY
:
WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
sql-server t-sql null window-functions running-totals
I have the following input:
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL
I expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104
The trivial solution would be join the tables with a <
relation, and then selecting the MAX
value in a GROUP BY
:
WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
sql-server t-sql null window-functions running-totals
sql-server t-sql null window-functions running-totals
edited Apr 3 at 3:17
Paul White♦
54.2k14288461
54.2k14288461
asked Mar 31 at 17:19
peterhpeterh
1,13341532
1,13341532
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;
Demo: db<>fiddle
add a comment |
I expected t-sql to optimize it out - on a block/record level, the
task to do is very easy and linear, essentially a for loop ( O(n) ).
That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
SELECT TOP (1) [VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t2
WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]
. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.
It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
Sample data for the test:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (
ID BIGINT NOT NULL
);
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];
CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
ID BIGINT NOT NULL,
[VALUE] BIGINT NULL
);
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;
CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, withWITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.
– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
add a comment |
One method, by using OVER()
and MAX()
and COUNT()
based on this source could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
FROM
(
SELECT ID, value
,COUNT(value) OVER (ORDER BY ID) AS Value2
FROM dbo.HugeTable
) a
ORDER BY ID;
Result
Id UpdatedValue
1 136
2 136
3 650
4 650
5 650
6 650
7 954
8 954
9 104
10 104
Another method based on this source, closely related to the first example
;WITH CTE As
(
SELECT value,
Id,
COUNT(value)
OVER(ORDER BY Id) As Value2
FROM dbo.HugeTable
),
CTE2 AS (
SELECT Id,
value,
First_Value(value)
OVER( PARTITION BY Value2
ORDER BY Id) As UpdatedValue
FROM CTE
)
SELECT Id,UpdatedValue
FROM CTE2;
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "182"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f233610%2fhow-to-get-the-last-not-null-value-in-an-ordered-column-of-a-huge-table%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;
Demo: db<>fiddle
add a comment |
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;
Demo: db<>fiddle
add a comment |
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;
Demo: db<>fiddle
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;
CREATE TABLE dbo.Example
(
id integer PRIMARY KEY,
val integer NULL
);
INSERT dbo.Example
(id, val)
VALUES
(1, 136),
(2, NULL),
(3, 650),
(4, NULL),
(5, NULL),
(6, NULL),
(7, 954),
(8, NULL),
(9, 104),
(10, NULL);
SELECT
E.id,
E.val,
lastval =
CAST(
SUBSTRING(
MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
ORDER BY E.id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS integer)
FROM dbo.Example AS E
ORDER BY
E.id;
Demo: db<>fiddle
answered Apr 1 at 0:30
Paul White♦Paul White
54.2k14288461
54.2k14288461
add a comment |
add a comment |
I expected t-sql to optimize it out - on a block/record level, the
task to do is very easy and linear, essentially a for loop ( O(n) ).
That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
SELECT TOP (1) [VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t2
WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]
. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.
It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
Sample data for the test:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (
ID BIGINT NOT NULL
);
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];
CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
ID BIGINT NOT NULL,
[VALUE] BIGINT NULL
);
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;
CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, withWITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.
– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
add a comment |
I expected t-sql to optimize it out - on a block/record level, the
task to do is very easy and linear, essentially a for loop ( O(n) ).
That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
SELECT TOP (1) [VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t2
WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]
. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.
It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
Sample data for the test:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (
ID BIGINT NOT NULL
);
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];
CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
ID BIGINT NOT NULL,
[VALUE] BIGINT NULL
);
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;
CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, withWITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.
– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
add a comment |
I expected t-sql to optimize it out - on a block/record level, the
task to do is very easy and linear, essentially a for loop ( O(n) ).
That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
SELECT TOP (1) [VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t2
WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]
. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.
It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
Sample data for the test:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (
ID BIGINT NOT NULL
);
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];
CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
ID BIGINT NOT NULL,
[VALUE] BIGINT NULL
);
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;
CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
I expected t-sql to optimize it out - on a block/record level, the
task to do is very easy and linear, essentially a for loop ( O(n) ).
That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
SELECT TOP (1) [VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t2
WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]
. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.
It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
Sample data for the test:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (
ID BIGINT NOT NULL
);
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];
CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
ID BIGINT NOT NULL,
[VALUE] BIGINT NULL
);
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;
CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
answered Mar 31 at 22:31
Joe ObbishJoe Obbish
21.9k43291
21.9k43291
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, withWITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.
– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
add a comment |
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, withWITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.
– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.
– peterh
Apr 1 at 2:57
I used a simple naive queries, without any (indexed) temporary tables, with
WITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.– peterh
Apr 1 at 13:04
I used a simple naive queries, without any (indexed) temporary tables, with
WITH ... AS
constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.– peterh
Apr 1 at 13:04
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
@peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?
– Joe Obbish
Apr 1 at 22:49
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.
– peterh
Apr 2 at 6:11
add a comment |
One method, by using OVER()
and MAX()
and COUNT()
based on this source could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
FROM
(
SELECT ID, value
,COUNT(value) OVER (ORDER BY ID) AS Value2
FROM dbo.HugeTable
) a
ORDER BY ID;
Result
Id UpdatedValue
1 136
2 136
3 650
4 650
5 650
6 650
7 954
8 954
9 104
10 104
Another method based on this source, closely related to the first example
;WITH CTE As
(
SELECT value,
Id,
COUNT(value)
OVER(ORDER BY Id) As Value2
FROM dbo.HugeTable
),
CTE2 AS (
SELECT Id,
value,
First_Value(value)
OVER( PARTITION BY Value2
ORDER BY Id) As UpdatedValue
FROM CTE
)
SELECT Id,UpdatedValue
FROM CTE2;
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
add a comment |
One method, by using OVER()
and MAX()
and COUNT()
based on this source could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
FROM
(
SELECT ID, value
,COUNT(value) OVER (ORDER BY ID) AS Value2
FROM dbo.HugeTable
) a
ORDER BY ID;
Result
Id UpdatedValue
1 136
2 136
3 650
4 650
5 650
6 650
7 954
8 954
9 104
10 104
Another method based on this source, closely related to the first example
;WITH CTE As
(
SELECT value,
Id,
COUNT(value)
OVER(ORDER BY Id) As Value2
FROM dbo.HugeTable
),
CTE2 AS (
SELECT Id,
value,
First_Value(value)
OVER( PARTITION BY Value2
ORDER BY Id) As UpdatedValue
FROM CTE
)
SELECT Id,UpdatedValue
FROM CTE2;
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
add a comment |
One method, by using OVER()
and MAX()
and COUNT()
based on this source could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
FROM
(
SELECT ID, value
,COUNT(value) OVER (ORDER BY ID) AS Value2
FROM dbo.HugeTable
) a
ORDER BY ID;
Result
Id UpdatedValue
1 136
2 136
3 650
4 650
5 650
6 650
7 954
8 954
9 104
10 104
Another method based on this source, closely related to the first example
;WITH CTE As
(
SELECT value,
Id,
COUNT(value)
OVER(ORDER BY Id) As Value2
FROM dbo.HugeTable
),
CTE2 AS (
SELECT Id,
value,
First_Value(value)
OVER( PARTITION BY Value2
ORDER BY Id) As UpdatedValue
FROM CTE
)
SELECT Id,UpdatedValue
FROM CTE2;
One method, by using OVER()
and MAX()
and COUNT()
based on this source could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
FROM
(
SELECT ID, value
,COUNT(value) OVER (ORDER BY ID) AS Value2
FROM dbo.HugeTable
) a
ORDER BY ID;
Result
Id UpdatedValue
1 136
2 136
3 650
4 650
5 650
6 650
7 954
8 954
9 104
10 104
Another method based on this source, closely related to the first example
;WITH CTE As
(
SELECT value,
Id,
COUNT(value)
OVER(ORDER BY Id) As Value2
FROM dbo.HugeTable
),
CTE2 AS (
SELECT Id,
value,
First_Value(value)
OVER( PARTITION BY Value2
ORDER BY Id) As UpdatedValue
FROM CTE
)
SELECT Id,UpdatedValue
FROM CTE2;
edited Apr 1 at 13:13
answered Mar 31 at 17:54
Randi VertongenRandi Vertongen
4,8211924
4,8211924
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
add a comment |
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
3
3
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Consider adding details about how these approaches perform with a "huge table".
– Joe Obbish
Mar 31 at 22:32
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.
– peterh
Apr 1 at 3:02
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f233610%2fhow-to-get-the-last-not-null-value-in-an-ordered-column-of-a-huge-table%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown