Functional indexing

   I’ve decided to recap everything that I’ve know about SQL indexes by this moment, and the book USE THE INDEX, LUKE! by Markus Winand looks perfect target which could guide me through this journey.
   In this post I’ve dive a bit into functional indexes - a special type of indexes, which are used for the cases when you have arithmetic function in where operator instead of static statement.
   So, here is an experiment.
   Let’s create heavy table using the following SQL

  
CREATE TABLE scale_data (
section NUMERIC NOT NULL,
id1     NUMERIC NOT NULL,
id2     NUMERIC NOT NULL
);

INSERT INTO scale_data
SELECT sections.*, gen.*
, CEIL(RANDOM()*100)
FROM GENERATE_SERIES(1, 300)     sections,
GENERATE_SERIES(1, 900000) gen
WHERE gen <= sections * 3000;

and check how much time will take simple select with where operator:

explain analyze select id1 from scale_data sd where id1 > 899990;

Gather  (cost=1000.00..1567682.25 rows=13545 width=6) 
(actual time=15693.038..15710.051 rows=10 loops=1)
...
->  Parallel Seq Scan on scale_data sd  (cost=0.00..1565327.75 rows=5644 width=6) 
(actual time=15661.368..15661.369 rows=3 loops=3)
Filter: (id1 > '899990'::numeric)
Rows Removed by Filter: 45149997
...
Execution Time: 16073.845 ms

Ok, ~16s.
Now, let’s add simple index to the id1 column and re-check execution time again - 0.052ms, which is fine, because as we know, indexes improve performance at the cost of additional space usage. In our case table occupies ~50% more space(9.4G compared with 6.6G without index).
But what will happen, if we use an arithmetic function in where clause ?

  
explain analyze select id1 from scale_data sd where id1-5 > 899990;
Seq Scan on scale_data sd  (cost=0.00..2891609.00 rows=45150000 width=6) 
(actual time=73.147..36141.529 rows=5 loops=1)
...
Execution Time: 36142.316 ms

Hm, execution time increased by 2 times, and its clear why: id1x cannot be used anymore because in where we have arithmetic function, so time-consuming Seq Scan is performed.
Fortunately, PostgreSQL have functional indexes, when index can be created not for clear column name, but for function. Let’s try it.

create index id1x_m_5 on scale_data ((id1-5))

Although query takes now 25s instead of 36s(which is unexpected improvement for me), we see that index is not used during query execution.
Additionally, with the 2nd index id1x_minus_5 table has started to occupy 12G, which is 2 times increase compared to initial value.

Seq Scan on scale_data sd  (cost=0.00..2891609.00 rows=45150000 width=6)
(actual time=25389.553..25389.554 rows=5 loops=1)
...
Execution Time: 25389.815 ms

My guess is that it may happen because of > operator. Let’s change it to = and re-check.

Gather  (cost=13689.26..1718978.07 rows=677250 width=6) 
(actual time=25.943..26.046 rows=1 loops=1)
...
->  Parallel Bitmap Heap Scan on scale_data sd  
(cost=12689.26..1650253.07 rows=282188 width=6) 
(actual time=8.604..8.606 rows=0 loops=3)
Recheck Cond: ((id1 - '5'::numeric) = '899990'::numeric)
Heap Blocks: exact=1
->  Bitmap Index Scan on id1x_minus_5  (cost=0.00..12519.94 rows=677250 width=0) 
(actual time=0.018..0.018 rows=1 loops=1)
Index Cond: ((id1 - '5'::numeric) = '899990'::numeric)
...
Execution Time: 26.384 ms

Execution time dropped to 26ms and explain analyze is showing that index was scanned.
But we still need to have > and not = , so how should we act here ?
There is one more type of indexes, which can cover only selected range of values - partial indexes.
So, why not to try to mix them with functional ones?

create index id1x_m_5_partial on scale_data ((id1-5)) where (id1-5) > 899990

Bingo!

Bitmap Heap Scan on scale_data sd  (cost=11295.66..2851286.60 rows=45150000 width=6) 
(actual time=26.890..26.897 rows=5 loops=1)
Recheck Cond: ((id1 - '5'::numeric) > '899990'::numeric)
Heap Blocks: exact=1
->  Bitmap Index Scan on id1_m_5_partial  (cost=0.00..8.16 rows=45150000 width=0) 
(actual time=0.010..0.010 rows=5 loops=1)
...
Execution Time: 27.140 ms

Execution time dropped to 27ms at the cost of specifying exact range which we’re going to use in where clause. How about disc space utilization ?
I’ve dropped id1x and id1x_minus_5 indexes, and table size is back to initial value which shows that partial index are very compact.

Conclusion.

   1) Functional indexes are helpful in narrow specific cases when we have to use expressions in where operator.    2) They are not a silver bullet and in some cases(like with >) might just not work.
   3) It’s possible to mix them with partial index for gaining flexibility.
   4) It’s very good practice to define area of index usage precisely, because(in general) it helps to select appropriate type of index and thus to save disc space and improve time execution.

Keep indexing! See you!