My failed attempts at fixing an obscure Postgres bug

Note: this is an unfinished bug-hunt with no satisfying conclusion. I am documenting what I know so far to hopefully finish solving this later.

Background

When building bm25 indexing for LanternDB, I came across a very obscure Postgres bug (I reported in pgsql-bugs mailing list). The bug causes a valid PL/pgSQL function call to fail, likely because of some faulty cache invalidation code path. I would be very surprised if this has actually affected anyone. if it has, please let me know! But as someone who has a history of obsession with bugs1, I was intrigued and wanted to get to the root of it and fix it.

The bug

The simplest code reproducing the involves two PL/pgSQL polymorphic functions with signatures similar to the one below:

CREATE FUNCTION identity(t anyelement)
RETURNS TABLE ("row" anyelement) AS $$
-- implementation skipped for now
$$ LANGUAGE plpgsql;

The function takes a single argument of polymorphic type and returns a table that has the same row type as the input argument. It is intended to take a table type as an argument and return data that has the same structure as the table. You can read more in Postgres docs about PL/pgSQL polymorphic functions here.

We used PL/pgSQL functions like the one above to take in arbitrary table types and “project” additional computed columns onto them. But we will stick to the simpler identity signiture for bravity.

Below is an API usage example:

SELECT * FROM identity(CAST(NULL as "test_table"));
 id |   vec   |
----+---------+
  1 | {1,2,3} |
  2 | {4,5,6} |
  3 | {7,8,9} |
(3 rows)

where test_table is a table like the one below:

-- Create tables to call the table-polimorphic function on
CREATE TABLE test_table(id SERIAL PRIMARY KEY, vec real[]);
INSERT INTO test_table(vec) VALUES ('{1,2,3}'), ('{4,5,6}'),
('{7,8,9}');

The identity call works flowlessly until the structure of source table changes. When we modify test_table row type by e.g. adding a column:

ALTER TABLE test_table ADD COLUMN vec23 real[];

The identity function stops working:

SELECT * FROM identity(CAST(NULL as "test_table"));
ERROR:  structure of query does not match function result type
DETAIL:  Number of returned columns (2) does not match expected column count (3).
CONTEXT:  SQL statement "SELECT * FROM issue4.identity_internal(t)"
PL/pgSQL function issue4.identity(anyelement) line 3 at RETURN QUERY

Polymorphic types in postgres get instantiated to a concrete type for each call and are not supposed to hang on to the structure of the data they were first called with. It seems in this particular case they misbehave.

Hypothesis

Postgres creates execution plans for queries (including SQL and PL/pgSQL function calls). Postgres caches these to prevent needless re-planning in following calls. Obviously, these plans make some assumptions about queries and data they get in. In some cases multiple plans with various strictness of assumptions also exist. See Postgres Query Planning and read about Postgres custom and generic plans for more.

In all cases, however, the plans have to be properly evaluated against every new call and get invalidated as soon as their assumptions are violated.

Early on I suspected that this is a cache invalidation issue – a PL/pgSQL function plan gets cached with a monomorphosized table signiture it is first called with and later that cache is reused when the function is called with differing raw type.

I noticed that once the bug starts manifesting, creating and dropping a schema makes the bug go away.

CREATE SCHEMA dummy;
DROP SCHEMA dummy;

Now this succeeds!

SELECT * FROM identity(CAST(NULL as "test_table"));
 id |   vec   | vec2
----+---------+------
  1 | {1,2,3} |
  2 | {4,5,6} |
  3 | {7,8,9} |
(3 rows)

Since the dummy schema that we created and immediately deleted does not interact with any other object we have, it is very likely this is a cache invalidation issue.

Root cause search attempts

In many ways this had all the characteristics of an easily traceable and fixample bug. It was reproducible with no randomness 100% of the time, only a few lines of toy example and a minimal dataset was necessary to reproduce it. In my head, I had done the 70% of the work by creating a reproducing example and all that was remaining was to plug gdb, run some backtraces, find where the invalid cached plan is being used and add a patch calling the invalidation function.

But after many hours of Postgres source code exploration, I found almost nothing so far.

I only found one more piece of evidence pointing to invalidation issue:

in the function handling execution of set-returning functions, if we always invalidate previously planned cache, the issue we observed above goes away. Below is a snippet from pl_exec.c file around line `3602:

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index d19425b7a7..dae8bb1ef3 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -3599,6 +3599,8 @@ exec_stmt_return_query(PLpgSQL_execstate *estate,
                /*
                 * On the first call for this expression generate the plan.
                 */
+               // the change always invalidates execution plan and "fixes" our error
+               expr->plan = NULL;
                if (expr->plan == NULL)
                        exec_prepare_plan(estate, expr, CURSOR_OPT_PARALLEL_OK);

Adding plan invalidation on line 3602 forces function calls for set-returning functions such as identity to never use cached plans. After that change the issue went away.

I tried exploring code around it, tracing the code up and down with gdb, and even recorded snapshots of complete process memory before and after the issue was triggered. But in all cases the volume of code and memory changes (~2MB in 1000+ different locations, mostly in heap) were too extensive to enable any progress with the blind-debugging methods I was trying.

I will likely revisit the issue after developing deeper udnerstanding of Postgres caching internals. If you have any tips for how to track down a bug like this without complete understanding of the system at hand (Postgres in this case) or have details about relevant Postgres subsystems involved that could help me track this down, please reach out (email on homepage)!

Full Example to Reproduce the Bug

Below is a complete example that creates the necessary two helper functions with identical signature as above and demonstrates the issue with some data.

Click to expand full SQL code
 -- create everything on an isolated schema
 -- to simplify the cleanup
DROP schema if exists issue4 cascade;
CREATE schema issue4;

-- define the first function
CREATE FUNCTION issue4.identity_internal(t anyelement)
 RETURNS TABLE ("row" anyelement) AS $$
DECLARE
  query_base text;
BEGIN
  query_base := format('SELECT * FROM %s ', pg_typeof(t));
  RETURN QUERY EXECUTE query_base;
END $$ LANGUAGE plpgsql;

-- define the second function
CREATE FUNCTION issue4.identity(t anyelement)
 RETURNS TABLE ("row" anyelement) AS $$
BEGIN
  RETURN QUERY SELECT * FROM issue4.identity_internal(t);
END $$ LANGUAGE plpgsql;

-- Create table to call the table-polimorphic function on
CREATE TABLE issue4.test_table(id SERIAL PRIMARY KEY, vec real[]);
INSERT INTO issue4.test_table(vec) VALUES ('{1,2,3}'), ('{4,5,6}'),
('{7,8,9}');

-- succeeds!
SELECT * FROM issue4.identity(CAST(NULL as "issue4"."test_table"));

--------------------- CORE OF THE ISSUE -----------------------

-- Modifying the table row type makes later calls fail
ALTER TABLE issue4.test_table ADD COLUMN vec2 real[];
-- this fails
SELECT * FROM issue4.identity(CAST(NULL as "issue4"."test_table"));
-- ^^^^^^^^^^^^^^^^^^^^^ CORE OF THE ISSUE ^^^^^^^^^^^^^^^^^^^^^^^

-- creating+deleting an unrelated schema
-- makes the issue go away
CREATE SCHEMA issue4_dummy;
DROP SCHEMA issue4_dummy;

-- This (EXACT SAME QUERY AS ABOVE) now succeeds!?
SELECT * FROM issue4.identity(CAST(NULL as "issue4"."test_table"));

-- cleanup
DROP SCHEMA issue4 CASCADE;

  1. Galstyan, Narek. “Application-Integrated Record-Replay of Distributed Systems.” (2024). https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-4.pdf ↩︎