A Tree with Tables §

There are plenty of reasons why you’d need to model a tree of data in your database.

Maybe you need to allow your users to organise some of their other data into groups, maybe you want to model inherited properties in your data, maybe you are trying to represent folders in your database, or a ton of other reasons.

In this post I’ll explore whether there is any benefit to using the ltree extension for modelling data you’d normally model with a regular parent/child relationship.

Modelling a Tree §

The usual way to do this is by adding a parent_id column that references the same table’s id column. This way you can still use foreign keys. Each row is a node in the tree:

CREATE TABLE mytree (
  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  parent_id bigint REFERENCES mytree(id) ON DELETE CASCADE
);

With this we can represent a root row (parent_id IS NULL) or a child row (parent_id IS NOT NULL).

Let’s create a tiny amount of test data:

TRUNCATE mytree RESTART IDENTITY; -- if you already had some values in there
INSERT INTO mytree (parent_id) VALUES (NULL), (NULL), (1), (2), (3), (3), (4), (4), (6);

Now we have the following tree:

1
└─ 3
    ├─ 5
    └─ 6
        └─ 9
2
└─ 4
    ├─ 7
    └─ 8

Querying a Tree §

In Postgres we can query this structure with a recursive CTE, allowing us to fetch the entire tree in a single query.

To get the row with id 4 and all of its descendants:

WITH RECURSIVE
  mydescendants AS (
    SELECT * FROM mytree WHERE id = 4
    UNION ALL
    SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
  )
SELECT * FROM mydescendants;
-- "id","parent_id"
-- 4,2
-- 7,4
-- 8,4

What if we need to find both descendants and ancestors?
No problem:

WITH RECURSIVE
  mydescendants AS (
    SELECT * FROM mytree WHERE parent_id = 4 -- note that I changed this to not include id=4 itself
    UNION ALL
    SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
  ),
  myancestors AS (
    SELECT * FROM mytree WHERE id = 4
    UNION ALL
    SELECT mytree.* FROM mytree, myancestors WHERE mytree.id = myancestors.parent_id
  )
SELECT * FROM myancestors
UNION ALL
SELECT * FROM mydescendants;
-- "id","parent_id"
-- 4,2
-- 2,NULL
-- 7,4
-- 8,4

We now have the ancestor 2, the row itself (4), and the descendants 7 and 8.

Using the ltree Extension §

Another way to model trees in postgres is to use the ltree extension. The ltree datatype is specifically designed for

representing labels of data stored in a hierarchical tree-like structure

A “label” in this context is

a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed)

The ltree data type provides us with a way to represent and search for data stored in a tree-like structure, so maybe we can use this to model our tree?

Let’s explore.
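
Note that ltree ships with the standard contrib modules but isn’t enabled by default, so we first need to enable it in our database:

CREATE EXTENSION IF NOT EXISTS ltree;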

Modelling a Tree (Again) §

CREATE TABLE myltree (
  path ltree PRIMARY KEY
);

Let’s create some test data. We’ll use the same structure as we did previously.

We’ll use the id column from before as our labels, but note that we could use anything that is a valid label here.

INSERT INTO myltree (path)
VALUES ('1'), ('2'), ('1.3'), ('2.4'), ('1.3.5'),
       ('1.3.6'), ('2.4.7'), ('2.4.8'), ('1.3.6.9');

Let’s take a look at our new table:

SELECT * FROM myltree;
-- "path"
-- "1"
-- "2"
-- "1.3"
-- "2.4"
-- "1.3.5"
-- "1.3.6"
-- "2.4.7"
-- "2.4.8"
-- "1.3.6.9"

Querying a Tree (Again) §

Let’s query for the same things as earlier.

First, given the node 4 (which is now represented by the path 2.4), how do we get that row and all of its descendants?

SELECT * FROM myltree WHERE path <@ '2.4';
-- "path"
-- "2.4"
-- "2.4.7"
-- "2.4.8"

This returned the same rows as earlier.

What if we need to find both the descendants and the ancestors of node 4 (path 2.4)? Also easy:

SELECT * FROM myltree WHERE path <@ '2.4' OR path @> '2.4';
-- "path"
-- "2"
-- "2.4"
-- "2.4.7"
-- "2.4.8"

The @> operator checks if the left argument is an ancestor of the right argument.
The <@ operator checks if the left argument is a descendant of the right argument.

So the query above says “SELECT all columns FROM myltree WHERE path is a descendant of 2.4 OR path is an ancestor of 2.4”.
These queries are much simpler than before.
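
To make those two operators concrete, here is a quick check using values from the tree above (both expressions return true):

SELECT '1.3'::ltree @> '1.3.6'::ltree AS is_ancestor,
       '1.3.6'::ltree <@ '1.3'::ltree AS is_descendant;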

What about Performance? §

To detect any kind of performance difference, we’ll need to generate a much larger dataset to test on.

I’ll generate a dataset with 100 root rows, then add 5 child rows to every row until the tree is 7 levels deep. Each root then has 1 + 5 + 25 + 125 + 625 + 3,125 + 15,625 = 19,531 rows in its subtree, for a total of 100 × 19,531 = 1,953,100 rows.

I’m using these two queries to generate our test datasets:

-- myltree dataset
BEGIN;
TRUNCATE TABLE myltree;
CREATE TEMPORARY SEQUENCE ltreeid;
INSERT INTO myltree (path)
WITH RECURSIVE t(path, lvl) AS
(
  SELECT text2ltree(nextval('ltreeid')::text), 1 FROM generate_series(1,100)n
  UNION ALL
  SELECT ltree_addtext(path, nextval('ltreeid')::text), lvl+1 FROM t, generate_series(1,5)n
  WHERE lvl < 7
)
SELECT path FROM t;
DROP SEQUENCE ltreeid;
COMMIT;

which inserted 1,953,100 rows in 32.4 seconds.

and

-- mytree dataset
BEGIN;
TRUNCATE TABLE mytree;
CREATE TEMPORARY SEQUENCE treeid;
INSERT INTO mytree (id, parent_id)
WITH RECURSIVE t(id, parent_id, lvl) AS
(
  SELECT nextval('treeid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
  UNION ALL
  SELECT nextval('treeid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
  WHERE lvl < 7
)
SELECT id, parent_id FROM t;
DROP SEQUENCE treeid;
COMMIT;

which again inserted 1,953,100 rows, this time in 77.2 seconds.

Finding a specific row §

(NOTE: I’m running these with max_parallel_workers_per_gather = 0)

Let’s take a look at the last row we inserted, row 1953100:

SELECT * FROM mytree WHERE id = 1953100;
-- "id","parent_id"
-- 1953100,390600
-- took: 2ms
SELECT * FROM myltree WHERE path ~ '*.1953100';
-- "path"
-- "100.600.3100.15600.78100.390600.1953100"
-- took: 570ms

We can see that both queries return a row where the parent is 390600, so it looks like our datasets are similar.

What the two queries don’t agree on is the time it takes to find a row, 570 ms vs 2 ms.

If you want to take a sidetrack with me and figure out why this is, I’ll dig into indexing over an ltree in this post.

The summary is that when we do WHERE path ~ '*.1953100', that condition can’t use our primary key btree index to find the row, and we end up having to scan the entire table.

Moving forward, I’ll be using a GIST index over path with siglen=124.
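
For reference, that index can be created like this (same approach as the index we’ll add to the hybrid table later; the index name is just illustrative):

CREATE INDEX myltree_path_gist_idx ON myltree USING GIST (path gist_ltree_ops(siglen=124));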

Of course, the ID of our ltree table isn’t just 1953100, so this isn’t a fair comparison. The ID of the myltree table is the full path, so let’s try and use that:

SELECT * FROM myltree WHERE path = '100.600.3100.15600.78100.390600.1953100';
-- "path"
-- "100.600.3100.15600.78100.390600.1953100"
-- took: 2ms

So if we have the full ID, the performance of looking up a row is similar.

ltree allows us to write much simpler queries, but it also gives us some additional features.

Example: Count LCA descendants §

Let’s say we know the id of two rows:

  • 100.600.3100.15600.65600.190600
  • 100.600.3100.15600.78100.390600.1953100

And we are interested in finding the number of descendants contained by the longest common ancestor of those two rows.

With ltree, that’s really easy:

SELECT count(*)
FROM myltree
WHERE lca('100.600.3100.15600.65600.190600', '100.600.3100.15600.78100.390600.1953100') @> path;
-- "count"
-- 156

Example: Find all siblings §

SELECT * FROM myltree
WHERE subpath('100.600.3100.15600.65600.190600', 0, -1) @> path
AND nlevel('100.600.3100.15600.65600.190600') = nlevel(path);
-- "path"
-- "100.600.3100.15600.65600.128100"
-- "100.600.3100.15600.65600.190600"
-- "100.600.3100.15600.65600.253100"
-- "100.600.3100.15600.65600.315600"
-- "100.600.3100.15600.65600.378100"

Adding a row? §

Adding a row to our mytree table is pretty easy. We need to know the parent we want to create a child for.

INSERT INTO mytree (parent_id) VALUES (3);

Inserting into our myltree table is mostly the same.

However, we don’t automatically get a new ID to use for this node. We could make up a value, or we could create a sequence to use each time we create a new row.

Using a sequence is closest to the parent/child relationship in mytree, so let’s assume we have a sequence available:

CREATE SEQUENCE myltreeseq;

Inserting a new row then becomes:

-- '3'::ltree being the parent path
INSERT INTO myltree (path) VALUES ('3'::ltree || text2ltree(nextval('myltreeseq')::text));

There is nothing to protect us against inserting a row where the parent no longer exists. If we wanted to ensure that, we’d need to write a trigger that blocks the insert if no parent exists in the table.
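
A minimal sketch of what such a trigger could look like (the function and trigger names are made up for illustration; this only guards inserts, deletes of parents would need similar handling):

CREATE FUNCTION trigger_check_ltree_parent_exists() RETURNS TRIGGER AS $$
DECLARE
  parentpath ltree;
BEGIN
  -- root rows (a single label) have no parent to validate
  IF nlevel(NEW.path) > 1 THEN
    parentpath := subpath(NEW.path, 0, nlevel(NEW.path) - 1);
    IF NOT EXISTS (SELECT 1 FROM myltree WHERE path = parentpath) THEN
      RAISE EXCEPTION 'Parent path % does not exist', parentpath;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_ltree_parent_exists
  BEFORE INSERT ON myltree
  FOR EACH ROW
  EXECUTE FUNCTION trigger_check_ltree_parent_exists();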

Moving a row? §

In our mytree table, moving a row so that it is parented to another row is trivial:

UPDATE mytree SET parent_id = 3 WHERE id = 4;

The story is a bit different with an ltree.
Since all of our rows are independent of each other, if we need to move a row, we also need to move all of that row’s descendants. Otherwise, their paths won’t change.

This is actually a feature of ltree, but in our specific scenario, we are not interested in having a subpath in our path that isn’t also a valid row on its own, so we need to take some care to move everything correctly:

-- assuming we want to move 2.4 so that it becomes 1.3.4:
UPDATE myltree SET path = '1.3' || subpath(path, nlevel('2.4') - 1) WHERE path <@ '2.4';

The obvious downside is that we need to update n rows instead of 1. For large trees this might get very expensive.

Hybrid approach §

We like having a single integer id for our row. We also like having a parent_id with a foreign key to that row, such that we can cascade deletes and ensure a parent exists on insert.

But what if we also like the features ltree provides? Could we build a table where we could maintain the relationships between the rows with our parent_id column, but also have a path column to query?

My first idea was a materialized view, but since we can’t incrementally refresh those in postgres (yet), we’d have to rebuild the entire view every time we wanted to INSERT/UPDATE/DELETE.

My second idea was to write a trigger to maintain the ltree when a row changes. So that’s what I did:

-- hybrid table:
CREATE TABLE hybrid (
  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  parent_id bigint REFERENCES hybrid(id) ON DELETE CASCADE,
  path ltree
);
-- index on the path ltree
CREATE INDEX path_gist_idx ON hybrid USING GIST (path gist_ltree_ops(siglen=124));

-- trigger to update path if parent_id changes
CREATE FUNCTION trigger_set_path_from_parent() RETURNS TRIGGER AS $$
DECLARE
  parentpath ltree;
  currentnode ltree;
BEGIN
  -- On INSERT, nobody has seen this new row before, so we can safely just insert it
  -- and construct the path from its parent
  IF TG_OP = 'INSERT' THEN
    IF NEW.parent_id IS NULL THEN
      NEW.path = text2ltree(NEW.id::text);
    ELSE
      SELECT ltree_addtext(path, NEW.id::text) INTO NEW.path FROM hybrid WHERE id = NEW.parent_id;
    END IF;
  -- On UPDATE, when a row's parent changes, we also need to patch the path of all of its children.
  ELSIF TG_OP = 'UPDATE' AND NEW.parent_id IS DISTINCT FROM OLD.parent_id THEN
    -- we want to move the paths such that all descendants of NEW.id get their paths patched.
    currentnode = text2ltree(NEW.id::text);
    SELECT path INTO parentpath FROM hybrid WHERE id = NEW.parent_id;
    IF index(parentpath, currentnode) > -1 THEN
      RAISE EXCEPTION 'Circular reference. Tried to set parent to % for row %, but % is already in parent path %',
        NEW.parent_id, NEW.id, NEW.id, parentpath;
    ELSE
      NEW.path = parentpath || currentnode;
    END IF;
    UPDATE hybrid
      SET path = NEW.path || subpath(path, nlevel(OLD.path))
      WHERE path <@ OLD.path AND id <> OLD.id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_tree_path
  BEFORE INSERT OR UPDATE ON hybrid
  FOR EACH ROW
  EXECUTE FUNCTION trigger_set_path_from_parent();
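
To sanity check the trigger, we can insert a root row and a child row into an empty hybrid table and look at the paths it generates (the ids below assume a fresh identity sequence, so treat them as illustrative):

INSERT INTO hybrid (parent_id) VALUES (NULL); -- say this row gets id 1, so its path becomes '1'
INSERT INTO hybrid (parent_id) VALUES (1);    -- gets id 2, so its path becomes '1.2'
SELECT id, parent_id, path FROM hybrid ORDER BY id;
-- "id","parent_id","path"
-- 1,NULL,"1"
-- 2,1,"1.2"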

Now, let’s see how long it takes to generate the example dataset from earlier. Note that we now have a trigger that needs to run for each row, and a GIST index that needs to be maintained for the path column:

-- hybrid dataset
BEGIN;
TRUNCATE TABLE hybrid;
CREATE TEMPORARY SEQUENCE hybridid;
INSERT INTO hybrid (id, parent_id)
WITH RECURSIVE t(id, parent_id, lvl) AS
(
  SELECT nextval('hybridid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
  UNION ALL
  SELECT nextval('hybridid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
  WHERE lvl < 7
)
SELECT id, parent_id FROM t;
DROP SEQUENCE hybridid;
COMMIT;
-- took: 229.2s

Ok, so it took quite a bit longer to insert those rows. Let’s test out some queries.

EXPLAIN (VERBOSE, ANALYZE)
SELECT * FROM hybrid WHERE path <@ '11.311.2811.7811.20311';
-- Bitmap Heap Scan on public.hybrid  (cost=17.92..750.15 rows=195 width=78) (actual time=0.120..0.486 rows=31 loops=1)
--   Output: id, parent_id, path
--   Recheck Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
--   Heap Blocks: exact=31
--   ->  Bitmap Index Scan on path_gist_idx  (cost=0.00..17.88 rows=195 width=0) (actual time=0.093..0.102 rows=31 loops=1)
--         Index Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
-- Planning Time: 0.243 ms
-- Execution Time: 0.896 ms

And the recursive CTE way:

EXPLAIN (VERBOSE, ANALYZE)
WITH RECURSIVE
  mydescendants AS (
    SELECT * FROM hybrid WHERE id = 20311
    UNION ALL
    SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
  )
SELECT * FROM mydescendants;
-- CTE Scan on mydescendants  (cost=533247.96..533257.98 rows=501 width=48) (actual time=0.079..84677.808 rows=31 loops=1)
--   Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
--   CTE mydescendants
--     ->  Recursive Union  (cost=0.43..533247.96 rows=501 width=78) (actual time=0.061..84677.223 rows=31 loops=1)
--           ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=78) (actual time=0.042..0.067 rows=1 loops=1)
--                 Output: hybrid.id, hybrid.parent_id, hybrid.path
--                 Index Cond: (hybrid.id = 20311)
--           ->  Hash Join  (cost=0.33..53322.95 rows=50 width=78) (actual time=11531.735..28225.489 rows=10 loops=3)
--                 Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
--                 Hash Cond: (hybrid_1.parent_id = mydescendants_1.id)
--                 ->  Seq Scan on public.hybrid hybrid_1  (cost=0.00..45998.00 rows=1953100 width=78) (actual time=0.017..14011.599 rows=1953100 loops=3)
--                       Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
--                 ->  Hash  (cost=0.20..0.20 rows=10 width=8) (actual time=85.103..85.110 rows=10 loops=3)
--                       Output: mydescendants_1.id
--                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
--                       ->  WorkTable Scan on mydescendants mydescendants_1  (cost=0.00..0.20 rows=10 width=8) (actual time=84.907..84.994 rows=10 loops=3)
--                             Output: mydescendants_1.id
-- Planning Time: 0.812 ms
-- JIT:
--   Functions: 11
--   Options: Inlining true, Optimization true, Expressions true, Deforming true
--   Timing: Generation 3.194 ms, Inlining 66.380 ms, Optimization 110.841 ms, Emission 77.258 ms, Total 257.673 ms
-- Execution Time: 84805.831 ms

Ouch. 84.8s is a little long to wait for 31 rows. Looking closer, there is a Seq Scan in there returning the majority of our rows. That should have been an Index Scan. I forgot to create an index on parent_id.

CREATE INDEX ON hybrid (parent_id);

And let’s try the above recursive query again:

EXPLAIN (VERBOSE, ANALYZE)
WITH RECURSIVE
  mydescendants AS (
    SELECT * FROM hybrid WHERE id = 20311
    UNION ALL
    SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
  )
SELECT * FROM mydescendants;
-- CTE Scan on mydescendants  (cost=1928.42..1938.44 rows=501 width=48) (actual time=0.085..4.217 rows=31 loops=1)
--   Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
--   CTE mydescendants
--     ->  Recursive Union  (cost=0.43..1928.42 rows=501 width=78) (actual time=0.064..3.387 rows=31 loops=1)
--           ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=78) (actual time=0.045..0.064 rows=1 loops=1)
--                 Output: hybrid.id, hybrid.parent_id, hybrid.path
--                 Index Cond: (hybrid.id = 20311)
--           ->  Nested Loop  (cost=0.43..191.00 rows=50 width=78) (actual time=0.476..0.858 rows=10 loops=3)
--                 Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
--                 ->  WorkTable Scan on mydescendants mydescendants_1  (cost=0.00..0.20 rows=10 width=8) (actual time=0.009..0.109 rows=10 loops=3)
--                       Output: mydescendants_1.id, mydescendants_1.parent_id, mydescendants_1.path
--                 ->  Index Scan using hybrid_parent_id_idx on public.hybrid hybrid_1  (cost=0.43..19.03 rows=5 width=78) (actual time=0.015..0.026 rows=1 loops=31)
--                       Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
--                       Index Cond: (hybrid_1.parent_id = mydescendants_1.id)
-- Planning Time: 0.223 ms
-- Execution Time: 4.796 ms

4.8 ms. That seems more reasonable. It’s still a few times slower than querying by path, but probably not detrimentally so. Depending on your tree’s depth and width, these differences will change as well, so remember to test with a dataset that reflects your domain.

Let’s try an update. We’ll move a single node with 780 descendants:

EXPLAIN (VERBOSE, ANALYZE)
UPDATE hybrid SET parent_id = 101 WHERE id = 1000;
-- Update on public.hybrid  (cost=0.43..8.45 rows=0 width=0) (actual time=54.660..54.685 rows=0 loops=1)
--   ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=14) (actual time=0.108..0.144 rows=1 loops=1)
--         Output: '101'::bigint, ctid
--         Index Cond: (hybrid.id = 1000)
-- Planning Time: 0.074 ms
-- Trigger RI_ConstraintTrigger_c_20050 for constraint hybrid_parent_id_fkey: time=0.287 calls=1
-- Trigger set_tree_path: time=53.347 calls=1
-- Execution Time: 55.168 ms

Contrast that with the same query on the table without the path and trigger:

EXPLAIN (VERBOSE, ANALYZE)
UPDATE mytree SET parent_id = 101 WHERE id = 1000;
-- Update on public.mytree  (cost=0.43..8.45 rows=0 width=0) (actual time=0.616..0.642 rows=0 loops=1)
--   ->  Index Scan using mytree_pkey on public.mytree  (cost=0.43..8.45 rows=1 width=14) (actual time=0.175..0.195 rows=1 loops=1)
--         Output: '101'::bigint, ctid
--         Index Cond: (mytree.id = 1000)
-- Planning Time: 0.558 ms
-- Trigger RI_ConstraintTrigger_c_19896 for constraint mytree_parent_id_fkey: time=0.370 calls=1
-- Execution Time: 1.122 ms

Our hybrid table takes 50 times longer to update in this example.

Finally, let’s take a look at the sizes of our relations:

Regular Table: §

  • table: 82 MB
  • primary key index: 42 MB
  • parent_id index: 22 MB

ltree Table: §

  • table: 177 MB
  • primary key index: 229 MB
  • path GIST index (siglen=124): 268 MB

Hybrid Table: §

  • table: 207 MB
  • primary key index: 42 MB
  • parent_id index: 22 MB
  • path GIST index (siglen=124): 250 MB

The ltree tables are significantly larger.
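
If you want to check these sizes yourself, one way is to ask pg_class; adjust the relation names to match your own tables and indexes:

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('mytree', 'myltree', 'hybrid') -- add the index names you want to measure as well
ORDER BY relname;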

Conclusion §

This was a very unscientific dive into ltree, but I still think we can draw some conclusions from it.

  • ltree is a really cool extension!
  • If we want to exactly mirror the behavior of having a parent_id column, ltree alone is probably not the solution.
  • For a small set of read-heavy workloads, a hybrid approach might be worth investigating, especially if you have queries where you can take advantage of the ltree properties.
  • A plain table with a parent_id column and recursive CTE queries is probably still the best option for most use cases; the added complexity of a hybrid approach (dealing with triggers, the possibility of bugs in those triggers, and so on) is probably not worth it.

There are other use cases where ltree would be a better fit.
A category tree, where your categories fit directly into the constraint of what a label can contain, would probably work really well.
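
For example (purely illustrative, not part of the dataset above), category slugs can be used directly as ltree labels:

CREATE TABLE categories (path ltree PRIMARY KEY);
INSERT INTO categories (path)
VALUES ('electronics'),
       ('electronics.computers'),
       ('electronics.computers.laptops'),
       ('electronics.phones');
-- everything under (and including) electronics.computers
SELECT * FROM categories WHERE path <@ 'electronics.computers';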

ltree also enables use cases where a thing can live in many places at once, like dependency trees, labels (obviously), etc.

Modelling tags/labels with ltree is something I’d like to explore in the future.