PostgreSQL Export ntile() bucketing

March 19, 2024

This post concludes recent work and commentary on exporting masked data from one AWS Aurora PostgreSQL environment to another.

We were able to export masked data successfully and import it without issue. The export was significantly slower than the import, at least in our case, where many of the exports required joins in order to restrict which data was exported.

AWS RDS Aurora PostgreSQL has special features that contain low-memory situations by cancelling the offending query rather than letting the server hit OOM and crash, but the problem remains: the data still cannot be exported.

To avoid the dreaded OOM killer that can strike when an OLAP-style query runs awry, it is important to break large exports up into smaller parts. Smaller parts also make it possible to run the exports in parallel.

The PostgreSQL ntile() window function is useful here. Given a table and an ordering over a set of columns (ideally a unique index), ntile(n) assigns each row to one of n buckets of roughly equal size. Because the ordering columns are returned alongside the bucket number, you can then work out the range of rows that each bucket covers.

For example, given this simple two column table with a primary key and some sample data:

CREATE TABLE mydata(
  col1 INTEGER NOT NULL,
  col2 INTEGER NOT NULL,
  PRIMARY KEY(col1, col2));

INSERT INTO mydata(col1,col2) VALUES 
(1,1) ,(1,2) ,(1,3) ,(1,4) ,(1,5),
(2,50),(2,40),(2,30),(2,20),(2,10),
(3,5) ,(3,4) ,(3,3) ,(3,2) ,(3,1),
(4,99),(4,98),(4,97),(5,1) ,(5,2);

And say you want to break those 20 rows into 5 roughly equally sized buckets:

SELECT ntile(5) 
  OVER (ORDER BY col1,col2) bucket, col1, col2
FROM mydata ORDER BY bucket, col1, col2;

bucket	col1	col2
------	----	----
1	1	1
1	1	2
1	1	3
1	1	4
2	1	5
2	2	10
2	2	20
2	2	30
3	2	40
3	2	50
3	3	1
3	3	2
4	3	3
4	3	4
4	3	5
4	4	97
5	4	98
5	4	99
5	5	1
5	5	2

This gives you each row from the table along with the bucket it belongs to. One thing to note in this example is that the boundaries must be treated as a composite key, not as two independent columns. The start of bucket 5 is (4,98); if you simplistically took the minimum of each column within the bucket you would get (4,1), which would be incorrect and would cause buckets 4 and 5 to overlap.
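To see the pitfall concretely, here is a quick check (my own illustration, not part of the export workflow) that takes the per-column minimum within each bucket. For bucket 5 it reports (4,1) rather than the true start of (4,98), and bucket 3 comes out as (2,1) instead of (2,40):

SELECT bucket, min(col1) min_col1, min(col2) min_col2
FROM (
     SELECT  col1, col2
            ,ntile(5) OVER (ORDER BY col1,col2) bucket
     FROM mydata
    ) parts
GROUP BY bucket
ORDER BY bucket;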

After much trial and error I arrived at this query, which gives you the starting composite key for each bucket:

SELECT DISTINCT ON(bucket) bucket, col1, col2
FROM (
     SELECT  col1, col2
            ,ntile(5) OVER (ORDER BY col1,col2) bucket
     FROM mydata
    ) parts
ORDER BY bucket, col1, col2;

bucket	col1	col2
------	----	----
1	1	1
2	1	5
3	2	40
4	3	3
5	4	98
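As a possible refinement (not something the query above does), the lead() window function can pair each bucket's starting key with the starting key of the next bucket, so both boundaries of a bucket come back in a single row; the last bucket's next_* columns are NULL:

SELECT bucket, col1, col2
      ,lead(col1) OVER (ORDER BY bucket) next_col1
      ,lead(col2) OVER (ORDER BY bucket) next_col2
FROM (
     SELECT DISTINCT ON(bucket) bucket, col1, col2
     FROM (
          SELECT  col1, col2
                 ,ntile(5) OVER (ORDER BY col1,col2) bucket
          FROM mydata
         ) parts
     ORDER BY bucket, col1, col2
    ) starts
ORDER BY bucket;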

Then when exporting you can use PostgreSQL's row constructor syntax to compare against the composite key as a single value. For example, to select bucket 4:

SELECT col1, col2
FROM mydata
WHERE (col1, col2) >= (3,3)
  AND (col1, col2) <  (4,98);

For the first bucket use only < (start of the 2nd bucket), for the last bucket use only >= (start of the last bucket), and for the buckets in between use both bounds, as in the example above.
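Putting it all together, here is a sketch of what the per-bucket exports might look like using psql's \copy (the file names are hypothetical, and \copy is my assumption; it runs client-side, which matters on RDS/Aurora where you cannot write to the server's filesystem):

-- First bucket: everything before the start of bucket 2.
\copy (SELECT col1, col2 FROM mydata WHERE (col1,col2) < (1,5)) TO 'bucket_1.csv' WITH (FORMAT csv)

-- A middle bucket (bucket 4): bounded on both sides.
\copy (SELECT col1, col2 FROM mydata WHERE (col1,col2) >= (3,3) AND (col1,col2) < (4,98)) TO 'bucket_4.csv' WITH (FORMAT csv)

-- Last bucket: everything from the start of bucket 5 onward.
\copy (SELECT col1, col2 FROM mydata WHERE (col1,col2) >= (4,98)) TO 'bucket_5.csv' WITH (FORMAT csv)

Each \copy can be run in its own psql session, which is how the buckets end up exporting in parallel.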