AWS RDS Aurora PostgreSQL Masking Functions

February 1, 2024

Protecting your data is important, both from people who want to get it, and more importantly from your own staff who are only-human, and thus subject to making mistakes.

Using Production data can be very useful for all kinds of debugging and testing. But you don’t want to expose confidential production data to staff that do not need to see it.

Masking is useful as you export data from Production and load it into your test/debug environment.

Below are a few PostgreSQL functions to get you started…

Thanks to Szymon Lipiński for useful info on this topic.

In general, NULL and zero-length string inputs are returned unchanged, otherwise a randomized string of the same length as the original is returned. PARALLEL RESTRICTED for those that use random(). md5 is used over a single random() call, to get 32 alphanumeric characters from that single random() call.

Update Mar 19 2024: Functions I actually used provided.

-- generate a random number, will be a 0.[16 digits] decimal
-- cast to text, then used as bytea intput to md5 function
-- md5 will output 32 alpha numeric characters
-- do this as many times as you need to get at least the desired length
-- joining here with space in between
-- trim back to desired length at the end
-- we need the greatest wrapper otherwise generate_series returns null
-- which on an empty string causes the output to be null, we need empty
-- string input to be empty string output
CREATE OR REPLACE FUNCTION mask_alpha_num_space(orig TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(
         (SELECT string_agg(md5(random()::TEXT), ' ')
          FROM generate_series(1, GREATEST(1, GREATEST(1, CEIL(length(orig) / 32.)::integer)))),
         1,
         length(orig));
$$;

-- same as above, but with join of empty string so no spaces
CREATE OR REPLACE FUNCTION mask_alpha_num(orig TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(
         (SELECT string_agg(md5(random()::TEXT), '')
          FROM generate_series(1, GREATEST(1, CEIL(length(orig) / 32.)::integer))),
         1,
         length(orig));
$$;

-- generate a random number, will be a 0.[16 digits] decimal
-- cast to text, then get the 16 digits after the decimal
-- do again and again if need longer
-- substring at the end to the desired length
CREATE OR REPLACE FUNCTION mask_num(orig TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(
         (SELECT string_agg(substring(random()::TEXT, 3), '')
          FROM generate_series(1, GREATEST(1, CEIL(length(orig) / 16.)::integer))),
         1,
         length(orig))
$$;

-- same as above but here we have the length desired
CREATE OR REPLACE FUNCTION mask_num(len INTEGER)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(
         (SELECT string_agg(substring(random()::TEXT, 3), '')
          FROM generate_series(1, GREATEST(1, CEIL(len / 16.)::integer))),
         1,
         len)
$$;

-- a random 3 digit + random 10 alphanumeric + ' St'
-- leave the input even though unused so that if the
-- input is null then PG won't bother calling function
CREATE OR REPLACE FUNCTION mask_street(orig TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(random()::TEXT, 3, 3)
         || ' '
         || substring(md5(random()::TEXT), 1, 10)
         || ' St';
$$;

-- create a base64 encoded sha256 has of the input
CREATE OR REPLACE FUNCTION mask_hash(inp TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL SAFE
AS
$$
SELECT encode(sha256(convert_to(inp, 'UTF8')), 'base64');
$$;

-- generate a random email address
-- leave the unused input so PG will skip calling function if it is null
CREATE OR REPLACE FUNCTION mask_email(str TEXT)
  RETURNS TEXT
  LANGUAGE SQL
  RETURNS NULL ON NULL INPUT
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT substring(md5(random()::TEXT), 1, 10) || '@example.com';
$$;

-- if the flag is FALSE then return the orig string as is
-- otherwise mask the string
-- this function DOES NOT return null on null input
CREATE OR REPLACE FUNCTION mask_alpha_num_flagged(str TEXT, bool BOOLEAN)
  RETURNS TEXT
  LANGUAGE SQL
  PARALLEL RESTRICTED -- because random
AS
$$
SELECT CASE
         WHEN str IS NULL OR NOT bool THEN
           str
         ELSE
           substring(
             (SELECT string_agg(md5(random()::TEXT), '')
              FROM generate_series(1, GREATEST(1, CEIL(length(str) / 32.)::integer))),
             1,
             length(str))
         END;
$$;