Cleaning up the database

Often, a database can contain several unused objects or very old data. Cleaning up these objects helps administrators perform a backup of images more quickly. From a development point of view, unused objects are noise because they affect the refactoring process.

In database applications, one needs to keep the database clean as unused database objects might hinder quick development due to those objects' dependencies. To clean the database, one needs to identify the unused database objects, including tables, views, indexes, and functions.

Table statistics, such as the number of live rows, index scans, and sequential scans, can help identify empty and unused tables. Note that the following queries are based on statistics, so the results need to be validated. The pg_stat_user_tables table provides this information, and the following query shows empty tables by checking the number of tuples:

SELECT relname FROM pg_stat_user_tables WHERE n_live_tup= 0;

All information based on PostgreSQL statistics are not one hundred percent bulletproof, because the statistics might not be up to date.

To find the empty columns or unused columns, one can have a look at the null_fraction attribute of the pg_stats table. If the null_fraction equals one, then the column is completely empty:

SELECT schemaname, tablename, attname FROM pg_stats WHERE null_frac= 1 and schemaname NOT IN ('pg_catalog', 'information_schema');

To find the useless indexes, two steps can be applied:

The first technique is to determine whether an index is duplicated or overlapped with another index,
and the second t step is to assess whether the index is used based on the index statistics.

The following query can be used to assess whether an index is used based on catalog statistics. Note that , the constraint indexes--the unique constraints and primary keys--are excluded because we need these indexes even though they are not used:

SELECT schemaname, relname, indexrelname FROM pg_stat_user_indexes s JOIN pg_index i ON s.indexrelid = i.indexrelid WHERE idx_scan=0 AND NOT indisunique AND NOT indisprimary;

Overlapping index attributes can be used to identify duplicate indexes. The following SQL code compares the index attributes with each other and return indexes that overlap attributes:

WITH index_info AS (
SELECT 
    pg_get_indexdef(indexrelid) AS index_def, 
    indexrelid::regclass
    index_name , 
    indrelid::regclass table_name, array_agg(attname order by attnum) AS index_att
 FROM
     pg_index i JOIN
     pg_attribute a ON i.indexrelid = a.attrelid
GROUP BY 
    pg_get_indexdef(indexrelid), indrelid, indexrelid
 ) SELECT DISTINCT
    CASE WHEN a.index_name > b.index_name THEN a.index_def ELSE b.index_def END AS index_def,
    CASE WHEN a.index_name > b.index_name THEN a.index_name ELSE b.index_name END AS index_name,
    CASE WHEN a.index_name > b.index_name THEN b.index_def ELSE a.index_def END AS overlap_index_def,
    CASE WHEN a.index_name > b.index_name THEN b.index_name ELSE a.index_name END AS overlap_index_name,
    a.index_att = b.index_att as full_match,
    a.table_name
 FROM
     index_info a INNER JOIN 
    index_info b ON (a.index_name != b.index_name AND a.table_name = b.table_name AND a.index_att && b.index_att );

To test the preceding query, let us create an overlapping index as follows and execute the query :

CREATE TABLE test_index_overlap(a int, b int);
CREATE INDEX ON test_index_overlap (a,b);
CREATE INDEX ON test_index_overlap (b,a);

The result of the preceding query is as follows:

-[ RECORD 1 ]------+--------------------------------------------------------------------------------------------------------------------------------
index_def | CREATE INDEX test_index_overlap_b_a_idx ON test_index_overlap USING btree (b, a)
index_name | test_index_overlap_b_a_idx
overlap_index_def | CREATE INDEX test_index_overlap_a_b_idx ON test_index_overlap USING btree (a, b)
overlap_index_name | test_index_overlap_a_b_idx
full_match | f
table_name | test_index_overlap

Cleaning up unused views and functions is a little bit tricky. By default, PostgreSQL collects statistics about indexes and tables but not functions. To enable statistics collection on functions, the track_functions setting needs to be enabled. The statistics on functions can be found in the pg_stat_user_functions table.

For views, there are no statistics collected unless the views are materialized. In order to assess whether a view is used, we need to do this manually. This can be done by rewriting the view and joining it with a function with a certain side effect, the joined function for example update a table and increase the number of times the view is accessed, or raise a certain log message. To test this technique, let's write a simple function that raises a log, as follows:

CREATE OR REPLACE FUNCTION monitor_view_usage (view_name TEXT) RETURNS BOOLEAN AS $$
BEGIN
    RAISE LOG 'The view % is used on % by % ', view_name, current_time, session_user;
    RETURN TRUE;
END;
$$LANGUAGE plpgsql cost .001;

Now, let's assume that we want to drop the dummy_view view, however, there is uncertainty regarding whether a certain application depends on it, as follows:

CREATE OR REPLACE VIEW dummy_view AS
SELECT dummy_text FROM (VALUES('dummy')) as dummy(dummy_text);

To ensure that the view is not used, the view should be rewritten as follows, with the monitor_view_usage function used:

-- Recreat the view and inject the monitor_view_usage
CREATE OR REPLACE VIEW dummy_view AS 
SELECT dummy_text FROM (VALUES('dummy')) as dummy(dummy_text) cross join monitor_view_usage('dummy_view');

If the view is accessed, an entry in the log file should appear, as follows:

$ tail /var/log/postgresql/postgresql-10-main.log
2017-10-10 16:54:15.375 CEST [21874] postgres@postgres LOG: The view dummy_view is used on 16:54:15.374124+02 by postgres

Table of Contents for Cleaning up the database

Create new playlist

Sign In

Sign Up

Table of Contents for
Cleaning up the database