PostgreSQL Data Deduplication Methods

 

Background

Deduplication is a common task with several variants. For example:

1) Remove duplicates from a single column.

2) Remove duplicates from multiple columns.

3) Remove duplicates by rows.

4) Multi-column hybrid deduplication (for example, ROW1: col1=1, col2=2 and ROW2: col1=2, col2=1 count as duplicates).

For variants 1 through 4, the retention rule is the same: retain the latest or oldest record, or the record with the largest value of a specified field.

5) If multiple versions of a record exist because of pg_resetwal, decide which versions to delete; for example, retain only the version with the largest XMIN value.

create unique index idx on tbl (c1,c2);
-- fails with a duplicate-key error while duplicates exist

-- keep, per (c1,c2) group, the version with the largest XMIN; rows with a NULL in c1 or c2 are all retained
delete from tbl where ctid not in (
  select ctid from (select ctid, row_number() over(partition by c1,c2 order by (xmin)::text::int8 desc) as rn from tbl) t where rn=1
  union all
  select ctid from tbl where c1 is null or c2 is null
);
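Once the delete commits, the unique index that failed above should now succeed and will block future duplicate (c1,c2) pairs (assuming no concurrent writers reintroduce duplicates in the meantime):

create unique index idx on tbl (c1,c2);  -- should now succeed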

The following examples show how to use different methods to remove duplicates. Choose the one that works best for you.

Single-column Deduplication

Consider the following test data.

create table test1(id int primary key, c1 int, c2 timestamp);  
insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();  
  
create index idx_test1 on test1(c1,id);    
-- This index speeds up the queries below. Without it, of the three methods that follow, the second is the most efficient, followed by the third.

Requirement: Remove repeated rows in c1 and retain the rows with the largest ID.

Method 1) Use Aggregation, NOT IN

postgres=# explain delete from test1 where id not in (select max(id) from test1 group by c1);  
                                                    QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------  
 Delete on test1  (cost=35115.63..53023.01 rows=500055 width=6)  
   ->  Seq Scan on test1  (cost=35115.63..53023.01 rows=500055 width=6)  
         Filter: (NOT (hashed SubPlan 1))  
         SubPlan 1  
           ->  GroupAggregate  (cost=0.42..35113.13 rows=1001 width=8)  
                 Group Key: test1_1.c1  
                 ->  Index Only Scan using idx_test1 on test1 test1_1  (cost=0.42..30102.57 rows=1000110 width=8)  
(7 rows)  
Time: 0.564 ms  
  
postgres=# delete from test1 where id not in (select max(id) from test1 group by c1);  
DELETE 998999  
Time: 1126.504 ms (00:01.127) 
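As a quick sanity check (the same verification is shown after Method 3), one row per distinct c1 value should remain:

select count(*) from test1;  -- expect 1001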

Method 2) Use Window Query, IN

Note that ORDER BY id inside the window marks the smallest ID in each group as rn=1, so this variant as written retains the smallest ID; use ORDER BY id DESC to retain the largest.

postgres=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;  
                                            QUERY PLAN                                              
--------------------------------------------------------------------------------------------------  
 Subquery Scan on t  (cost=0.42..60075.54 rows=995109 width=4)  
   Filter: (t.rn <> 1)  
   ->  WindowAgg  (cost=0.42..47574.17 rows=1000110 width=16)  
         ->  Index Only Scan using idx_test1 on test1  (cost=0.42..30072.24 rows=1000110 width=8)  
(4 rows)  
Time: 0.512 ms  
  
postgres=# delete from test1 where id in (select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1);  
DELETE 998999  
Time: 2430.276 ms (00:02.430)  

Method 3) Use PL/pgSQL with an Internal Sort and Cursor

This approach costs only one sort; each record is then compared once with its predecessor. Because the cursor scans in (c1, id) order, it retains the row with the smallest ID per group; order by c1, id desc would retain the largest.

do language plpgsql $$
declare
  v_rec record;
  v_c1 int;
  -- scan in (c1,id) order and lock the rows so CURRENT OF can be used
  cur1 cursor for select c1,id from test1 order by c1,id for update;
begin
  for v_rec in cur1 loop
    -- same c1 as the previous row: this is a duplicate, delete it
    if v_rec.c1 = v_c1 then
      delete from test1 where current of cur1;
    end if;
    v_c1 := v_rec.c1;
  end loop;
end;
$$;
  
DO  
Time: 7345.773 ms (00:07.346)  
  
postgres=# select count(*) from test1;  
 count   
-------  
  1001  
(1 row)  
  
Time: 61.672 ms  
postgres=# select * from test1 limit 10;  
 id | c1  |             c2               
----+-----+----------------------------  
  1 | 582 | 2017-06-02 10:21:10.60918  
  2 | 278 | 2017-06-02 10:21:10.609331  
  3 | 659 | 2017-06-02 10:21:10.609338  
  4 | 372 | 2017-06-02 10:21:10.609341  
  5 | 184 | 2017-06-02 10:21:10.609343  
  6 | 121 | 2017-06-02 10:21:10.609345  
  7 | 132 | 2017-06-02 10:21:10.609347  
  8 | 290 | 2017-06-02 10:21:10.609348  
  9 | 980 | 2017-06-02 10:21:10.60935  
 10 | 305 | 2017-06-02 10:21:10.609352  
(10 rows)  

Cutting Edge Technology in PostgreSQL 10

An index can assist a sort even when it covers only a leading prefix of the ORDER BY columns; for example, an index on (c1) can be used for ORDER BY c1, id.
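A quick way to observe this (a sketch; the plan shape depends on the server version, the indexes present, and planner settings; here assuming only a hypothetical index on (c1) exists):

create index idx_c1_only on test1(c1);
explain select * from test1 order by c1, id;
-- look for the (c1) index feeding the sort on (c1, id)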

Multi-column Deduplication

Consider the following test data.

create table test1(id int primary key, c1 int, c2 int, c3 timestamp);  
insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();  
  
create index idx_test1 on test1(c1,c2,id);    
-- This index speeds up the queries below.

Requirement: Remove repeated rows in c1 and c2 and retain the rows with the largest ID.

Method 1

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);  
                                                                                 QUERY PLAN                                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1634.960..1634.960 rows=0 loops=1)  
   Buffers: shared hit=1378788  
   ->  Seq Scan on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1090.956..1446.374 rows=367618 loops=1)  
         Output: test1.ctid  
         Filter: (NOT (hashed SubPlan 1))  
         Rows Removed by Filter: 632382  
         Buffers: shared hit=1011170  
         SubPlan 1  
           ->  GroupAggregate  (cost=0.42..40570.36 rows=100000 width=12) (actual time=0.035..842.497 rows=632382 loops=1)  
                 Output: max(test1_1.id), test1_1.c1, test1_1.c2  
                 Group Key: test1_1.c1, test1_1.c2  
                 Buffers: shared hit=1004800  
                 ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..32070.36 rows=1000000 width=12) (actual time=0.027..587.506 rows=1000000 loops=1)  
                       Output: test1_1.c1, test1_1.c2, test1_1.id  
                       Heap Fetches: 1000000  
                       Buffers: shared hit=1004800  
 Planning time: 0.211 ms  
 Execution time: 1641.679 ms  
(18 rows)  

Method 2

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id in (select id from (select row_number() over(partition by c1,c2 order by id) as rn, id from test1) t where t.rn<>1);  
                                                                                      QUERY PLAN                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=83752.89..130385.27 rows=995000 width=34) (actual time=2199.376..2199.376 rows=0 loops=1)  
   Buffers: shared hit=1378741, temp read=6482 written=6420  
   ->  Hash Semi Join  (cost=83752.89..130385.27 rows=995000 width=34) (actual time=1381.636..1929.284 rows=367584 loops=1)  
         Output: test1.ctid, t.*  
         Hash Cond: (test1.id = t.id)  
         Buffers: shared hit=1011157, temp read=6482 written=6420  
         ->  Seq Scan on public.test1  (cost=0.00..16370.00 rows=1000000 width=10) (actual time=0.013..140.130 rows=1000000 loops=1)  
               Output: test1.ctid, test1.id  
               Buffers: shared hit=6370  
         ->  Hash  (cost=64513.39..64513.39 rows=995000 width=32) (actual time=1377.349..1377.349 rows=367584 loops=1)  
               Output: t.*, t.id  
               Buckets: 65536  Batches: 32  Memory Usage: 1326kB  
               Buffers: shared hit=1004787, temp written=2591  
               ->  Subquery Scan on t  (cost=0.42..64513.39 rows=995000 width=32) (actual time=0.074..1269.919 rows=367584 loops=1)  
                     Output: t.*, t.id  
                     Filter: (t.rn <> 1)  
                     Rows Removed by Filter: 632416  
                     Buffers: shared hit=1004787  
                     ->  WindowAgg  (cost=0.42..52013.39 rows=1000000 width=20) (actual time=0.054..1117.668 rows=1000000 loops=1)  
                           Output: row_number() OVER (?), test1_1.id, test1_1.c1, test1_1.c2  
                           Buffers: shared hit=1004787  
                           ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..32013.39 rows=1000000 width=12) (actual time=0.035..627.329 rows=1000000 loops=1)  
                                 Output: test1_1.c1, test1_1.c2, test1_1.id  
                                 Heap Fetches: 1000000  
                                 Buffers: shared hit=1004787  
 Planning time: 0.565 ms  
 Execution time: 2199.450 ms  
(27 rows)  

Method 3

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_c1 int;   
  v_c2 int;    
  cur1 cursor for select c1,c2 from test1 order by c1,c2,id for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then  
      delete from test1 where current of cur1;  
    end if;  
    v_c1 := v_rec.c1;   
    v_c2 := v_rec.c2;  
  end loop;  
end;  
$$;  
DO  
  
Time: 4637.183 ms (00:04.637)  

Row-based Deduplication

Using ctid = any(array(select ctid from ...)) is the fastest way to remove duplicates when no key column is available.

Consider the following test data.

create table test1(c1 int, c2 int);  
insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);  
  
-- ctid is the row's physical address, a system column; an index cannot be created on it

Requirement: Remove repeated rows and keep one record.

When no primary key exists, deduplicate by the physical row identifier, ctid.

Method 1

With ctid, the NOT IN subplan cannot be hashed, so the materialized subquery is rescanned in a loop for every outer row, which is disastrous for performance. We do not recommend NOT IN here.

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);  
^CCancel request sent  
ERROR:  57014: canceling statement due to user request  
LOCATION:  ProcessInterrupts, postgres.c:2984  
Time: 426433.450 ms (07:06.433)  

postgres=# truncate test1;
TRUNCATE TABLE
postgres=# insert into test1 select random()*1000, random()*1000 from generate_series(1,10000); 
INSERT 0 10000
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);  
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test1  (cost=222.97..90895.39 rows=5085 width=8) (actual time=54.451..7741.146 rows=51 loops=1)
   Output: test1.c1, test1.c2
   Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 9949
   Buffers: shared hit=90
   SubPlan 1

     -- a loop occurs here: the materialized subplan is rescanned for each of the 10000 outer rows

     ->  Materialize  (cost=222.97..238.23 rows=1017 width=14) (actual time=0.001..0.297 rows=5000 loops=10000)
           Output: (max(test1_1.ctid)), test1_1.c1, test1_1.c2
           Buffers: shared hit=45
           ->  HashAggregate  (cost=222.97..233.14 rows=1017 width=14) (actual time=4.757..6.655 rows=9949 loops=1)
                 Output: max(test1_1.ctid), test1_1.c1, test1_1.c2
                 Group Key: test1_1.c1, test1_1.c2
                 Buffers: shared hit=45
                 ->  Seq Scan on public.test1 test1_1  (cost=0.00..146.70 rows=10170 width=14) (actual time=0.005..1.588 rows=10000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.ctid
                       Buffers: shared hit=45
 Planning time: 0.121 ms
 Execution time: 7741.277 ms
(18 rows)
  
Using a user-defined column does not have this problem (the subplan can be hashed); this has been reported to the community.

drop table test1;
create table test1(id int, c1 int, c2 int);  
insert into test1 select id, random()*1000, random()*1000 from generate_series(1,10000) t(id); 

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test1 where id not in (select max(id) from test1 group by c1,c2);
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test1  (cost=1048.18..1243.43 rows=5610 width=12) (actual time=11.762..13.627 rows=48 loops=1)
   Output: test1.id, test1.c1, test1.c2
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 9952
   Buffers: shared hit=110
   SubPlan 1
     ->  GroupAggregate  (cost=921.96..1045.38 rows=1122 width=12) (actual time=5.355..9.162 rows=9952 loops=1)
           Output: max(test1_1.id), test1_1.c1, test1_1.c2
           Group Key: test1_1.c1, test1_1.c2
           Buffers: shared hit=55
           ->  Sort  (cost=921.96..950.01 rows=11220 width=12) (actual time=5.350..6.101 rows=10000 loops=1)
                 Output: test1_1.c1, test1_1.c2, test1_1.id
                 Sort Key: test1_1.c1, test1_1.c2
                 Sort Method: quicksort  Memory: 853kB
                 Buffers: shared hit=55
                 ->  Seq Scan on public.test1 test1_1  (cost=0.00..167.20 rows=11220 width=12) (actual time=0.004..1.528 rows=10000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.id
                       Buffers: shared hit=55
 Planning time: 58.784 ms
 Execution time: 13.685 ms
(20 rows)

Method 2

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid = any(array( select ctid from (select row_number() over(partition by c1,c2 order by ctid) as rn, ctid from test1) t where t.rn<>1));  
                                                                                   QUERY PLAN                                                                                      
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=422032.41..427035.41 rows=500000 width=36) (actual time=3525.690..3525.690 rows=0 loops=1)  
   Buffers: shared hit=376073 dirtied=4398, temp read=10658 written=10683  
   ->  Merge Join  (cost=422032.41..427035.41 rows=500000 width=36) (actual time=3003.047..3352.172 rows=367223 loops=1)  
         Output: test1.ctid, t.*  
         Inner Unique: true  
         Merge Cond: (test1.ctid = t.ctid)  
         Buffers: shared hit=8850, temp read=10658 written=10683  
         ->  Sort  (cost=122873.59..125373.59 rows=1000000 width=6) (actual time=786.814..883.721 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.00 rows=1000000 width=6) (actual time=0.021..112.431 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=299158.81..299159.31 rows=200 width=36) (actual time=2216.021..2268.235 rows=367223 loops=1)  
               Output: t.*, t.ctid  
               Sort Key: t.ctid  
               Sort Method: external sort  Disk: 18688kB  
               Buffers: shared hit=4425, temp read=8701 written=8726  
               ->  Unique  (cost=294176.17..299151.17 rows=200 width=36) (actual time=1790.180..1949.522 rows=367223 loops=1)  
                     Output: t.*, t.ctid  
                     Buffers: shared hit=4425, temp read=6365 written=6390  
                     ->  Sort  (cost=294176.17..296663.67 rows=995000 width=36) (actual time=1790.179..1874.394 rows=367223 loops=1)  
                           Output: t.*, t.ctid  
                           Sort Key: t.ctid  
                           Sort Method: external merge  Disk: 18744kB  
                           Buffers: shared hit=4425, temp read=6365 written=6390  
                           ->  Subquery Scan on t  (cost=125069.59..160069.59 rows=995000 width=36) (actual time=692.878..1542.122 rows=367223 loops=1)  
                                 Output: t.*, t.ctid  
                                 Filter: (t.rn <> 1)  
                                 Rows Removed by Filter: 632777  
                                 Buffers: shared hit=4425, temp read=4022 written=4039  
                                 ->  WindowAgg  (cost=125069.59..147569.59 rows=1000000 width=22) (actual time=692.858..1401.210 rows=1000000 loops=1)  
                                       Output: row_number() OVER (?), test1_1.ctid, test1_1.c1, test1_1.c2  
                                       Buffers: shared hit=4425, temp read=4022 written=4039  
                                       ->  Sort  (cost=125069.59..127569.59 rows=1000000 width=14) (actual time=692.850..947.055 rows=1000000 loops=1)  
                                             Output: test1_1.ctid, test1_1.c1, test1_1.c2  
                                             Sort Key: test1_1.c1, test1_1.c2, test1_1.ctid  
                                             Sort Method: external merge  Disk: 25496kB  
                                             Buffers: shared hit=4425, temp read=4022 written=4039  
                                             ->  Seq Scan on public.test1 test1_1  (cost=0.00..14425.00 rows=1000000 width=14) (actual time=0.010..131.128 rows=1000000 loops=1)  
                                                   Output: test1_1.ctid, test1_1.c1, test1_1.c2  
                                                   Buffers: shared hit=4425  
 Planning time: 0.247 ms  
 Execution time: 3547.727 ms  
(46 rows)  

Method 3

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_c1 int;   
  v_c2 int;    
  cur1 cursor for select c1,c2 from test1 order by c1,c2,ctid for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then  
      delete from test1 where current of cur1;  
    end if;  
    v_c1 := v_rec.c1;   
    v_c2 := v_rec.c2;  
  end loop;  
end;  
$$;  
DO  
  
Time: 5395.774 ms (00:05.396)  

Multi-column Hybrid Deduplication

To deduplicate across multiple columns where the order of values does not matter, arrays are a natural fit. However, array comparison in PostgreSQL is order-sensitive: two arrays with the same elements in different orders are not equal.

postgres=# select array[1,2] = array[2,1];  
 ?column?   
----------  
 f  
(1 row)  
  
postgres=# select array[1,2] @> array[2,1] and array[2,1] @> array[1,1,2];  
 ?column?   
----------  
 t  
(1 row)  
  
postgres=# select array[1,2] @> array[2,2,1] and array[2,1] @> array[1,1,2];  
 ?column?   
----------  
 t  
(1 row)  

As the examples above show, mutual containment (@>) ignores element multiplicity, so it is not a reliable equality test either. Therefore, put the columns that require deduplication into an array and sort its elements before comparing.

Create a sort function that accepts any number of values and returns a sorted array.

postgres=# create or replace function sort_vals(variadic v_arr text[]) returns text[] as $$  
  select array_agg(arr order by arr) from unnest(v_arr) t(arr);  
$$ language sql strict;  
  
postgres=# select sort_vals('a','a','b','a','c');  
  sort_vals    
-------------  
 {a,a,a,b,c}  
(1 row)  
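With the elements sorted, order-insensitive comparison reduces to plain array equality:

select sort_vals('1','2') = sort_vals('2','1');  -- true: both sort to {1,2}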

Consider the following test data.

create table test1(c1 int, c2 int);  
insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);  

Requirement: Remove records repeated in c1 and c2 (for example, 1,2 and 2,1 are repeated records) and retain any record.

Method 1

Use sort_vals to sort and reorganize arrays. This method is simple and easy to understand.

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid = any(array (select ctid from (select row_number() over(partition by sort_vals(c1::text,c2::text) order by ctid) as rn, ctid from test1) t where t.rn<>1));  
                                                                                    QUERY PLAN                                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=963704.16..968707.16 rows=500000 width=36) (actual time=16466.913..16466.913 rows=0 loops=1)  
   Buffers: shared hit=576071, temp read=18863 written=18901  
   ->  Merge Join  (cost=963704.16..968707.16 rows=500000 width=36) (actual time=15766.506..16202.766 rows=567213 loops=1)  
         Output: test1.ctid, t.*  
         Inner Unique: true  
         Merge Cond: (test1.ctid = t.ctid)  
         Buffers: shared hit=8858, temp read=18863 written=18901  
         ->  Sort  (cost=122873.59..125373.59 rows=1000000 width=6) (actual time=782.867..880.729 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.00 rows=1000000 width=6) (actual time=0.009..110.757 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=840830.56..840831.06 rows=200 width=36) (actual time=14983.595..15066.186 rows=567213 loops=1)  
               Output: t.*, t.ctid  
               Sort Key: t.ctid  
               Sort Method: external sort  Disk: 28864kB  
               Buffers: shared hit=4433, temp read=16906 written=16944  
               ->  Unique  (cost=835847.92..840822.92 rows=200 width=36) (actual time=14316.637..14568.357 rows=567213 loops=1)  
                     Output: t.*, t.ctid  
                     Buffers: shared hit=4433, temp read=13298 written=13336  
                     ->  Sort  (cost=835847.92..838335.42 rows=995000 width=36) (actual time=14316.636..14456.355 rows=567213 loops=1)  
                           Output: t.*, t.ctid  
                           Sort Key: t.ctid  
                           Sort Method: external merge  Disk: 28952kB  
                           Buffers: shared hit=4433, temp read=13298 written=13336  
                           ->  Subquery Scan on t  (cost=409241.34..701741.34 rows=995000 width=36) (actual time=12177.370..13945.667 rows=567213 loops=1)  
                                 Output: t.*, t.ctid  
                                 Filter: (t.rn <> 1)  
                                 Rows Removed by Filter: 432787  
                                 Buffers: shared hit=4433, temp read=9679 written=9704  
                                 ->  WindowAgg  (cost=409241.34..689241.34 rows=1000000 width=46) (actual time=12177.303..13765.873 rows=1000000 loops=1)  
                                       Output: row_number() OVER (?), test1_1.ctid, (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text]))  
                                       Buffers: shared hit=4433, temp read=9679 written=9704  
                                       ->  Sort  (cost=409241.34..411741.34 rows=1000000 width=38) (actual time=12177.293..13163.065 rows=1000000 loops=1)  
                                             Output: test1_1.ctid, (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text]))  
                                             Sort Key: (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text])), test1_1.ctid  
                                             Sort Method: external merge  Disk: 51904kB  
                                             Buffers: shared hit=4430, temp read=9679 written=9704  
                                             ->  Seq Scan on public.test1 test1_1  (cost=0.00..274425.00 rows=1000000 width=38) (actual time=0.202..8735.620 rows=1000000 loops=1)  
                                                   Output: test1_1.ctid, sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text])  
                                                   Buffers: shared hit=4425  
 Planning time: 0.292 ms  
 Execution time: 16500.934 ms  
(46 rows)  

Method 2

Use a subquery with unnest and array_agg to sort and rebuild the arrays inline. This method is faster but harder to read.

explain (analyze,verbose,timing,costs,buffers)   
delete from test1 where ctid = any(array(   
select rid from  
  (  
    select row_number() over(partition by val order by rid) as rn, rid from   
      (  
        select rid, array_agg(arr order by arr) val from   
          (select ctid rid, unnest(array[c1,c2]) arr from test1) t  
        group by rid  
      ) t  
  ) t  
where t.rn<>1  
));  
  
                                                                                      QUERY PLAN                                                                                         
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=18979422.39..18984425.62 rows=199 width=36) (actual time=10186.459..10186.459 rows=0 loops=1)  
   Buffers: shared hit=575991, temp read=19174 written=19212  
   ->  Merge Semi Join  (cost=18979422.39..18984425.62 rows=199 width=36) (actual time=9421.861..9923.844 rows=567141 loops=1)  
         Output: test1.ctid, t.*  
         Merge Cond: (test1.ctid = t.rid)  
         Buffers: shared hit=8850, temp read=19174 written=19212  
         ->  Sort  (cost=122879.44..125379.56 rows=1000050 width=6) (actual time=796.023..894.723 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.50 rows=1000050 width=6) (actual time=0.016..122.654 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=18856542.95..18856543.45 rows=199 width=36) (actual time=8625.739..8775.583 rows=567141 loops=1)  
               Output: t.*, t.rid  
               Sort Key: t.rid  
               Sort Method: external merge  Disk: 28952kB  
               Buffers: shared hit=4425, temp read=17217 written=17255  
               ->  Subquery Scan on t  (cost=18856528.85..18856535.35 rows=199 width=36) (actual time=6749.159..8251.185 rows=567141 loops=1)  
                     Output: t.*, t.rid  
                     Filter: (t.rn <> 1)  
                     Rows Removed by Filter: 432859  
                     Buffers: shared hit=4425, temp read=13598 written=13623  
                     ->  WindowAgg  (cost=18856528.85..18856532.85 rows=200 width=46) (actual time=6749.138..8073.103 rows=1000000 loops=1)  
                           Output: row_number() OVER (?), test1_1.ctid, (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2]))))  
                           Buffers: shared hit=4425, temp read=13598 written=13623  
                           ->  Sort  (cost=18856528.85..18856529.35 rows=200 width=38) (actual time=6749.128..7507.854 rows=1000000 loops=1)  
                                 Output: test1_1.ctid, (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2]))))  
                                 Sort Key: (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2])))), test1_1.ctid  
                                 Sort Method: external merge  Disk: 44040kB  
                                 Buffers: shared hit=4425, temp read=13598 written=13623  
                                 ->  GroupAggregate  (cost=18106479.21..18856519.21 rows=200 width=38) (actual time=2315.955..4053.484 rows=1000000 loops=1)  
                                       Output: test1_1.ctid, array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2])))  
                                       Group Key: test1_1.ctid  
                                       Buffers: shared hit=4425, temp read=5382 written=5382  
                                       ->  Sort  (cost=18106479.21..18356491.71 rows=100005000 width=10) (actual time=2315.934..2530.362 rows=2000000 loops=1)  
                                             Output: test1_1.ctid, (unnest(ARRAY[test1_1.c1, test1_1.c2]))  
                                             Sort Key: test1_1.ctid  
                                             Sort Method: external sort  Disk: 43056kB  
                                             Buffers: shared hit=4425, temp read=5382 written=5382  
                                             ->  ProjectSet  (cost=0.00..521950.88 rows=100005000 width=10) (actual time=0.019..836.774 rows=2000000 loops=1)  
                                                   Output: test1_1.ctid, unnest(ARRAY[test1_1.c1, test1_1.c2])  
                                                   Buffers: shared hit=4425  
                                                   ->  Seq Scan on public.test1 test1_1  (cost=0.00..14425.50 rows=1000050 width=14) (actual time=0.010..137.319 rows=1000000 loops=1)  
                                                         Output: test1_1.c1, test1_1.c2, test1_1.ctid  
                                                         Buffers: shared hit=4425  
 Planning time: 0.241 ms  
 Execution time: 10228.121 ms  
(49 rows)  

Method 3

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_arr text[];   
  cur1 cursor for select sort_vals(c1::text,c2::text) as arr from test1 order by sort_vals(c1::text,c2::text),ctid for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.arr = v_arr then  
      delete from test1 where current of cur1;  
    end if;  
    v_arr := v_rec.arr;   
  end loop;  
end;  
$$;  
DO  
  
Time: 18542.457 ms (00:18.542)  

Summary

The performance of the three methods varies greatly depending on whether an index can accelerate them; the window-function approach is the most stable across scenarios, so we recommend it. For hybrid deduplication, where arrays holding the same elements in different orders must compare as equal, sort the elements first with a custom sort function such as sort_vals.
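A typed variant of sort_vals avoids casting everything to text; here is a minimal sketch for integers (hypothetical name sort_ints):

create or replace function sort_ints(variadic v_arr int[]) returns int[] as $$
  select array_agg(x order by x) from unnest(v_arr) t(x);
$$ language sql strict;

-- sort_ints(2,1) and sort_ints(1,2) both return {1,2}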


Courtesy: alibabacloud

Copy Table With or Without Data Using Create Table As Statement


Create Table As
When writing an Oracle application, or sometimes for testing purposes, we need to create a copy of a table. There are several ways to accomplish this, such as:

You can inspect the structure of the table with the DESC command and then write a CREATE TABLE statement by hand.
You can use the Export and Import utilities to create the copy, and so on.

But why bear the pain of all that hard work and waste your precious time when you have a simple, easy-to-use tool at hand?

Yep, you heard it right: Oracle provides a simple command for creating a copy of a table, either with or without the data. It is called the "Create Table As" command.

"Create Table As" is a DDL command that lets you create an exact copy of a table in your database, with or without its data.


Syntax:

CREATE TABLE  table_name_1 
 AS
 SELECT [*]/column_name(s) FROM table_name_2 WHERE expression; 


The syntax is simple: CREATE TABLE is an Oracle reserved phrase, followed by the user-defined name of your new table (table_name_1), then the mandatory keyword AS, followed by an ordinary SELECT statement.

table_name_2 is the name of the table you want to copy, and table_name_1 is the name of your new table.

Examples:
For the demonstration we will use the EMPLOYEES table of the HR sample schema that ships with Oracle Database.




E.g. 1: How to copy both the structure and data of a table.
Suppose you want to make an exact copy of the EMPLOYEES table of the HR schema; by exact copy I mean copying the whole structure along with the data.

Copying the exact structure and data of the EMPLOYEES table takes a very simple CREATE TABLE statement:

 
CREATE TABLE employees_copy
 AS
 SELECT * FROM employees;

Successful execution of the above command creates EMPLOYEES_COPY with the same structure and data as the EMPLOYEES table of the HR schema.

E.g. 2. How to copy specific columns of a table along with their data.

Now suppose you want to copy only the first_name, last_name, and email columns, along with their data, from the EMPLOYEES table. In this case the CREATE TABLE statement will be:


CREATE TABLE employees_copy
 AS
 SELECT first_name, last_name, email FROM employees;

Successful execution of the above command creates EMPLOYEES_COPY, this time with only the first_name, last_name, and email columns and their data.

E.g. 3. How to copy only structure of the table without the data.

Sometimes you might need to copy only the structure of a table, without the data. You could copy the whole table as shown above and then truncate it, but that takes two separate commands for a single task: a waste of time and resources when the same thing can be done with a single "Create Table As" command.

The CREATE TABLE command stays the same as in the examples above; you just add a WHERE clause that can never be true, such as 'Cats' = 'Dog' or 1=0. Let's see how:
 
CREATE TABLE employees_copy
 AS
 SELECT first_name, last_name, email FROM employees WHERE 1=0;
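The new table is created empty; a quick check confirms it. One caveat worth knowing: CTAS copies column definitions and NOT NULL constraints, but not primary keys, foreign keys, other indexes, or triggers; those must be recreated separately.

SELECT COUNT(*) FROM employees_copy;
-- returns 0: the structure was copied, but no rows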

ORA-02297: cannot disable constraint ( .. ) - dependencies exist


SQL> alter table scott.employee disable constraint employee_pk ;
ORA-02297: cannot disable constraint (SCOTT.EMPLOYEE_PK) - dependencies exist
    
    

Problem:
The DISABLE CONSTRAINT command fails because the table is a parent table and foreign keys on other tables depend on this constraint.

Fix:
There are two things we can do here.
1) Find the foreign key constraints that depend on the table's constraint, disable them first, and then disable the table's own constraint.

The following query lists the dependent (child) tables and constraint names.
Disable the child constraints first, then the parent constraint.

SELECT p.table_name "Parent Table", c.table_name "Child Table",
p.constraint_name "Parent Constraint", c.constraint_name "Child Constraint"
FROM user_constraints p
JOIN user_constraints c ON(p.constraint_name=c.r_constraint_name)
WHERE (p.constraint_type = 'P' OR p.constraint_type = 'U')
AND c.constraint_type = 'R' AND p.table_name = UPPER('&table_name')
/

The following query generates a script to disable the child constraints:

select 'alter table '||c.table_name||' disable constraint '||c.constraint_name||' ;'
FROM user_constraints p
JOIN user_constraints c ON(p.constraint_name=c.r_constraint_name)
WHERE (p.constraint_type = 'P' OR p.constraint_type = 'U')
AND c.constraint_type = 'R' AND p.table_name = UPPER('&table_name')
/

2) Disable the constraint with the CASCADE option, which disables the dependent foreign keys in one step.

SQL> alter table scott.employee disable constraint employee_pk cascade;
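Note that there is no ENABLE ... CASCADE. After a cascade disable, the parent constraint can be re-enabled, but each dependent foreign key stays disabled and must be re-enabled individually (a sketch using the names from the example above; child table and constraint names are placeholders):

alter table scott.employee enable constraint employee_pk;
-- then, for each child:
-- alter table <child_table> enable constraint <fk_name>;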



One more handy query, this time across all schemas via ALL_CONSTRAINTS:

SELECT p.owner, p.table_name "Parent Table", c.table_name "Child Table",
p.constraint_name "Parent Constraint", c.constraint_name "Child Constraint"
FROM all_constraints p
JOIN all_constraints c ON(p.constraint_name=c.r_constraint_name)
WHERE (p.constraint_type = 'P' OR p.constraint_type = 'U')
AND c.constraint_type = 'R'
--AND p.OWNER = ''
AND p.table_name = UPPER('&table_name');

Optimize Oracle UNDO Parameter

Overview 

Starting with Oracle9i, rollback segments were renamed undo logs. Traditionally, transaction undo information was stored in rollback segments until a COMMIT or ROLLBACK was issued, at which point it was made available for overwriting.

Best of all, automatic undo management allows the DBA to specify how long undo information should be retained after commit, preventing "snapshot too old" errors on long-running queries.
This is done by setting the UNDO_RETENTION parameter. The default is 900 seconds (15 minutes), and you can set this parameter to guarantee that Oracle keeps undo logs for extended periods of time.

Rather than having to define and manage rollback segments, you can simply define an undo tablespace and let Oracle take care of the rest. Turning on automatic undo management is easy: all you need to do is create an undo tablespace and set UNDO_MANAGEMENT = AUTO.
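A minimal setup sketch (tablespace name and datafile path are hypothetical; UNDO_MANAGEMENT is a static parameter, so it takes effect only after an instance restart):

CREATE UNDO TABLESPACE undotbs2
  DATAFILE '/u01/oradata/ORCL/undotbs2_01.dbf' SIZE 2G AUTOEXTEND ON;

ALTER SYSTEM SET UNDO_MANAGEMENT = AUTO SCOPE=SPFILE;
ALTER SYSTEM SET UNDO_TABLESPACE = undotbs2;
ALTER SYSTEM SET UNDO_RETENTION = 900;  -- seconds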

However, two parameters are worth tuning: the size of the UNDO tablespace and the UNDO_RETENTION parameter.

Calculate UNDO_RETENTION for a Given UNDO Tablespace


You can choose to allocate a specific size for the UNDO tablespace and then set the UNDO_RETENTION parameter to an optimal value according to the UNDO size and the database activity. If your disk space is limited and you do not want to allocate more space than necessary to the UNDO tablespace, this is the way to proceed. The following queries will help you optimize the UNDO_RETENTION parameter.

Because these queries use V$UNDOSTAT statistics, run them only after the database has been running with undo for a significant and representative period!
Actual Undo Size
SELECT SUM(a.bytes) "UNDO_SIZE"
  FROM v$datafile a,
       v$tablespace b,
       dba_tablespaces c
 WHERE c.contents = 'UNDO'
   AND c.status = 'ONLINE'
   AND b.name = c.tablespace_name
   AND a.ts# = b.ts#;
 UNDO_SIZE
----------
  209715200
Undo Blocks per Second
SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
    
  "UNDO_BLOCK_PER_SEC"
  FROM v$undostat;
UNDO_BLOCK_PER_SEC
------------------
        3.12166667
DB Block Size
SELECT TO_NUMBER(value) "DB_BLOCK_SIZE [Byte]"
 FROM v$parameter
WHERE name = 'db_block_size';
DB_BLOCK_SIZE [Byte]
--------------------
                4096
Optimal Undo Retention = Actual Undo Size / (Undo Blocks per Second * DB Block Size):
209,715,200 / (3.12166667 * 4,096) = 16,401 [Sec]
Using Inline Views, you can do all in one query!
SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
       SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
       ROUND((d.undo_size / (to_number(f.value) *
       g.undo_block_per_sec))) "OPTIMAL UNDO RETENTION [Sec]"
  FROM (
       SELECT SUM(a.bytes) undo_size
          FROM v$datafile a,
               v$tablespace b,
               dba_tablespaces c
         WHERE c.contents = 'UNDO'
           AND c.status = 'ONLINE'
           AND b.name = c.tablespace_name
           AND a.ts# = b.ts#
       ) d,

       v$parameter e,
       v$parameter f,
       (
       SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
              undo_block_per_sec
         FROM v$undostat
       ) g

WHERE e.name = 'undo_retention'
  AND f.name = 'db_block_size'
/
ACTUAL UNDO SIZE [MByte]
------------------------
200

UNDO RETENTION [Sec]
--------------------
10800

OPTIMAL UNDO RETENTION [Sec]
----------------------------
16401
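If you adopt the computed value, it can be applied dynamically (using the figure from this example):

ALTER SYSTEM SET UNDO_RETENTION = 16401 SCOPE=BOTH;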
Calculate Needed UNDO Size for given Database Activity

If you are not limited by disk space, then it would be better to choose the UNDO_RETENTION time that is best for you (for FLASHBACK, etc.). Allocate the appropriate size to the UNDO tablespace according to the database activity:
Again, all in one query:
SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
       SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
       (TO_NUMBER(e.value) * TO_NUMBER(f.value) *
       g.undo_block_per_sec) / (1024*1024) 
      "NEEDED UNDO SIZE [MByte]"
  FROM (
       SELECT SUM(a.bytes) undo_size
         FROM v$datafile a,
              v$tablespace b,
              dba_tablespaces c
        WHERE c.contents = 'UNDO'
          AND c.status = 'ONLINE'
          AND b.name = c.tablespace_name
          AND a.ts# = b.ts#
       ) d,
      v$parameter e,
       v$parameter f,
       (
       SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
         undo_block_per_sec
         FROM v$undostat
       ) g
 WHERE e.name = 'undo_retention'
  AND f.name = 'db_block_size'
/
ACTUAL UNDO SIZE [MByte]
------------------------
200
UNDO RETENTION [Sec] 
--------------------
10800
NEEDED UNDO SIZE [MByte]
------------------------
131.695313
The previous query may return a "NEEDED UNDO SIZE" that is smaller than the "ACTUAL UNDO SIZE". If so, you may be wasting space: either resize your UNDO tablespace to a smaller value or increase your UNDO_RETENTION parameter to make use of the additional space.
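An undo tablespace cannot simply be shrunk in place; the usual approach is to create a new, smaller one and switch to it (a sketch with hypothetical names and paths):

CREATE UNDO TABLESPACE undotbs_small
  DATAFILE '/u01/oradata/ORCL/undotbs_small_01.dbf' SIZE 150M;

ALTER SYSTEM SET UNDO_TABLESPACE = undotbs_small;

-- after no active transactions still reference the old undo tablespace:
DROP TABLESPACE undotbs1 INCLUDING CONTENTS AND DATAFILES;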

How to Calculate the size of archive log files each day


Calculate the size of archive log files each day 

SQL> SELECT TRUNC(COMPLETION_TIME) ARCHIVED_DATE,
           SUM(BLOCKS * BLOCK_SIZE) / 1024 / 1024 SIZE_IN_MB
      FROM V$ARCHIVED_LOG
     GROUP BY TRUNC(COMPLETION_TIME)
     ORDER BY 1;

ARCHIVED_DATE SIZE_IN_MB
------------- ----------
2020-02-14 00:00:00 2406.45117
2020-02-15 00:00:00 3491.56445
2020-02-16 00:00:00 3196.87354
2020-02-17 00:00:00 2917.70605
2020-02-18 00:00:00 2798.62158
2020-02-19 00:00:00 2650.23242

Size of the archive log files each hour. 

SQL> alter session set nls_date_format = 'YYYY-MM-DD HH24';

Session altered.

SQL> SELECT TRUNC(COMPLETION_TIME, 'HH') ARCHIVED_DATE_HOUR,
            SUM(BLOCKS * BLOCK_SIZE) / 1024 / 1024 SIZE_IN_MB
       FROM V$ARCHIVED_LOG
      GROUP BY TRUNC(COMPLETION_TIME, 'HH')
      ORDER BY 1;

ARCHIVED_DATE SIZE_IN_MB
------------- ----------
2020-01-21 02 13304.7856


Size of the archive log files each day per instance

SQL> SELECT
            TRUNC(COMPLETION_TIME) ARCHIVED_DATE,
            THREAD#,
            SUM(BLOCKS * BLOCK_SIZE) / 1024 / 1024 SIZE_IN_MB
       FROM V$ARCHIVED_LOG
      GROUP BY TRUNC(COMPLETION_TIME), THREAD#
      ORDER BY 1, 2;

ARCHIVED_    THREAD# SIZE_IN_MB
--------- ---------- ----------
26-JAN-20          1 148411.591

Online Redo Log Switch Frequency Map Query

set pages 999 lines 400
col h0 format 999
col h1 format 999
col h2 format 999
col h3 format 999
col h4 format 999
col h5 format 999
col h6 format 999
col h7 format 999
col h8 format 999
col h9 format 999
col h10 format 999
col h11 format 999
col h12 format 999
col h13 format 999
col h14 format 999
col h15 format 999
col h16 format 999
col h17 format 999
col h18 format 999
col h19 format 999
col h20 format 999
col h21 format 999
col h22 format 999
col h23 format 999
SELECT TRUNC (first_time) "Date", inst_id, TO_CHAR (first_time, 'Dy') "Day",
 COUNT (1) "Total",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) "h0",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23",
 ROUND (COUNT (1) / 24, 2) "Avg"
FROM gv$log_history
WHERE thread# = inst_id
AND first_time > sysdate -7
GROUP BY TRUNC (first_time), inst_id, TO_CHAR (first_time, 'Dy')
ORDER BY 1,2;

How to Export an Excel File to a Pipe-Delimited File Rather Than a Comma-Delimited File

  
Windows 7: follow these steps to change the delimiter:

- Make sure Excel is closed
- Navigate to control panel
- Select ‘Region and Language’
- Click the ‘Additional Settings’ button
- Find the List separator and change it from a comma to your preferred delimiter such as a pipe (|).
- Click OK
- Click OK
- Exit Control panel
- Open the Excel file you want to export to a pipe delimited file
- Select File, Save As
- Change the ‘Save as type’ to ‘CSV (Comma delimited)(*.csv)’
- Change the name and file extension if you want; by default it stays .csv even though a different delimiter is used
- Click Save
- Click OK
- Click Yes