performance - SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery
I have a table in an SQLite database where I need to count the duplicates across certain columns (i.e. rows where 3 particular columns are the same) and then also number each of those cases (i.e. if there are 2 occurrences of a particular duplicate, they need to be numbered 1 and 2). I find it difficult to explain in words, so I will use a simplified example below.
The data I have is similar to the following (the first line is the header row; the table is referenced below as "idcountdata"):
    id  match1  match2  match3  data
    1   AbCde   BC      0       data01
    2   AbCde   BC      0       data02
    3   AbCde   BC      1       data03
    4   AbCde   AB      0       data04
    5   FGhiJ   BC      0       data05
    6   FGhiJ   AB      0       data06
    7   FGhiJ   BC      1       data07
    8   FGhiJ   BC      1       data08
    9   FGhiJ   BC      2       data09
    10  HkLMop  BC      1       data10
    11  HkLMop  BC      1       data11
    12  HkLMop  BC      1       data12
    13  HkLMop  DE      1       data13
    14  HkLMop  DE      2       data14
    15  HkLMop  DE      2       data15
    16  HkLMop  DE      2       data16
    17  HkLMop  DE      2       data17

And this is the output I need:
    id  match1  match2  match3  data    matchid  matchcount
    1   AbCde   BC      0       data01  1        2
    2   AbCde   BC      0       data02  2        2
    3   AbCde   BC      1       data03  1        1
    4   AbCde   AB      0       data04  1        1
    5   FGhiJ   BC      0       data05  1        1
    6   FGhiJ   AB      0       data06  1        1
    7   FGhiJ   BC      1       data07  1        2
    8   FGhiJ   BC      1       data08  2        2
    9   FGhiJ   BC      2       data09  1        1
    10  HkLMop  BC      1       data10  1        3
    11  HkLMop  BC      1       data11  2        3
    12  HkLMop  BC      1       data12  3        3
    13  HkLMop  DE      1       data13  1        1
    14  HkLMop  DE      2       data14  1        4
    15  HkLMop  DE      2       data15  2        4
    16  HkLMop  DE      2       data16  3        4
    17  HkLMop  DE      2       data17  4        4

Previously I was using a couple of correlated subqueries to achieve this:
    SELECT id, match1, match2, match3, data,
        (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1 = d2.match1 AND d1.match2 = d2.match2
           AND d1.match3 = d2.match3 AND d2.id <= d1.id) AS matchid,
        (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1 = d2.match1 AND d1.match2 = d2.match2
           AND d1.match3 = d2.match3) AS matchcount
    FROM idcountdata d1;

But the table has more than 200,000 rows (and the data can vary in length and content), so this takes hours to run. (Strangely, when I first ran the same query on the same data in mid-to-late 2013, it took minutes instead of hours, but that is beside the point. Even back then I thought it was inelegant and inefficient.)
I have already converted the correlated subquery for "matchcount" into an uncorrelated subquery with a JOIN:
    SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, matchcount
    FROM idcountdata d1
    JOIN (SELECT id, match1, match2, match3, count(*) matchcount
          FROM idcountdata
          GROUP BY match1, match2, match3) d2
      ON (d1.match1 = d2.match1 AND d1.match2 = d2.match2 AND d1.match3 = d2.match3);

So it is only the subquery for "matchid" that I need some help optimising. In summary, the following query runs very slowly on large data sets:
    SELECT id, match1, match2, match3, data,
        (SELECT count(*) FROM idcountdata d2
         WHERE d1.match1 = d2.match1 AND d1.match2 = d2.match2
           AND d1.match3 = d2.match3 AND d2.id <= d1.id) matchid
    FROM idcountdata d1;

How can I improve the performance of the above query?
It does not have to run in seconds, but it needs to take minutes rather than hours (for around 200,000 rows).
A self-join may be faster than a correlated subquery:
    SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, count(*) matchid
    FROM idcountdata d1
    JOIN idcountdata d2
      ON d1.match1 = d2.match1 AND d1.match2 = d2.match2
     AND d1.match3 = d2.match3 AND d1.id >= d2.id
    GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data

This query can take advantage of a composite index on (match1, match2, match3, id).
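To check the self-join against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The column types, the index name idx_idcountdata_match, and the helper names build_sample_db and number_duplicates are assumptions made for illustration. Joining the self-join to the grouped "matchcount" subquery from the question in a single statement is also my own combination, not something either post spells out.

```python
import sqlite3

# Sample rows from the question: (id, match1, match2, match3, data).
SAMPLE_ROWS = [
    (1, "AbCde", "BC", 0, "data01"), (2, "AbCde", "BC", 0, "data02"),
    (3, "AbCde", "BC", 1, "data03"), (4, "AbCde", "AB", 0, "data04"),
    (5, "FGhiJ", "BC", 0, "data05"), (6, "FGhiJ", "AB", 0, "data06"),
    (7, "FGhiJ", "BC", 1, "data07"), (8, "FGhiJ", "BC", 1, "data08"),
    (9, "FGhiJ", "BC", 2, "data09"), (10, "HkLMop", "BC", 1, "data10"),
    (11, "HkLMop", "BC", 1, "data11"), (12, "HkLMop", "BC", 1, "data12"),
    (13, "HkLMop", "DE", 1, "data13"), (14, "HkLMop", "DE", 2, "data14"),
    (15, "HkLMop", "DE", 2, "data15"), (16, "HkLMop", "DE", 2, "data16"),
    (17, "HkLMop", "DE", 2, "data17"),
]

# matchid comes from the self-join (rows in the same group with id <= d1.id);
# matchcount comes from the grouped subquery used in the question.
QUERY = """
SELECT d1.id, count(*) AS matchid, c.matchcount
FROM idcountdata d1
JOIN idcountdata d2
  ON d2.match1 = d1.match1 AND d2.match2 = d1.match2
 AND d2.match3 = d1.match3 AND d2.id <= d1.id
JOIN (SELECT match1, match2, match3, count(*) AS matchcount
      FROM idcountdata
      GROUP BY match1, match2, match3) c
  ON c.match1 = d1.match1 AND c.match2 = d1.match2
 AND c.match3 = d1.match3
GROUP BY d1.id, c.matchcount
ORDER BY d1.id
"""


def build_sample_db():
    """Create an in-memory database holding the sample table and index."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the question does not state them.
    conn.execute(
        "CREATE TABLE idcountdata (id INTEGER PRIMARY KEY, "
        "match1 TEXT, match2 TEXT, match3 INTEGER, data TEXT)"
    )
    conn.executemany("INSERT INTO idcountdata VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    # Composite index suggested in the answer (the index name is hypothetical).
    conn.execute(
        "CREATE INDEX idx_idcountdata_match "
        "ON idcountdata (match1, match2, match3, id)"
    )
    return conn


def number_duplicates(conn):
    """Return one (id, matchid, matchcount) tuple per row, ordered by id."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in number_duplicates(build_sample_db()):
        print(row)
```

On the real table it is worth running EXPLAIN QUERY PLAN on the statement to confirm that the index is actually chosen for the d2 lookup, since the planner's choice depends on the data and the SQLite version.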