database - Efficient checking of possible duplicate entities -


I need to produce a list of possible duplicates before a user saves an entity to the database, and warn them about the possible duplicates.

There are 7 criteria on which we should check for duplicates, and if at least 3 of them match we should flag this up to the user. All the criteria match on ID, so no fuzzy string matching is needed, but my problem comes from the fact that there are many possible ways (99, if I have done my sums correctly) for at least 3 items to match out of the list of 7 possibilities.
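The figure of 99 can be checked by summing the number of ways to choose which k of the 7 criteria match, for k from 3 to 7. A minimal sketch of that arithmetic (the variable name `ways` is only illustrative):

```python
from math import comb

# Ways to pick which k of the 7 criteria match, summed over k = 3..7:
# 35 + 35 + 21 + 7 + 1
ways = sum(comb(7, k) for k in range(3, 8))
print(ways)  # 99
```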

I do not want to run 99 separate DB queries to find my search results, nor do I want to bring the whole lot back from the DB and filter on the client side. We are probably only talking about a few thousand records at present, but this will grow into the millions as the system matures.

Has anyone found a nice, efficient way to do this? I was considering a simple OR query to get the records where at least one field matches from the DB, and then doing some processing on the client to filter them further, but a few of the fields have very low cardinality and will not reduce the number of rows by much.

OR and CASE summing will work but are quite inefficient, because they do not use indexes.
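For comparison, a CASE-summing query of the kind referred to here might look like the sketch below. It runs against an in-memory SQLite database through Python's `sqlite3` module. The table layout, the sample rows and the name `CASE_SUM_QUERY` are assumptions made for illustration, and only four criteria are shown. Because the filter is an arithmetic expression over several columns, the engine generally has to evaluate it for every row instead of seeking an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_info ("
    "id INTEGER PRIMARY KEY, name TEXT, phone TEXT, email TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO t_info VALUES (?, ?, ?, ?, ?)",
    [
        # Hypothetical sample rows
        (1, "Eve Chianese", "+15558000042", "42@example.com", "1 Other Street"),   # 3 fields match
        (2, "Eve Chianese", "+15550000000", "other@example.com", "42 North Lane"),  # 2 fields match
    ],
)

# Each CASE contributes 1 when its criterion matches; the sum is the match count.
CASE_SUM_QUERY = """
SELECT id
FROM   t_info
WHERE  CASE WHEN name    = :name    THEN 1 ELSE 0 END
     + CASE WHEN phone   = :phone   THEN 1 ELSE 0 END
     + CASE WHEN email   = :email   THEN 1 ELSE 0 END
     + CASE WHEN address = :address THEN 1 ELSE 0 END >= 3
ORDER BY id
"""

candidate = {
    "name": "Eve Chianese",
    "phone": "+15558000042",
    "email": "42@example.com",
    "address": "42 North Lane",
}
case_sum_matches = [row[0] for row in conn.execute(CASE_SUM_QUERY, candidate)]
print(case_sum_matches)  # [1]
```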

You need to use UNION for the indexes to be usable.

If a user enters a name, phone, email and address into the database, and you want to find all records that match at least 3 of these fields, you issue:

    SELECT  i.*
    FROM    (
            SELECT  id, COUNT(*) AS cnt
            FROM    (
                    SELECT  id
                    FROM    t_info t
                    WHERE   name = 'Eve Chianese'
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   phone = '+15558000042'
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   email = '42@example.com'
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   address = '42 North Lane'
                    ) q
            GROUP BY
                    id
            HAVING
                    COUNT(*) >= 3
            ) dq
    JOIN    t_info i
    ON      i.id = dq.id

This will use the indexes on these fields, and the query will be fast.
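To try the UNION ALL approach end to end, here is a minimal sketch against an in-memory SQLite database using Python's `sqlite3` module. The column set, the index names and the sample rows are assumptions for illustration, and the literal values are passed as named parameters instead of being inlined. Other database engines may plan the query differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_info (
    id INTEGER PRIMARY KEY, name TEXT, phone TEXT, email TEXT, address TEXT
);
-- One single-column index per criterion, so each UNION ALL branch can seek.
CREATE INDEX ix_info_name    ON t_info (name);
CREATE INDEX ix_info_phone   ON t_info (phone);
CREATE INDEX ix_info_email   ON t_info (email);
CREATE INDEX ix_info_address ON t_info (address);
""")
conn.executemany(
    "INSERT INTO t_info VALUES (?, ?, ?, ?, ?)",
    [
        # Hypothetical sample rows
        (1, "Eve Chianese", "+15558000042", "42@example.com", "1 Other Street"),    # 3 matches
        (2, "Eve Chianese", "+15550000000", "other@example.com", "42 North Lane"),  # 2 matches
        (3, "Someone Else", "+15558000042", "42@example.com", "42 North Lane"),     # 3 matches
        (4, "Eve Chianese", "+15558000042", "42@example.com", "42 North Lane"),     # 4 matches
        (5, "Nobody Here", "+15551111111", "none@example.com", "9 Far Road"),       # 0 matches
    ],
)

# Each branch emits an id once per matching criterion; grouping counts the hits.
QUERY = """
SELECT i.id
FROM (
    SELECT id
    FROM (
        SELECT id FROM t_info WHERE name = :name
        UNION ALL
        SELECT id FROM t_info WHERE phone = :phone
        UNION ALL
        SELECT id FROM t_info WHERE email = :email
        UNION ALL
        SELECT id FROM t_info WHERE address = :address
    ) q
    GROUP BY id
    HAVING COUNT(*) >= 3
) dq
JOIN t_info i ON i.id = dq.id
ORDER BY i.id
"""

candidate = {
    "name": "Eve Chianese",
    "phone": "+15558000042",
    "email": "42@example.com",
    "address": "42 North Lane",
}
duplicate_ids = [row[0] for row in conn.execute(QUERY, candidate)]
print(duplicate_ids)  # [1, 3, 4]
```

Extending this to the 7 criteria from the question means adding one UNION ALL branch, and one index, per criterion; the threshold in HAVING stays at 3.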

See this article in my blog for details:

  • Matching at least 3 of 4: how to get a matching record
  • Also see the question that the article is based on.

If you want to build a list of DISTINCT values from the existing data, you wrap this query in a subquery:

    SELECT  i1.*
    FROM    t_info i1
    WHERE   EXISTS
            (
            SELECT  1
            FROM    (
                    SELECT  id
                    FROM    t_info t
                    WHERE   name = i1.name
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   phone = i1.phone
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   email = i1.email
                    UNION ALL
                    SELECT  id
                    FROM    t_info t
                    WHERE   address = i1.address
                    ) q
            WHERE   q.id <> i1.id  -- skip the row itself, which trivially matches on every field
            GROUP BY
                    id
            HAVING
                    COUNT(*) >= 3
            )

Note that this DISTINCT is not transitive: if a matches b and b matches c, this does not mean that a matches c.
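A small example makes the caveat about a, b and c concrete. The sketch below uses plain Python dictionaries in place of table rows. The field values and the helper `is_match` are hypothetical, and the threshold of 3 out of 4 fields follows the queries above.

```python
FIELDS = ("name", "phone", "email", "address")

def is_match(x, y, threshold=3):
    """Return True when x and y agree on at least `threshold` fields."""
    return sum(x[f] == y[f] for f in FIELDS) >= threshold

# Hypothetical records: b differs from a in one field, c differs from b in another.
a = {"name": "N1", "phone": "P1", "email": "E1", "address": "A1"}
b = {"name": "N1", "phone": "P1", "email": "E1", "address": "A2"}
c = {"name": "N1", "phone": "P1", "email": "E2", "address": "A2"}

print(is_match(a, b))  # True: name, phone and email agree
print(is_match(b, c))  # True: name, phone and address agree
print(is_match(a, c))  # False: only name and phone agree
```

So grouping records into duplicate clusters needs an explicit decision about how chains like this should be treated; the pairwise check alone does not settle it.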

