What methods can be used to detect duplicates in an image dataset?

I want to remove duplicate images from a dataset of 50 million images. What is the best method to detect all the duplicates?

Do you think one-shot learning is good for this?







Tags: deep-learning, predictive-modeling, data-cleaning, image-classification, ensemble-modeling

asked Sep 28 '18 at 19:09 by thanatoz (edited Sep 30 '18 at 14:07)











  • Exact duplicates? – Michael M, Sep 28 '18 at 19:16

  • No, even augmented ones. – thanatoz, Sep 28 '18 at 19:17


























2 Answers
I think the dHash (difference hash) technique might help. It computes a compact signature for each image, so images with identical or near-identical signatures can be isolated as duplicates. Hashing 50M images could take a while, so perhaps try it on a smaller subset first and see how well it works.
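As a rough illustration of the idea (not code from the linked article), here is a minimal pure-Python sketch of difference hashing as it is commonly described: shrink the image to a 9×8 grayscale grid, then record for each pixel whether it is brighter than its right-hand neighbour, which gives 64 bits. The names `shrink`, `dhash`, `hamming`, and `group_exact` are made up for this example. The input is assumed to be an already decoded grayscale image given as a list of rows, the nearest-neighbour resize is a simplification of the smoothed resize an image library would do, and the "left brighter than right" bit convention is one of several in use.

```python
from collections import defaultdict


def shrink(gray, width, height):
    """Nearest-neighbour downsample of a 2-D grid of grayscale values.

    Assumption: `gray` is a non-empty list of equally long rows.
    """
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(gray, hash_size=8):
    """Return a (hash_size * hash_size)-bit integer signature."""
    # One extra column so that every row yields hash_size comparisons.
    small = shrink(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        # One bit per adjacent pair: 1 if the left pixel is brighter.
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def group_exact(signatures):
    """Group image names that share the same signature (one dict pass)."""
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        buckets[sig].append(name)
    return [names for names in buckets.values() if len(names) > 1]
```

Identical signatures can be grouped in a single pass over the dataset. For augmented copies the signatures usually differ in a few bits, so a small Hamming distance is the signal to look for; the cut-off is something to tune on a subset.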







answered Sep 28 '18 at 21:32 by The Lyrist











  • Is there a descriptive guide to using this, apart from the official Jetsetter page? – thanatoz, Oct 1 '18 at 7:22

  • Are you looking for an implementation example in a certain language? If you look through the Jetsetter article and its references, you can see sample implementations in C#, PHP, etc. Many people are kind enough to share their code via GitHub as well, and hopefully one of them will work for you. – The Lyrist, Oct 1 '18 at 18:29



















This is a problem that could be solved using a one-shot learning technique. To achieve this, we must build a model that understands the data and can measure how similar or dissimilar two images are.

For this, we must carry out the following steps:

1. Train (or fine-tune) a network on a dataset of related images.

2. After training the model, remove the final prediction layers so that the network outputs an embedding.

3. Pass your test data through the network and store each image's embedding.

4. Compute the distance between pairs of embeddings and flag the pairs whose distance falls below a chosen threshold.

5. The flagged pairs are potentially images with similar content, which can be used to find duplicates in the dataset.
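Steps 4 and 5 can be sketched as follows, assuming the embeddings have already been computed and stored as plain lists of floats keyed by image name. The function names, the choice of cosine distance, and the default threshold of 0.05 are illustrative assumptions, not part of the method described above.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        # A zero vector has no direction; treat it as maximally dissimilar.
        return 1.0
    return 1.0 - dot / norm


def find_near_duplicates(embeddings, threshold=0.05):
    """Return (name_a, name_b, distance) for every pair within the threshold.

    Assumption: `embeddings` maps an image name to its embedding vector.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        distance = cosine_distance(vec_a, vec_b)
        if distance <= threshold:
            pairs.append((name_a, name_b, distance))
    return pairs
```

This brute-force loop compares every pair, which is quadratic in the number of images. That is fine for checking the threshold on a small sample, but a 50-million-image dataset would need some form of approximate nearest-neighbour search instead.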




I referred to this paper on one-shot learning and later found this blog to be a little helpful.







answered Apr 2 at 5:32 by thanatoz


























