
What are the ways to partition a large file that does not fit into memory so it can later be fed as training data?


















Is there any way to partition a large file that does not fit into memory, so that it can later be fed as training data, other than using Spark or Hadoop?







machine-learning bigdata








asked Mar 26 at 21:02 by edunlimit










  • What is the size of your data? What is the size of your computer's memory? – honar.cs, Mar 27 at 6:55










  • @honar.cs Oh, I'm not using actual data; I was just curious. – edunlimit, Mar 27 at 23:06
















1 Answer

Yes, of course, although in practice Spark and Hadoop are usually the better tools for this.



Here is my idea. Suppose your memory can hold 100,000 examples; then split your data set into files of fewer than 100,000 examples each.
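
For a plain CSV file, a minimal sketch of that splitting step might look like the following, assuming pandas is available; the file name big_file.csv and the 100,000-row chunk size are placeholders:

import pandas as pd

# Read the oversized CSV in chunks of at most 100,000 rows each
# (adjust the chunk size to whatever actually fits in your RAM).
reader = pd.read_csv("big_file.csv", chunksize=100_000)
for i, chunk in enumerate(reader):
    # Each chunk is an ordinary in-memory DataFrame; write it back out
    # as a smaller file that can later be loaded on its own.
    chunk.to_csv(f"part_{i:04d}.csv", index=False)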



The key and most complex step is how to train a classifier on that data. Fortunately, models fitted with gradient-descent-style optimization algorithms (gradient boosting, SGD, and so on), which covers most common learners (SVM, GBDT, naive Bayes, logistic regression, deep learning, and so on), support this: you can load one file into RAM at a time and feed it to the classifier until you find the best parameters.



My code is very simple. Re-shuffling the samples and re-splitting the data set before each iteration will further boost the classifier.



import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 100 points in 2D, labelled 1 when the first coordinate is larger.
X = np.random.random((100, 2))
y = np.array([1 if x[0] > x[1] else 0 for x in X])

lr_cly = LogisticRegression()

def stop_train(cly, X_s, y_s, threshold):
    # Stop once the mean accuracy across all chunks exceeds the threshold.
    scores = [cly.score(X_c, y_c) for X_c, y_c in zip(X_s, y_s)]
    return np.mean(scores) > threshold

def iter_train(cly, X, y, threshold=0.99, max_iter=10):
    # Pretend each half of the data is a separate file that fits in memory.
    X_s = [X[:50, :], X[50:, :]]
    y_s = [y[:50], y[50:]]

    iter_times = 0
    while iter_times <= max_iter:
        print("--------------")
        for X_c, y_c in zip(X_s, y_s):
            cly.fit(X_c, y_c)
            print(cly.score(X_c, y_c))
        if stop_train(cly, X_s, y_s, threshold):
            break
        iter_times += 1

iter_train(lr_cly, X, y)
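
Note that calling LogisticRegression.fit on each chunk retrains the model from scratch every time, so the final weights only reflect the last chunk fitted. For genuinely incremental, out-of-core training, scikit-learn's SGD-based estimators expose partial_fit. Here is a minimal sketch of that approach; the chunk file names and the label column name are assumptions carried over from the splitting example above:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

chunk_files = ["part_0000.csv", "part_0001.csv", "part_0002.csv"]  # hypothetical chunk files
classes = np.array([0, 1])  # every class label must be declared on the first call

clf = SGDClassifier()  # linear classifier trained with stochastic gradient descent

for epoch in range(5):
    for path in chunk_files:
        chunk = pd.read_csv(path)
        X_c = chunk.drop(columns=["label"]).to_numpy()  # "label" is an assumed column name
        y_c = chunk["label"].to_numpy()
        # partial_fit updates the existing weights instead of refitting,
        # so only one chunk ever needs to be in memory at a time.
        clf.partial_fit(X_c, y_c, classes=classes)

Shuffling the order of the chunk files between epochs, as suggested above, usually helps SGD converge.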





answered Mar 27 at 11:12 by Happy Boy
(edited Mar 27 at 14:53 by Glorfindel)


























