How much of data wrangling is a data scientist's job?











42












$begingroup$


I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.



Data pipelines are non-existent within the company; the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No complete piece of information is contained in any single data source, and they can't grant me access to their ERP database for "security reasons".



Initial data analysis for even the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time is spent parsing these files and cross-referencing data sources in order to build viable datasets. This is not simply a matter of handling missing data or preprocessing; it's about the work it takes to build data that can be handled in the first place (solvable by a DBA or data engineering, not data science?).




1) Feels like most of the work is not related to data science at all. Is this accurate?



2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?



3) Is this type of setup common for a company with serious data science needs?





















$endgroup$











  • $begingroup$
    Did you specify which format you want the information in? And give them instructions on how they can do this with their ERP?
    $endgroup$
    – jonnor
    Apr 3 at 19:57










  • $begingroup$
    @jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
    $endgroup$
    – Victor Valente
    Apr 3 at 20:12






  • 13




    $begingroup$
    Start tracking your hours and convert them to a cost showing how much of your time is wasted converting the TXT files back to a usable format. I'll bet that once they have a $ figure, they can get it done.
    $endgroup$
    – Nelson
    Apr 4 at 3:05










  • $begingroup$
    If it is a burden on your time you could outsource it.
    $endgroup$
    – Sarcoma
    Apr 4 at 6:23










  • $begingroup$
    I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger of keeping data in long TXT files without real security around it.
    $endgroup$
    – Pedro Henrique Monforte
    Apr 6 at 13:04















data-wrangling














asked Apr 3 at 15:16 by Victor Valente (edited Apr 5 at 15:56)





















9 Answers


















24












$begingroup$


  1. Feels like most of the work is not related to data science at all. Is this accurate?



    Yes




  2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



    You're not wrong, but such are the realities of real life.




  3. Is this type of setup common for a company with serious data science needs?



    Yes



From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another at reading certain data. E.g. R's readxl is orders of magnitude faster than Python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc.). I know you're not working on xlsx files and I have no idea if you use Python - it was just an example.
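To make the "import once, then save to a friendlier format" idea concrete, here is a minimal sketch using only the Python standard library. It assumes the TXT dumps are pipe-delimited with a header row, which may well not match the real files; the table name, column names and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def load_delimited_text(conn, table, lines, delimiter="|"):
    """Load a header row plus data rows from a delimited text stream into SQLite.

    Assumes every data row has the same number of fields as the header;
    a malformed row makes sqlite3 raise an error instead of being skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    # Skip blank lines and trim stray whitespace around each field.
    rows = ([field.strip() for field in row] for row in reader if row)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


# Hypothetical sample standing in for one of the TXT dumps.
sample = io.StringIO("txn_id|store|amount\n1001|A12|19.90\n1002|B07|5.00\n")
conn = sqlite3.connect(":memory:")  # use a file path to keep the result
load_delimited_text(conn, "transactions", sample)

row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
total = conn.execute(
    "SELECT SUM(CAST(amount AS REAL)) FROM transactions"
).fetchone()[0]
```

Once the data sits in a queryable store, the slow parsing step only has to run once per dump, and every later analysis starts from SQL instead of from raw text.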



From a practical standpoint, two things:



  • First of all, understand what is technically possible. In many cases, the people telling you no are IT-illiterate people who worry about business or compliance considerations, but have no concept of what is and isn't feasible from an IT standpoint. Try to speak to the DBAs or to whoever manages the data infrastructure. Understand what is technically possible. THEN, and only then, try to find a compromise. E.g. they won't give you access to their system, but I presume there is a database behind it? Maybe they can extract the data to some other format? Maybe they can extract the SQL statements that define the data types, etc.?


  • Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

















$endgroup$








  • 1




    $begingroup$
    Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
    $endgroup$
    – Jason
    Apr 4 at 13:35






  • 2




    $begingroup$
    True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
    $endgroup$
    – PythonGuest
    Apr 4 at 13:58


















36












$begingroup$

This is a situation that many blogs, companies and papers acknowledge as something real in many cases.



In this paper, Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it:



"data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data."



Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.



Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file on which you can start data exploration after a little bit of cleaning. You usually find the data in a format that is not suitable for your needs.



What you can do is make as much use of the old data as you can, and try to change how new data is stored so that it will be easier for you (or a future colleague) to work with.

















$endgroup$












  • $begingroup$
    Forbes article claiming the same 80% figure.
    $endgroup$
    – Jesse Amano
    Apr 3 at 19:08






  • 4




    $begingroup$
    Forbes should nowhere be mentioned together with the words "data science".
    $endgroup$
    – gented
    Apr 3 at 22:52










  • $begingroup$
    50-80% based on (quote) "interviews and expert estimates"
    $endgroup$
    – oW_
    Apr 3 at 23:33






  • 3




    $begingroup$
    @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
    $endgroup$
    – Keeta
    Apr 4 at 11:46


















25












$begingroup$


Feels like most of the work is not related to data science at all. Is this accurate?




This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf






The result of the paper reflects my experience as well. The vast majority of time is spent acquiring, cleaning and processing data.















$endgroup$




















    6












    $begingroup$

    As another recent starter in Data Science, I can only add that I don't think your experience is unique. My team of about 10 apparently hasn't done any DS in over a year (apart from one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team has been working on, which still just isn't quite delivering the data. Apparently retention has been fairly poor in the past, and there's the continuous promise of a holy-grail MS Azure environment for future DS projects.



    So to answer:



    1) Yes totally accurate



    2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).



    3) I'm sure there are companies out there that are better than others. If you can't stand it at your current company, 2 years is a decent length of time; start looking for brighter things (be careful how you phrase your desire to leave your current job: something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").















    $endgroup$




















      6












      $begingroup$


      1. Feels like most of the work is not related to data science at all. Is this accurate?



        Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job where the data is already in better order.




      2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



        I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).




      3. Is this type of setup common for a company with serious data science needs?



        Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.
















      $endgroup$




















        5












        $begingroup$

        If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.



        You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.



        Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it (the best person for the job) is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?















        $endgroup$




















          3












          $begingroup$

          Perhaps to put it simply:



          • When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

          • When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

          You need to work with and understand your data, which includes everything from simple stuff like fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from being collected to being displayed. Processing it requires knowing the same pieces of information, so it is partially work you would have had to do anyway.
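As a small illustration of the "fixing inconsistencies" part, here is a sketch that maps the different ways a source can spell "no value" onto a single `None`. The set of tokens and the sample record are assumptions made up for the example; in real data a bare "-" may be a legitimate value in some columns, so the token list would probably need to be set per column.

```python
# Assumed spellings of "missing"; adjust per data source and per column.
MISSING_TOKENS = {"", "-", "null", "na", "n/a"}


def normalize_missing(value):
    """Return None for any known 'missing' spelling, else the trimmed text."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text


def clean_record(record):
    """Apply normalize_missing to every field of a dict of strings."""
    return {key: normalize_missing(val) for key, val in record.items()}


# Hypothetical raw record as it might come out of a TXT dump.
raw = {"customer": " Ana ", "discount": "-", "coupon": "NULL", "city": ""}
cleaned = clean_record(raw)
```

Keeping this logic in one small function means the decision about what counts as missing is written down once, rather than scattered across notebooks.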



          Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea. I think keeping an intermediate dataset of processed data would be useful, if you're allowed to (and if you can't do it in MySQL).



          But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.















          $endgroup$




















            3












            $begingroup$

            1) Feels like most of the work is not related to data science at all. Is this accurate?
            In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of the data wrangling a data scientist should be required to do. It depends on the organization's bandwidth and the person's interest in doing such work. In my 15 to 16 years of experience as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So make your own call.



            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
            Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing the data to a person who is a temporary employee (sorry to use these words :-()



            3) Is this type of setup common for a company with serious data science needs?
            I feel these kinds of companies need the most attention from data scientists, to convince them that data-driven modeling is the future that will sustain their business. :-)



            I have given my input from a business standpoint rather than a technical one. :-)
            Hope I am clear in my choice of words.















            $endgroup$




















              3












              $begingroup$

              In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)



              He says that there are a number of open problems in this area:

              • Ingest
              • Transform (e.g. euro/dollar)
              • Clean (e.g. -99/Null)
              • Schema mapping (e.g. wages/salary)
              • Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)
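To give a feel for what those steps look like in code, here is a toy sketch of four of them (ingest is left out). Nothing here comes from the talk itself: the exchange rate, the column alias table, the `-99` sentinel set, the similarity threshold and the sample record are all hypothetical, and the fuzzy name match uses `difflib` from the Python standard library as a simple stand-in for real entity resolution.

```python
from difflib import SequenceMatcher

EUR_TO_USD = 1.10                    # hypothetical fixed rate, illustration only
COLUMN_ALIASES = {"wages": "salary"}  # hypothetical schema mapping
SENTINELS = {-99}                    # assumed placeholder for "missing"


def map_schema(record):
    """Schema mapping: rename aliased columns to one canonical name."""
    return {COLUMN_ALIASES.get(key, key): val for key, val in record.items()}


def clean(record):
    """Clean: replace sentinel values such as -99 with None."""
    return {key: (None if val in SENTINELS else val) for key, val in record.items()}


def to_usd(record):
    """Transform: convert a EUR salary to USD at the assumed fixed rate."""
    out = dict(record)
    if out.get("currency") == "EUR" and out.get("salary") is not None:
        out["salary"] = round(out["salary"] * EUR_TO_USD, 2)
        out["currency"] = "USD"
    return out


def same_entity(a, b, threshold=0.8):
    """Entity consolidation: treat two names as one entity if they are similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Hypothetical record using the examples from the list above.
record = {"name": "Mike Stonebraker", "wages": 100.0, "bonus": -99, "currency": "EUR"}
normalized = to_usd(clean(map_schema(record)))
is_match = same_entity("Mike Stonebraker", "Michael Stonebreaker")
```

Each function is trivial on its own; the hard part in practice is discovering the alias tables, sentinels and matching rules for every source, which is why this work eats so much time.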



              There are a number of companies/products working to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata and Google Refine.



              Until this area matures, a lot of the data scientist's job will indeed be data wrangling.















              $endgroup$





















                24












                $begingroup$


                1. Feels like most of the work is not related to data science at all. Is this accurate?



                  Yes




                2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?



                  You're not wrong, but such are the realities of real life.




                3. Is this type of setup common for a company with serious data science needs?



                  Yes



                From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another at reading certain data. E.g. R's readxl is orders of magnitude faster than Python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc). I know you're not working on xlsx files and I have no idea if you use Python; it was just an example.
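
As a rough illustration of the "import once, then save to a friendlier format" idea, here is a sketch that copies a delimited text export into SQLite using only Python's standard library. It swaps in CSV and SQLite for the readxl and parquet tools mentioned above, and the function name, table name, and sample data are hypothetical. Every value is stored as text, so real use would still need type conversion.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(text_stream, conn, table):
    """Copy a delimited text stream into a SQLite table, one column per header field."""
    reader = csv.reader(text_stream)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    # The csv reader yields one list per row, which sqlite3 accepts as parameters.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Hypothetical export; in practice this would be open("export.csv", newline="").
source = io.StringIO("customer_id,amount\n1,9.50\n2,12.00\n")
conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the result
load_csv_into_sqlite(source, conn, "orders")
rows = conn.execute(
    "SELECT customer_id, amount FROM orders ORDER BY customer_id"
).fetchall()
```

The slow parse then happens once, and later analysis sessions can query the database file instead of re-reading the raw export.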



                From a practical standpoint, two things:



                • First of all, understand what is technically possible. In many cases,
                  the people telling you no are IT-illiterate people who worry about
                  business or compliance considerations, but have no concept of what is
                  and isn't feasible from an IT standpoint. Try to speak to the DBAs or
                  to whoever manages the data infrastructure. Understand what is
                  technically possible. THEN, and only then, try to find a compromise. E.g.
                  they won't give you access to their system, but I presume there is a
                  database behind it? Maybe they can extract the data to some other
                  format? Maybe they can extract the SQL statements that define the
                  data types, etc.?


                • Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

















                $endgroup$








                • 1




                  $begingroup$
                  Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
                  $endgroup$
                  – Jason
                  Apr 4 at 13:35






                • 2




                  $begingroup$
                  True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
                  $endgroup$
                  – PythonGuest
                  Apr 4 at 13:58























                edited Apr 9 at 1:54
                Toros91

                answered Apr 4 at 12:29
                PythonGuest


















                36












                $begingroup$

                Many blogs, companies and papers acknowledge that this situation is real and common.



                The paper Data Wrangling for Big Data: Challenges and Opportunities includes a quote about it:

                "data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data."




                Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights



                Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file on which you can start data exploration after a little bit of cleaning. You have to find the data yourself, and it is usually in a format that doesn't suit your needs.



                What you can do is make as much use of the old data as you can, and try to change how new data is stored so that it will be easier for you (or a future colleague) to work with.

















                $endgroup$












                • $begingroup$
                  Forbes article claiming the same 80% figure.
                  $endgroup$
                  – Jesse Amano
                  Apr 3 at 19:08






                • 4




                  $begingroup$
                  Forbes should nowhere be mentioned together with the words "data science".
                  $endgroup$
                  – gented
                  Apr 3 at 22:52










                • $begingroup$
                  50-80% based on (quote) "interviews and expert estimates"
                  $endgroup$
                  – oW_
                  Apr 3 at 23:33






                • 3




                  $begingroup$
                  @gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
                  $endgroup$
                  – Keeta
                  Apr 4 at 11:46























                edited Apr 4 at 12:37
                Stephen Rauch

                answered Apr 3 at 16:35
                Tasos






















                25












                $begingroup$


                Feels like most of the work is not related to data science at all. Is this accurate?




                This is the reality of any data science project. Google researchers described it in the paper "Hidden Technical Debt in Machine Learning Systems": https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf






                The paper's conclusions match my experience as well: the vast majority of time is spent acquiring, cleaning and processing data.















                $endgroup$

















                  25












                  $begingroup$


                  Feels like most of the work is not related to data science at all. Is this accurate?




                  This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                  enter image description here



                  Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






                  share|improve this answer









                  $endgroup$















                    25












                    25








                    25





                    $begingroup$


                    Feels like most of the work is not related to data science at all. Is this accurate?




                    This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                    enter image description here



                    Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.






                    share|improve this answer









                    $endgroup$




                    Feels like most of the work is not related to data science at all. Is this accurate?




                    This is the reality of any data science project. Google actually measured it and published a paper "Hidden Technical Debt in Machine Learning Systems" https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf



                    enter image description here



                    Result of the paper reflects my experience as well. Vast majority of time is spent in acquiring, cleaning and processing data.

















answered Apr 3 at 16:47 by Shamit Verma

As another recent starter in data science, I can only add that I don't think your experience is unique. My team of about 10 apparently hasn't done any DS in over a year, apart from one small project that occupied 2 of the team. This is due to the promise of an effective pipeline that the team has been working on but that still isn't quite delivering the data. Apparently retention has been fairly poor in the past, and there's a continuous promise of a holy-grail MS Azure environment for future DS projects.

So to answer:

1) Yes, totally accurate.

2) No, you're correct, but it's an uphill battle to get access to the data you want (if it even exists).

3) I'm sure there are companies out there that are better than others. If you can't stand it at your current company, 2 years is a decent length of time, so start looking for brighter things. Be careful how you phrase your desire to leave your current job: something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data".

















answered Apr 3 at 23:03 by Oliver Houston

1. Feels like most of the work is not related to data science at all. Is this accurate?

  Wrangling data is most definitely in the data scientist job description. At some level you have to understand the data-generating process in order to use the data to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job where the data is already in better order.

2. I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

  I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).

3. Is this type of setup common for a company with serious data science needs?

  Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.


















answered Apr 4 at 19:40 by Underminer

If you look at this from the perspective of "this isn't my job, so why should I do it", then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this, and usually they can be persuaded, or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority goes, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.

You could also look at it from another perspective: is this a good use of your time? It sounds like you took a job to do some tasks (what you mean by "data science") but you are having to do another thing (what you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here, because there is something more pertinent: the company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) for the company to assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.

Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job -- is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?

















answered Apr 4 at 21:05 by Whelibeiren

Perhaps to put it simply:

• When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?
• When peers review your findings, if they had questions about particular bits of data, would it embarrass you not to know the answers?

You need to work with and understand your data, which includes everything from simple stuff like fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from being collected to being displayed. Processing it requires knowing the same pieces of information, so it is partially work you would have had to do anyway.
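As a rough illustration of the "simple stuff" mentioned above, here is a minimal Python sketch that maps the different ways of writing "no value" (NULL, empty string, "-") onto a single representation. The token list, the function names and the sample columns are all made up for the example, so adjust them to whatever actually turns up in your dumps.

```python
import csv
import io

# Hypothetical set of placeholders that all mean "no value" in the raw dump.
MISSING_TOKENS = {"", "-", "null", "none", "na", "n/a"}


def normalise_cell(value):
    """Return None for any missing-value placeholder, else the stripped text."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text


def read_clean_rows(csv_text):
    """Parse CSV text into a list of dicts with consistent missing values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: normalise_cell(val) for key, val in row.items()} for row in reader]


# Made-up sample: three rows, each hiding a missing value in a different way.
raw = "customer,age,city\nalice,34,NULL\nbob,-,Leeds\ncarol,, York \n"
rows = read_clean_rows(raw)
```

Writing the token list down explicitly also forces you to ask where each placeholder comes from, which is part of understanding the data in the first place.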



Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea: I think having an intermediate dataset of processed data would be useful if you're allowed to keep one (and can't do it in MySQL).
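A minimal sketch of such an intermediate dataset might look like the following. It uses SQLite from the Python standard library as a stand-in for MySQL, purely because it needs no server; the table name, columns and rows are hypothetical.

```python
import sqlite3

# Hypothetical cleaned rows produced by an earlier wrangling step.
clean_rows = [
    ("alice", 34, None),
    ("bob", None, "Leeds"),
]


def store_intermediate(conn, rows):
    """Write cleaned rows to a staging table so later analysis can reuse them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers_clean ("
        "customer TEXT PRIMARY KEY, age INTEGER, city TEXT)"
    )
    # INSERT OR REPLACE keeps re-runs of the wrangling step idempotent.
    conn.executemany(
        "INSERT OR REPLACE INTO customers_clean VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would persist the data
store_intermediate(conn, clean_rows)
store_intermediate(conn, clean_rows)  # second run does not duplicate rows

total = conn.execute("SELECT COUNT(*) FROM customers_clean").fetchone()[0]
known_ages = conn.execute(
    "SELECT COUNT(*) FROM customers_clean WHERE age IS NOT NULL"
).fetchone()[0]
```

The point of the staging table is that the expensive cleaning runs once, and everything downstream queries the processed copy instead of the raw text files.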



But of course you're still setting things up from scratch. This is not an easy process, but this "learning experience" is at least good to put on your CV.

















answered Apr 4 at 22:51 by David M













                                                        $begingroup$

1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of the data wrangling a data scientist should be required to do. It depends on the organization's bandwidth and the person's interest in doing such work. In my 15 to 16 years of experience as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So take your call.

2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing data to a person who is a temporary employee (sorry to use these words :-( ).

3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies require the most attention from data scientists, to make them feel that data-driven modeling is the future that will sustain their business. :-)

I have given my inputs from a business standpoint rather than a technical one. :-)
Hope I am clear in my choice of words.






                                                        share|improve this answer









                                                        $endgroup$

















                                                          3












                                                          $begingroup$

                                                          1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                          In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                          2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                          Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                          3) Is this type of setup common for a company with serious data science needs?
                                                          I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                          I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                          Hope I am clear in my choice of words.






                                                          share|improve this answer









                                                          $endgroup$















                                                            3












                                                            3








                                                            3





                                                            $begingroup$

                                                            1) Feels like most of the work is not related to data science at all. Is this accurate?
                                                            In my opinion, Data Science cannot pull out from Data wrangling. But, as you said, the question would come on how much percentage of Data Wrangling is required to do by a Data Scientist. It depends on Organizations bandwidth and the person interest in doing such work. In my experience of 15 to 16 years as DS, I always, spent around 60% to 70% in data wrangling activity and spent to a max of 15% of time in real analysis. so take your call.



                                                            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
                                                            Again it depends on organization's security policies. They cannot leave everything to you and they have their own security issues to reveal the data to a person who is temporary employee (sorry to use this words :-()



                                                            3) Is this type of setup common for a company with serious data science needs?
                                                            I feel these kind of companies require most attention from Data Scientists to make feel that data driven modeling is the future to sustain their business. :-)



                                                            I have given my inputs in thinking of businesses instead of technical stand points. :-)
                                                            Hope I am clear in my choice of words.






                                                            share|improve this answer









                                                            $endgroup$



                                                            1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of the data wrangling a data scientist should be required to do. It depends on the organization's bandwidth and the person's interest in doing such work. In my 15 to 16 years of experience as a data scientist, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So take your call.



                                                            2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing data to a temporary employee (sorry to use these words :-( ).



                                                            3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies need the most attention from data scientists, to convince them that data-driven modeling is the future for sustaining their business. :-)



I have given my inputs from a business standpoint rather than a technical one. :-)
I hope my choice of words is clear.

















                                                            answered Apr 5 at 4:33









user70920





















In his talk "Big Data is four different problems", Turing Award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides).



He says that there are a number of open problems in this area: ingest, transform (e.g. euro/dollar), clean (e.g. -99/Null), schema mapping (e.g. wages/salary), and entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker).
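To make those categories concrete, here is a minimal Python sketch of four of them (transform, clean, schema mapping, entity consolidation) applied to one toy record. It is not taken from the talk: the exchange rate, the sentinel codes, the column alias table, and the similarity threshold are all assumptions made up for illustration, and the string-similarity check is only a crude stand-in for real entity consolidation.

```python
from difflib import SequenceMatcher

EUR_TO_USD = 1.10                     # hypothetical fixed rate, not a real quote
SENTINELS = {-99, "-99", ""}          # assumed codes that really mean "missing"
COLUMN_ALIASES = {"wages": "salary"}  # assumed schema mapping
MATCH_THRESHOLD = 0.75                # arbitrary similarity cut-off for this sketch


def clean(value):
    """Clean: turn sentinel codes such as -99 into an explicit None."""
    return None if value in SENTINELS else value


def to_usd(amount, currency):
    """Transform: express an amount in one common currency."""
    if amount is None:
        return None
    return round(amount * EUR_TO_USD, 2) if currency == "EUR" else amount


def map_schema(record):
    """Schema mapping: rename source columns to the target names."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}


def same_entity(name_a, name_b):
    """Entity consolidation: treat sufficiently similar names as one entity."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= MATCH_THRESHOLD


# A made-up source record: "wages" is the source column, -99 means missing.
raw = {"name": "Mike Stonebraker", "wages": -99, "currency": "EUR"}
record = map_schema(raw)
record["salary"] = to_usd(clean(record["salary"]), record["currency"])
```

Even in this toy form, every step depends on a judgment call (which codes mean missing, which columns are equivalent, how similar is similar enough), which is part of why these problems remain hard to automate.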



A number of companies and products are trying to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata, and Google Refine.



Until this area matures, a lot of the data scientist's job will indeed be data wrangling.

















                                                                    answered Apr 6 at 9:49









hojusaram


























