How much of data wrangling is a data scientist's job?


I'm currently working as a data scientist at a retail company (my first job as a DS, so this question may be a result of my lack of experience). They have a huge backlog of really important data science projects that would have a great positive impact if implemented. But.

Data pipelines are non-existent within the company; the standard procedure is for them to hand me gigabytes of TXT files whenever I need some information. Think of these files as tabular logs of transactions stored in arcane notation and structure. No whole piece of information is contained in one single data source, and they can't grant me access to their ERP database for "security reasons".

Initial data analysis for even the simplest project requires brutal, excruciating data wrangling. More than 80% of a project's time is spent parsing these files and crossing data sources in order to build viable datasets. This is not a problem of simply handling missing data or preprocessing it; it's about the work it takes to build data that can be handled in the first place (solvable by a DBA or data engineering, not data science?).

1) Feels like most of the work is not related to data science at all. Is this accurate?

2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that in order to build for a sustainable future of data science projects, minimum levels of data accessibility are required. Am I wrong?

3) Is this type of setup common for a company with serious data science needs?

edited Apr 5 at 15:56

asked Apr 3 at 15:16

Victor Valente

31528

Did you specify which format you want the information in? And give them instructions on how they can do this with their ERP?
– jonnor
Apr 3 at 19:57

@jonnor Of course. I've been working here for almost two years now, and since day 1 I explained how we could build a better platform for data accessibility. There's strong resistance to changing what the company has been doing for 30 years though.
– Victor Valente
Apr 3 at 20:12

Start tracking your hours and convert them to a cost showing how much they're wasting your time converting the TXT back to a usable format. I'll bet you once they have a $ figure, they can get it done.
– Nelson
Apr 4 at 3:05

If it is a burden on your time you could outsource it.
– Sarcoma
Apr 4 at 6:23

I find it confusing that a company would hire a Data Scientist and still be resistant to change. You should show them the amount of wasted time and the danger of keeping data in long TXT files without real security around it.
– Pedro Henrique Monforte
Apr 6 at 13:04

data-wrangling


9 Answers

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a technical standpoint, you need to look into ETL solutions that can make your life easier. Sometimes one tool can be much faster than another at reading certain data. E.g. R's readxl is orders of magnitude faster than Python's pandas at reading xlsx files; you could use R to import the files, then save them to a Python-friendly format (parquet, SQL, etc.). I know you're not working on xlsx files and I have no idea if you use Python; it was just an example.
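As a rough illustration of the "import once, save to a friendlier format" idea, here is a minimal sketch that loads a delimited text export into SQLite using only the Python standard library. The pipe delimiter, the table name `transactions`, the column names and the helper `load_delimited_text` are all assumptions made up for the example; your real files will need their own parsing rules.

```python
import csv
import io
import sqlite3


def load_delimited_text(conn, table, lines, delimiter="|"):
    """Load a header-first delimited text stream into a SQLite table.

    Table and column names are taken from the caller and the file header,
    so this sketch assumes they are trusted, simple identifiers.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    # The reader is consumed lazily, so the file is never held in memory.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Hypothetical sample standing in for one of the large TXT exports.
sample = io.StringIO("order_id|store|amount\n1|A|10.50\n2|B|7.25\n3|A|3.00\n")

conn = sqlite3.connect(":memory:")  # pass a file path to persist between runs
load_delimited_text(conn, "transactions", sample)

# Values are stored as text, so cast before aggregating.
totals = conn.execute(
    "SELECT store, SUM(CAST(amount AS REAL)) FROM transactions "
    "GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

With a real export you would pass an open file handle instead of the `io.StringIO` sample, and connect to a file on disk so the parsing cost is paid only once.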

From a practical standpoint, two things:

First of all, understand what is technically possible. In many cases, the people telling you no are IT-illiterate people who worry about business or compliance considerations, but have no concept of what is and isn't feasible from an IT standpoint. Try to speak to the DBAs or to whoever manages the data infrastructure. Understand what is technically possible. THEN, only then, try to find a compromise. E.g. they won't give you access to their system, but I presume there is a database behind it? Maybe they can extract the data to some other formats? Maybe they can extract the SQL statements that define the data types etc.?

Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 9 at 1:54

Toros91

2,0142829

answered Apr 4 at 12:29

PythonGuest

2562

Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
– Jason
Apr 4 at 13:35

True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
– PythonGuest
Apr 4 at 13:58

This is a situation that many blogs, companies and papers acknowledge as something real in many cases.

In the paper Data Wrangling for Big Data: Challenges and Opportunities, there is a quote about it:

data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data.

Also, you can read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

Unfortunately, the real world is not like Kaggle. You don't get a CSV or Excel file on which you can start the data exploration after a little bit of cleaning. Usually you find the data in a format that is not suitable for your needs.

What you can do is make use of the old data as much as you can, and try to change how new data is stored so that it is easier for you (or a future colleague) to work with.

edited Apr 4 at 12:37

Stephen Rauch♦

1,52551330

answered Apr 3 at 16:35

Tasos

1,59511138

Forbes article claiming the same 80% figure.
– Jesse Amano
Apr 3 at 19:08

Forbes should nowhere be mentioned together with the words "data science".
– gented
Apr 3 at 22:52

50-80% based on (quote) "interviews and expert estimates"
– oW_♦
Apr 3 at 23:33

@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
– Keeta
Apr 4 at 11:46

Feels like most of the work is not related to data science at all. Is this accurate?

This is the reality of any data science project. Google actually measured it and published a paper, "Hidden Technical Debt in Machine Learning Systems": https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

The paper's result reflects my experience as well. The vast majority of time is spent acquiring, cleaning and processing data.

answered Apr 3 at 16:47

Shamit Verma

1,6541414


As another recent starter in Data Science, I can only add that I don't think your experience is unique. My team of about 10 apparently hasn't done any DS in over a year (apart from one small project that occupied 2 of the team). This is due to the promise of an effective pipeline the team has been working on, which still isn't quite delivering the data. Apparently retention has been fairly poor in the past, and there's the continuous promise of a holy-grail MS Azure environment for future DS projects.

So to answer:

1) Yes totally accurate

2) No you're correct, but it's an uphill battle to get access to the data you want (if it even exists).

3) I'm sure there are companies out there that are better than others. If you can't stand it at your current company, 2 years is a decent length of time, so start looking for brighter things (be careful how you phrase your desire to leave your current job; something like "looking to work with a more dynamic team" would sound better than "my old company won't give me data").

answered Apr 3 at 23:03

Oliver Houston

1611


Feels like most of the work is not related to data science at all. Is this accurate?

Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster or more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).

Is this type of setup common for a company with serious data science needs?

Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.

answered Apr 4 at 19:40

Underminer

1614


If you look at this from the perspective of "this isn't my job, so why should I do it" then that's a fairly common, general problem not specific to data science. Ultimately, your job is to do whatever the boss tells you to do, but in practice there is little reason for the boss to be dictatorial about this and usually they can be persuaded. Or at least they will give you a sincere explanation of why it has to be that way. But as far as appealing to authority, there is no official definition of "Data Science" that says you can only do at most X% data cleaning. The authority is whoever is paying you, so long as they have the legal right to stop paying you.

You could also look at it from another perspective: Is this a good use of your time? It sounds like you took a job to do some tasks (which you mean by "data science") but you are having to do another thing (which you call "data wrangling"). Job descriptions and personal feelings are a bit beside the point here because there is something more pertinent: The company presumably pays you a good amount of money to do something that only you can do (the data science). But it's having you do other things instead, which could be done by other people who are some combination of more capable, more motivated or less expensive. If the data wrangling could be done by someone making half your salary, then it makes no sense to pay you twice as much to do the same thing. If it could be done faster by someone paid the same salary, the same logic applies. Therefore it is a waste of resources (especially money) to have the company assign this task to you. Coming at it from this perspective, you might find it much easier to make your superiors see your side of things.

Of course, at the end of the day, somebody has to do the data wrangling. It may be that the cheapest, fastest, easiest way of doing it -- the best person for the job, is you. In that case, you're kind of out of luck. You could try to claim it's not part of your contract, but what are the odds they were naive enough to put something that specific in the contract?

answered Apr 4 at 21:05

Whelibeiren

742


Perhaps to put it simply:

When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

You need to work with and understand your data - which includes simple stuff from fixing inconsistencies (NULLs, empty strings, "-") to understanding how a piece of data goes from collected to being displayed. Processing it includes knowing the same pieces of information, so it is partially work you would have had to do anyway.
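To make the "fixing inconsistencies" step concrete, here is a minimal sketch that maps the different spellings of "missing" (NULLs, empty strings, "-") to a single `None` value. The token list `MISSING_TOKENS` and the sample rows are assumptions for illustration; the right list depends on how your own sources encode missing values.

```python
# Assumed spellings of "missing"; extend this set for your own sources.
MISSING_TOKENS = {"", "-", "null", "none", "na", "n/a"}


def normalize_missing(value):
    """Return None for any known missing-value token, else the trimmed text."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned.lower() in MISSING_TOKENS:
        return None
    return cleaned


def clean_row(row):
    """Apply normalize_missing to every field of a dict-shaped record."""
    return {key: normalize_missing(val) for key, val in row.items()}


# Hypothetical records as they might come out of a raw text export.
rows = [
    {"customer": "Ana", "city": "NULL", "phone": "-"},
    {"customer": " Bo ", "city": "", "phone": "555-0100"},
]
cleaned_rows = [clean_row(r) for r in rows]
print(cleaned_rows)
```

Keeping the token list in one place means that when a new spelling shows up, you fix it once rather than in every analysis script.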

Now, it sounds like this company could benefit from setting up some sort of free MySQL (or similar) instance to hold your data. Trying to be flexible when you're designing your wrangling code is also a good idea - having an intermediate dataset of processed data I think would be useful if you're allowed to (and can't do it in MySQL).

But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.

answered Apr 4 at 22:51

David M

1312


1) Feels like most of the work is not related to data science at all. Is this accurate?
In my opinion, data science cannot be separated from data wrangling. But, as you said, the question is what percentage of the data wrangling a data scientist is required to do. It depends on the organization's bandwidth and the person's interest in doing such work. In my 15 to 16 years of experience as a DS, I have always spent around 60% to 70% of my time on data wrangling and at most 15% on real analysis. So take your call.

2) I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?
Again, it depends on the organization's security policies. They cannot leave everything to you, and they have their own security concerns about revealing data to a person who is a temporary employee (sorry to use these words :-().

3) Is this type of setup common for a company with serious data science needs?
I feel these kinds of companies require the most attention from data scientists, to make them feel that data-driven modeling is the future for sustaining their business. :-)

I have given my inputs from a business standpoint rather than a technical one. :-)
Hope I am clear in my choice of words.

answered Apr 5 at 4:33

user70920

311


In his talk "Big Data is four different problems", Turing award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides)

He says that there are a number of open problems in this area:
Ingest,
Transform (e.g. euro/dollar),
Clean (e.g. -99/Null),
Schema mapping (e.g. wages/salary),
Entity consolidation (e.g. Mike Stonebraker/Michael Stonebreaker)
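For a feel of what three of these steps look like in code, here is a small sketch covering schema mapping (wages/salary), cleaning (-99/Null) and transformation (euro/dollar) on dict-shaped records. The alias table, the sentinel set, the field names and the fixed exchange rate are all hypothetical values chosen for illustration; entity consolidation is a much harder problem and is not attempted here.

```python
COLUMN_ALIASES = {"wages": "salary"}  # schema mapping (hypothetical alias table)
SENTINELS = {"-99"}                   # values that really mean "missing"
EUR_TO_USD = 1.10                     # hypothetical fixed rate, illustration only


def harmonize(record):
    """Rename aliased columns, null out sentinels, and express salary in USD."""
    out = {}
    for key, value in record.items():
        key = COLUMN_ALIASES.get(key, key)
        out[key] = None if value in SENTINELS else value

    if out.get("salary") is not None:
        amount = float(out["salary"])
        if out.get("currency") == "EUR":
            amount = round(amount * EUR_TO_USD, 2)
            out["currency"] = "USD"
        out["salary"] = amount
    return out


# Hypothetical records from two sources with different conventions.
records = [
    {"name": "A", "wages": "1000", "currency": "EUR"},
    {"name": "B", "salary": "-99", "currency": "USD"},
    {"name": "C", "salary": "2000", "currency": "USD"},
]
harmonized = [harmonize(r) for r in records]
print(harmonized)
```

In practice the exchange rate would depend on the transaction date, and the alias table tends to grow with every new source, which is part of why this area is still considered an open problem.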

There are a number of companies/products trying to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata and Google Refine.

Until this area matures, a lot of the data scientist's job will indeed be data wrangling.

answered Apr 6 at 9:49

hojusaram

1311


Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48531%2fhow-much-of-data-wrangling-is-a-data-scientists-job%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

9 Answers
9

active

oldest

votes

9 Answers
9

active

oldest

votes

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

First of all, understand what is technically possible. In many cases,
the people telling you know are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 9 at 1:54

Toros91

2,0142829

answered Apr 4 at 12:29

PythonGuest

2562

1

$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
Apr 4 at 13:35

2

$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
Apr 4 at 13:58

add a comment |

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

First of all, understand what is technically possible. In many cases,
the people telling you know are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 9 at 1:54

Toros91

2,0142829

answered Apr 4 at 12:29

PythonGuest

2562

1

$begingroup$
Excellent point about finding / buidling an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
$endgroup$
– Jason
Apr 4 at 13:35

2

$begingroup$
True. But, also, don't overoptimise. Choose your priorities wisely. If importing the data is a one -off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes. Etc.
$endgroup$
– PythonGuest
Apr 4 at 13:58

add a comment |

Feels like most of the work is not related to data science at all. Is this accurate?

Yes

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

You're not wrong, but such are the realities of real life.

Is this type of setup common for a company with serious data science needs?

Yes

From a practical standpoint, two things:

First of all, understand what is technically possible. In many cases,
the people telling you know are IT-illiterate people who worry about
business or compliance considerations, but have no concept of what is
and isn't feasible from an IT standpoint. Try to speak to the DBAs or
to whoever manages the data infrastructure. Understand what is
technically possible. THEN, only then, try to find a compromise. E.g.
they won't give you access to their system, but I presume there is a
database behind it? Maybe they can extract the data to some other
formats? Maybe they can extract the SQL statements that define the
data types etc?

Business people are more likely to help you if you can make the case that doing so is in THEIR interest. If they don't even believe in what you're doing, tough luck...

edited Apr 9 at 1:54

Toros91

2,0142829

answered Apr 4 at 12:29

PythonGuest

2562


Excellent point about finding / building an ETL solution. Just need to add: pick a setup you are comfortable with and can easily read / debug. In the early stages of automating tasks, this is even more important than finding the fastest data-slurp tool. If it's gigs of text, it'll likely often run overnight, and your fluency with a tool / framework / language can make the difference between waking up to good data or something you have to start again. Just a single do-over can wipe out any efficiency benefits. Better to be steady with fewer bugs than to go fast and stumble.
– Jason
Apr 4 at 13:35

True. But also, don't overoptimise. Choose your priorities wisely. If importing the data is a one-off, don't spend days looking for how to reduce the import time from 2 hours to 30 minutes, etc.
– PythonGuest
Apr 4 at 13:58


This is a situation that many blogs, companies and papers acknowledge as real in many cases.

The paper Data Wrangling for Big Data: Challenges and Opportunities includes a quote about it:

"data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data."

You can also read the source of that quote in this article from The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.

What you can do is make use of the old data as much as you can, and try to adapt the storing of new data into a process that will be easier for you (or a future colleague) to work with.

edited Apr 4 at 12:37 by Stephen Rauch♦

answered Apr 3 at 16:35 by Tasos

Forbes article claiming the same 80% figure.
– Jesse Amano
Apr 3 at 19:08

Forbes should nowhere be mentioned together with the words "data science".
– gented
Apr 3 at 22:52

50-80% based on (quote) "interviews and expert estimates"
– oW_♦
Apr 3 at 23:33

@gented Opinion based comment about an opinion based survey in an opinion based article placed on an opinion based answer to an opinion based question. Who would have thought you would find this in "Data Science" SE?
– Keeta
Apr 4 at 11:46


Feels like most of the work is not related to data science at all. Is this accurate?

[Image from the original post not available]

The result of the paper reflects my experience as well. The vast majority of time is spent acquiring, cleaning and processing data.

answered Apr 3 at 16:47 by Shamit Verma


So to answer:

1) Yes, totally accurate.

2) No, you're correct, but it's an uphill battle to get access to the data you want (if it even exists).

answered Apr 3 at 23:03 by Oliver Houston


Feels like most of the work is not related to data science at all. Is this accurate?

Wrangling data is most definitely in the Data Scientist job description. At some level you have to understand the data-generating process in order to use it to drive solutions. Sure, someone specialized in ETL could do it faster and more efficiently, but being given data dumps is not uncommon in the real world. If you don't like this aspect of data science, there may be an opportunity to work more closely with IT resources to get the data properly sourced into a warehouse you have access to. Alternatively, you could find a job that already has data in better order.

I know this is not a data-driven company with a high-level data engineering department, but it is my opinion that data science requires minimum levels of data accessibility. Am I wrong?

I think the minimum level is txt files. If you have access to the data via text files, you should have access to the data in the database (push back on this with superiors).

Is this type of setup common for a company with serious data science needs?

Yes. You are the data SCIENTIST; you are the expert. It is part of your job to educate others on the inefficiencies of the current data structure and how you can help. Data that isn't usable isn't helping anyone. You have an opportunity to make things better and shape the future of the company.

answered Apr 4 at 19:40 by Underminer


answered Apr 4 at 21:05 by Whelibeiren


Perhaps to put it simply:

When creating variables and binning numerics, would you be doing that blindly, or after analysing your data?

When peers review your findings, if they had questions about particular bits of data, would it embarrass you to not know them?

But of course you're still setting up things from scratch. This is not an easy process, but this "learning experience" is at least good to put in your CV.
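As a small illustration of the binning question above: bin edges chosen after looking at the data can follow its actual distribution instead of being picked blindly. This is only a sketch using Python's standard `statistics` and `bisect` modules. The sample `values` and the helper names `quantile_edges` and `assign_bin` are made up for the example.

```python
import bisect
import statistics

# Hypothetical sample of a skewed numeric variable (e.g. order values).
values = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 34, 55, 89, 144, 233]


def quantile_edges(data, n_bins):
    """Return the n_bins - 1 interior cut points taken from the data's quantiles."""
    return statistics.quantiles(data, n=n_bins, method="inclusive")


def assign_bin(x, edges):
    """Return the 0-based bin index of x, given sorted interior edges."""
    return bisect.bisect_right(edges, x)


edges = quantile_edges(values, 4)
bins = [assign_bin(v, edges) for v in values]

# Number of observations per bin; ties in the data can make these unequal.
counts = [bins.count(b) for b in range(4)]
```

With skewed data like this, equal-width bins would put almost everything in the first bin, which is exactly the kind of thing you only notice once you have looked at the data yourself.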

answered Apr 4 at 22:51 by David M


I have given my input from a business standpoint rather than a technical one. :-)
I hope my choice of words is clear.

answered Apr 5 at 4:33 by user70920


In his talk "Big Data is four different problems", Turing Award winner Michael Stonebraker mentions this particular issue as a big problem (video, slides).

There are a number of companies/products working to solve this problem, such as Tamr, Alteryx, Trifacta, Paxata and Google Refine.

Until this area matures, a lot of the data scientist's job will indeed be data wrangling.

answered Apr 6 at 9:49 by hojusaram

