Chapter 2 Being Aware

2.1 Five Cornerstones

You will quickly notice that this book is structured quite differently from other offerings on the market. Let me explain why, even before I introduce the structure and the transformation that the book offers. I have been a lecturer and trainer for a few years, and I keep noticing that content matters less than the path through the content and the assumptions upon which the content is built. Only if both of these are sound can a book, course or training offer you (the reader or student) a transformation - taking you from where you are to where you want to be. In this case, I am hoping you would like to make a living within Data Science or create some amazing project within it!

This book is hence built on five cornerstones:

  • Data Science is the practice of using data to create a direct benefit.
  • The Citizen Data Scientist is becoming a crucial occupation.
  • Anyone can become a Citizen Data Scientist.
  • There are 6 levels of Citizen Data Scientist, and these should act as a learning curve.
  • Math is not required as a teaching tool for training a Citizen Data Scientist; it can be replaced by visualisation and intuition.

Have I gone crazy, claiming that anyone can become a Data Scientist? And what about the Math, which everyone praises as the crucial tool within Data Science?!? Maybe I have, but these cornerstones (assumptions) have worked for me as well as for my students over the past years. Let me first explain why I believe that they might work out for you too.

2.2 Defining Data Science

The very first thing I would like to do is answer a simple question: what is data and what is information? The difference between the two lies in their value. Data inherently have no value for their owner, simply because he/she cannot do anything beneficial with them. Information, on the other hand, can be valuable for its owner. The two are nevertheless closely connected, as information is derived from data, and the process of deriving it is called Data Science. Let me give you an example from my recent past of how I got value out of Data Science.

I have always been interested in my sleep, as I know that in order to live a happy life, I need to sleep well. The only two things I really know are how long I sleep and how I feel in the morning. I then found an app for my phone which promised to analyze my sleep and give me recommendations on how to improve it. The app uses the accelerometer in my phone to record my movements at night. During the first two nights, the app only collected data - every morning I was able to see various graphs of how I slept. After a week though, recommendations arrived, such as “when you go to bed late, around 23:00, even if you sleep a full 8 hours, your sleep quality suffers.” This is already information which had value for me. The app carries out the Data Science process of creating valuable information from data.
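To make the step from data to information concrete, here is a minimal sketch in Python. It is not how the app actually works (I do not know its internals). The nightly records are invented for illustration, and both the threshold for a “late” bedtime and the idea that more movement means worse sleep are assumptions of the sketch.

```python
from statistics import mean

# Hypothetical nightly records, invented for illustration (not real measurements):
# (bedtime as hour of day, hours slept, movements counted by the accelerometer)
nights = [
    (21.5, 8.0, 14),
    (22.0, 8.0, 16),
    (22.5, 7.5, 15),
    (23.0, 8.0, 31),
    (23.5, 8.0, 35),
    (23.0, 7.5, 29),
    (22.0, 8.5, 13),
]

LATE_BEDTIME = 23.0  # assumed threshold for "going to bed late"


def summarise(nights, late_bedtime=LATE_BEDTIME):
    """Condense raw nightly data into average movement for early and late nights."""
    early = [moves for bedtime, _, moves in nights if bedtime < late_bedtime]
    late = [moves for bedtime, _, moves in nights if bedtime >= late_bedtime]
    if not early or not late:
        return None  # nothing to compare against
    return {"early_avg_moves": mean(early), "late_avg_moves": mean(late)}


def recommend(summary):
    """Turn the summary into a sentence the owner of the data can act on."""
    # Assumption of this sketch: more movement at night means worse sleep.
    if summary is None:
        return "Not enough data yet."
    if summary["late_avg_moves"] > summary["early_avg_moves"]:
        return "Your sleep is more restless when you go to bed late; try going to bed earlier."
    return "Your bedtime does not seem to affect your sleep."


summary = summarise(nights)
print(recommend(summary))
```

The list of numbers is the data; the printed sentence is the information. Everything in between is method, which is exactly the role the tools and buzzwords play.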

If you now turn to search engines and start to search for Data Science terms, you will be overwhelmed by articles and (unfortunately) buzzwords - Machine Learning, Data Preprocessing, Artificial Intelligence, TensorFlow, R, Python. In the end, these are only methods and tools that Data Science uses to achieve the result described above. Do not get confused or overwhelmed by them. The book you are reading is a way to master these methods, but first I want you to truly and deeply understand the crucial ideas of Data Science. This might actually be the most important message of this book (that is why it comes at the beginning) - always keep in mind that the methods should serve the purpose which Data Science has.

Data Science is merely the art of turning data into information which provides benefit.

2.3 Defining Data Scientist

Now that we know what Data Science is, it is quite easy to define who a Data Scientist is. He/she is a practitioner of this field who uses its methods to achieve its goal. Usually, a Data Scientist has a general education in three areas - Programming, Mathematics and Statistics. Notice that none of these areas deals with a context or domain such as Banking, Medicine or Engineering. Thus a Data Scientist should, by definition, be domain-agnostic. We can see this in the definition given by the Oxford Dictionary as well:

a person employed to analyse and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making

He/she should be able to join any of the mentioned domains, and many more, and apply the methods of Data Science to fulfill the goal of creating value, through information, out of data. This is a very important point for you, as you will now see.

2.4 The Struggle of Data Science

Now that you have heard about Data Scientists, who are Citizen Data Scientists? I think theirs is the most crucial occupation for the future of the field of Data Science. You heard me right: not the original Data Scientists, but the new generation of Citizen Data Scientists. Let me show you why.

Data Science as a field has been around for decades in one form or another. There have been market analysts, risk modellers, product developers, artificial intelligence developers and so on. If, however, we stick to the definition of Data Science above, under which a Data Scientist can essentially come to any company and make use of its data to create benefit, we can go back to the turn of the millennium. Universities started to offer programs with names like “Data Analysis” and “Business Intelligence”, and the first graduates left these programs. It feels like the field is growing tremendously - growth in GPU power, Deep Learning, Automated Machine Learning products. Especially since the 2010s, job advertisements revolving around Data Science have been exploding! But is the field really growing from the perspective of the market? I would say very slowly, simply because the benefit created for companies (and their customers) is growing slowly.

This is caused by the fact that the field still relies on the slow and linear growth of the first generation of Data Scientists - the ones with formal education, who are able to guarantee the work. Two scenarios can now occur when a company attempts to create benefit/profit through Data Science:

  1. It hires actual Data Scientists. The costs of hiring them are tremendous, and making them productive is even more expensive. Finding a process which can be optimized through their methods, and then properly optimizing it, is lengthy and costly. As they are domain-agnostic, the company also needs to educate them and integrate them into its context. The company slowly loses interest and stops believing that actual profit on a decent scale can be created this way.
  2. It does not hire actual Data Scientists, as they are expensive and scarce. The people who come to the company were originally trained in other areas and are now expected to be actual Data Scientists. The company has just thrown away a lot of their potential, as only their recently acquired experience is counted and not their original context. The hires are now pushed as hard as actual Data Scientists would be and struggle to create valuable products. The company slowly loses interest and stops believing that actual profit on a decent scale can be created this way.

You might now disagree with me and say I am being too negative. To justify my thinking, let me ask you just one question. As employees, we are expected to create more profit than we cost our employer. How many Data Scientists do you know who can prove that they repeatedly create profit for their companies? Not so many, right? Therefore I believe that the struggle described above holds.

2.5 Who is Citizen Data Scientist

I believe that the answer to this struggle of many companies is to recognize the Citizen Data Scientist (the people arriving in scenario 2 above) and to define their relationship to Data Scientists. The market is begging us to do so, and we can see it in who comes to job interviews when we advertise a position for a Data Scientist. These are people who do not have formal education or direct experience. Let me tell you one story from my past.

I worked as a Data Scientist in a large organization with thousands of employees. Out of nowhere, I received an email from a colleague I had never heard of before. She wrote that she was really passionate about the methods of predictive modelling and Data Science and was eager to apply them in her CRM department. We met and she showed me an R script in which she had tried a bunch of Machine Learning algorithms and preprocessing methods. In turn, I encouraged her to keep up the passion, because the field has a future, and I advised her on further methods she could try.

Only years later did I realize my mistakes in that meeting, and that I had met one of the first Citizen Data Scientists. I learned that she had left the company, as she never managed to turn her passion into productive work. I was one of the reasons why she left, because I gave her advice on how to become a first generation Data Scientist, not a second generation Citizen Data Scientist. Let’s look at the definition given by Gartner in 2016:

Citizen data science bridges the gap between mainstream self-service data discovery by business users and the advanced analytics techniques of data scientists.

This colleague of mine was originally in the first part of the definition (business user), while I was in the second part of it (advanced analytics). She was trying to become a bridge between the two, without even realising it! Unfortunately, I only became wiser years later; when the same situation happens these days, I react in a different way. The company I work for now is also wiser and provides an environment where Citizen Data Scientists are given space.

She had something that I could never have - expertise in her CRM department. If I were to give her advice now, it would be:

  • Focus on how a predictive model will create benefit in your department. Unless you create a benefit, whatever we code is worthless.
  • Focus on applying your domain knowledge in your project. Only you know it, I don’t.
  • Here is a (short) list of things you should try in your R script (most of which you already have). You do not need to push for more complex methods and crazy good programming; what you have is already enough. If something more complex is required later on, I will do it for you.

Ehm, the exact opposite of what I originally did, right? A Citizen Data Scientist therefore stays in his/her original department and context, and enhances his/her knowledge of Data Science only to the required (limited) extent, so that he/she can create value. He/she possesses an edge over the (first generation) Data Scientist in context knowledge and can hence create more valuable products.

2.6 Value of Citizen Data Scientist

As everyone says these days - data are everywhere, companies just need to start utilizing them. We can thus take for granted that Data Science will be justified in the upcoming years. But why do companies need Citizen Data Scientists, and why should they focus on them instead of the first generation of Data Scientists? Simply because the growth in created benefit will be much greater thanks to:

  • The possibility of having many more Citizen Data Scientists than Data Scientists, because extensive formal education is not required.
  • The possibility of having more impactful Data Science projects, as these will be better integrated into the business if created by properly trained Citizen Data Scientists.

Citizen Data Scientists know better what needs to be optimized and in which way, because the context (business area) is where they come from. Is it then necessary for every Citizen Data Scientist to have the same level of knowledge? Of course not. That is why I am basing this book on 6 levels of Citizen Data Scientist; each person should only aim for the level suitable for his/her involvement in Data Science. So, do you want to become one of them?

What happens to the original (first generation) Data Scientists? We will still need them, just in a slightly different role than until now. They become heavy developers, guarantors and trainers. Whenever a complex problem occurs which needs a technically challenging solution, they will be the ones to focus on it. In many organizations there is only a handful of such problems, which is why this makes sense. Secondly, they act as guarantors of the solutions developed by Citizen Data Scientists. This again makes sense, because reading through something and commenting on it takes a fraction of the time needed to develop it. Finally, they should act as trainers, as there is a need to train, coach and mentor a lot of Citizen Data Scientists.

2.7 Anyone Can Become Citizen Data Scientist

I honestly believe that whoever you are, you can become a Citizen Data Scientist. Why? Because Data Science methods add up to only about half of the requirements for success; the other half is context knowledge and application. Therefore, whatever your background is, you can become one and create value for your company. Let me list some examples:

  • You are a cashier at a grocery store. There are Data Science projects aimed at optimizing customer experience. Data Scientists without your help will never figure out …add
  • You are a biomedical researcher add
  • You are a branch advisor for a bank add

2.8 The 6 Levels

A course or book can have great content and great exercises, but unless I am able to keep you motivated during your journey and you feel that you are growing with every module finished, it is not good. That is why this book sets 6 levels through which you will grow. With each new level, your power within Data Science will grow and so will your value for employers. This gives you the ultimate opportunity to practice your skills, because you will be able to use them in the real world! As you will notice, each level comes with its definition and the learning time required (from my perspective) to reach it from the previous one.

Being Aware
You know what the field is about, what it can and cannot achieve. Learning: 1 hour

Observer
You are able to observe a running Data Science project without being able to contribute to it. You are, however, able to make use of the outputs of a Data Science project as well as help with the inputs for such a project. Learning: 5 hours

Contributor
You are able to take on simple tasks within a Data Science project and contribute to it productively. Learning: 30 hours

Statistician
You are able to assume any basic task within a Data Science project and hold responsibility for statistical parts. Learning: 80 hours

Project Responsible
You are able to define a Data Science project and execute it. Learning: 40 hours

Butter Knife
You are able to grow a Data Science initiative in your organization, both in headcount (hiring) and conceptually (specializations). Learning: 40 hours
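Since each learning time above is counted from the previous level, the total investment needed to reach a given level is the running sum of the hours. A small Python sketch of that arithmetic follows; the level names and hours are taken from the list above, while the function name is only illustrative.

```python
from itertools import accumulate

# Level names and learning hours as listed above (hours are counted from the previous level).
levels = [
    ("Being Aware", 1),
    ("Observer", 5),
    ("Contributor", 30),
    ("Statistician", 80),
    ("Project Responsible", 40),
    ("Butter Knife", 40),
]


def cumulative_hours(levels):
    """Return the total learning hours needed to reach each level from scratch."""
    names = [name for name, _ in levels]
    running_totals = accumulate(hours for _, hours in levels)
    return dict(zip(names, running_totals))


totals = cumulative_hours(levels)
for name, hours in totals.items():
    print(f"{name}: {hours} hours in total")
```

By this count, the “Contributor” level takes 36 hours of learning in total, and the full path adds up to 196 hours.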

Each of these levels is very different, both in its content and in the approach it requires from you. For example, on the Statistician level, you are going to focus solely on statistical formulas and the intuition behind them. Be prepared to sit for longer hours, stretching your analytical mind. All of a sudden this stops, and in order for you to become a Project Responsible, we are going to sharpen your soft skills, such as how to organize a team effectively. You will need to free your mind and think about people instead of code.

This book unfortunately cannot cover all that you should know on each of these levels. What the book intends is to give you the intuition and awareness of concepts required on each level.

The reason why I decided to write this way (not covering concepts through technical details and coding) is that, in my experience, once you have the intuition behind a concept, applying it becomes rather simple.

2.9 Real World Experience

What is the biggest struggle of anyone who would like to become a (Citizen) Data Scientist? Getting real world experience and practice. Whatever people claim about their online courses and trainings, a trainer will never be able to offer you what awaits you in the real world of Data Science. I don’t claim it about my trainings either. The only honest way is hence to go to a real project. That is why I created the levels, so that at each one it is as easy as possible for you to get to real world problems. Let me show you an example of why I decided to write and teach in this way.

You come to a job interview (external or internal one) and say that you have been learning and would like to become part of this Data Science project. The conversation might go as follows:

Interviewer: Great to see your interest. So how can you contribute to our projects and help out?
You: Well, I learned a bit of Python, I am able to put together some basic statistical model and also do some data preprocessing.

Let me stop here for a second and explain what is going on behind the scenes of this situation, which probably happens hundreds of times every day. You are having a hard time selling yourself, and the interviewer is having a hard time evaluating your qualities. You are both stuck for a simple reason - you have listed skills instead of your capability to contribute.

Now let’s do a similar conversation and we will reuse one of the levels from above as your answer:

Interviewer: Great to see your interest. So how can you contribute to our projects and help out?
You: I am able to act as an “Observer” on your Data Science projects. This means that I cannot directly contribute to their productive code pipeline, but I am able to help out with inputs and outputs. Due to my extensive background in Retail Banking, I know which features might be interesting to collect about customers and use in models, and also how to apply the outputs of your project in Marketing Campaigns.

Do you feel the difference between the two conversations? Instead of talking about skills, you started to talk about the contribution you can make. You also drew a very clear line around what you cannot do, so the interviewer will find it easier to evaluate you. Moreover, you had space to relate to your previous field - we all come from somewhere. This is what really matters for a Citizen Data Scientist - showing a possible contribution that will directly impact the project and the benefit coming out of it.

2.10 Learning Without Math

Mathematics is undoubtedly the foundation of many fields, Data Science being one of them. If you master it, incredible beauty reveals itself in a lot of mathematical formulae, and a deep understanding of many concepts can be achieved. It has only one problem…“it’s f_____g hard to learn!!!”. I personally know only two kinds of people. The first group are the ones who are comfortable with math - they have liked the subject for many years and have the incredible patience to learn it. Most of the time they have also invested in some form of formal education in math. The second group are people who don’t like it, and whenever they meet it, they search for intuitions and workarounds just to get the piece of work done. Among my friends, 95% belong to the latter group, and that could be a sad fact for a field like Data Science, which would like to grow while being based heavily on Math. Or is it?

From my perspective, the first generation of Data Scientists are primarily the ones who are fine with math, and the second generation belongs to the non-likers. Now let me tell you, it’s perfectly fine and natural to not like math and not be comfortable with it. Who should adjust? Should it be 95% of the population, or the teaching methods of Data Science? Let me bet on the latter…

That is why this book will not teach anything through mathematics. I honestly believe that in order to train a Citizen Data Scientist, math is not only unnecessary but also inadvisable, based on my argument above.

P.S. There will be a few statistical concepts, so don’t blame me if they look like math. As I promised though, there will be no formulas, just intuition.

2.11 Level 1 Reached

Congratulations! Without even realizing it, you have reached the first level, which I call “Being Aware”. You are now aware of what Data Science is, what it isn’t, what it struggles with and what the role of the Citizen Data Scientist is. Here are some key takeaways which you should have from this chapter:

  1. Remember the five cornerstones upon which this book is built. If you plan to continue reading, they will be helpful to keep in mind.
  2. Data Science is an art of creating valuable information out of data.
  3. No training or online course can really prepare you for the real world. Hopefully this book will help you to get into the field as soon as possible and apply yourself, based on the level which you have reached.