1.2 What is a Data Scientist?

Data scientists are a new breed of analytical data experts who have the technical skills to solve complex problems, and the curiosity to explore what problems need to be solved. They are part mathematician, part computer scientist and part business trend-spotter. They straddle both the business and IT worlds, armed with mathematical and programming skills.

The Requisite Skill Set

A data scientist needs a blend of skills in three major areas:

  1. Mathematics
  2. Computing and Software Engineering
  3. Business

Figure 1.2: Quality of Data Scientists

Mathematical Skills

At the heart of mining data insights and building data products is the ability to view the data through a quantitative lens. There are textures, dimensions, and correlations in data that can be expressed mathematically. Finding solutions with data becomes a brain teaser of heuristics and quantitative techniques. Solutions to many business problems involve building analytic models grounded in hard math, where understanding the underlying mechanics of those models is key to building them successfully.

A common misconception is that data science is all about statistics. While statistics is important, it is not the only type of math used. First, there are two branches of statistics: classical statistics and Bayesian statistics. When most people refer to stats they are generally referring to classical statistics, but knowledge of both types is helpful. Furthermore, many inferential techniques and machine learning algorithms lean on knowledge of linear algebra. For example, a popular method for discovering hidden characteristics in a data set is singular value decomposition (SVD), which is grounded in matrix math and has much less to do with classical stats. Overall, it is helpful for data scientists to have breadth and depth in their knowledge of mathematics.
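
To make the SVD remark concrete, here is a minimal sketch in Python using NumPy. The small ratings matrix, the variable names, and the choice of k = 2 are assumptions made purely for illustration; they are not taken from any real data set.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are items.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Factor the matrix as U * diag(s) * Vt; s holds the singular values,
# sorted from largest to smallest.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k largest singular values to get a low-rank approximation.
# k = 2 is an arbitrary choice for this illustration.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Share of the total squared "energy" captured by the first k components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()

print("singular values:", np.round(s, 3))
print("share captured by first", k, "components:", round(float(explained), 3))
```

The rows of Vt can be read as the "hidden characteristics" mentioned above: each is a pattern across items, and the matching singular value indicates how much of the data that pattern accounts for.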

Computing and Software Engineering Skills

Data is now collected, stored and processed with computers. As the quantity of data grows, we are entering what is termed the big data era. Conventional ways of processing data face unprecedented challenges. A personal computer may not be adequate to handle big data. Distributed storage, cloud computing and computer clusters have become commonly used platforms for data access and control. Configuring and setting up a basic computing environment is a common skill needed to handle data.

Data processing languages like R or Python, and a database query language like SQL, are the languages most commonly used in data processing and analysis. It is also important to have strong software engineering knowledge, so that you can comfortably handle large volumes of data logging and develop data-driven products.
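
As a small illustration of how a general-purpose language and SQL work together, the sketch below runs a SQL aggregate query from Python. It uses an in-memory SQLite database from the Python standard library as a stand-in for a server database; the `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a server DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5), ("carol", 8.0)],
)

# Aggregate in SQL: total spend per customer, largest total first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()

# Continue the analysis in Python with the query result.
for customer, total in rows:
    print(f"{customer}: {total:.2f}")

conn.close()
```

The division of labor is the point: the database does the grouping and summing, and Python receives only the summarized rows.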

Data scientists need to use new technology in order to wrangle enormous data sets and work with complex algorithms, to code or prototype quick solutions, and to interact and integrate with complex data systems. Core languages associated with data science include SQL, Python, R, and SAS. On the periphery are Java, Scala, Julia, and others. But it is not just a matter of knowing language fundamentals. A data scientist is a technical ninja, able to creatively navigate their way through technical challenges in order to make their code work.

Along these lines, a data scientist is a solid algorithmic thinker, with the ability to break down messy problems and recompose them in ways that are solvable. This is critical because data scientists operate within a lot of algorithmic complexity. They need a strong mental grasp of high-dimensional data and tricky data control flows, and full clarity on how all the pieces come together to form a cohesive solution.

Strong Business Acumen

It is important for a data scientist to be a tactical business consultant, an operations narrator and a storyteller. Working so closely with data, data scientists are positioned to learn from data in ways no one else can. They can understand the language the data speaks and listen to the story the data tells. That creates the responsibility to translate observations and discoveries into shared knowledge, and to contribute to strategy on how to solve core business problems. This means a core competency of data science is using data to cogently tell a story. Rather than simply presenting raw data, a data scientist should present a cohesive narrative of problem and solution, using data insights as supporting pillars, that leads to guidance.

Having this business acumen is just as important as having acumen for technology, math and algorithms. There needs to be clear alignment between data science projects and business goals. Ultimately, the value doesn’t come from data, math, and tech themselves. It comes from leveraging all of the above to build valuable capabilities and have strong business influence.

How to Become a Data Scientist?

Many people are starting to position themselves for a career in data science, not only for the good job opportunities but also for the excitement of working in a technology field with freedom for experimentation and creativity. To get to this position, you need solid foundations.

A conventional way of becoming a data scientist is to choose a university that offers a data science degree, or to register for courses in data science and analytics. If you cannot do either, the remaining option is to learn by yourself.

The knowledge and skills you should have are:

  • Statistics and machine learning. A good understanding of statistics is vital for a data scientist. You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc. Statistics knowledge will also help you understand when different techniques are (or aren’t) a valid approach. Machine learning (ML) is a powerful tool when you are involved in a big data project. Algorithms are the core of machine learning. Although many implementations exist in R or Python libraries and are convenient to use, you still need a thorough understanding of how the algorithms work and when it is appropriate to use each one.
  • Coding languages such as R or Python. It is essential that a data scientist be competent with a number of computing and data querying languages, such as R, Python and SQL.
  • Databases such as MySQL and Postgres. Data is generally stored in a database, so it is important to have the skills needed to access and control data in a database management system (DBMS). Commonly used systems include MySQL (https://www.mysql.com/) and Postgres (https://www.postgresql.org/), alongside desktop tools such as Access and Excel.
  • Visualization and reporting technologies. Visualizing and communicating data is incredibly important, especially at companies that are making data-driven decisions, or companies where data scientists are viewed as people who help others make data-driven decisions. Communicating means describing your findings, or the way techniques work, to audiences both technical and non-technical. Visualization can be immensely helpful here, so it pays to be familiar with data visualization tools like matplotlib, ggplot, or d3.js. Tableau and dashboarding tools have also become popular for data visualization. It is important to be familiar not just with the tools needed to visualize data, but also with the principles behind visually encoding data and communicating information.
  • Big data platforms like Hadoop (https://hadoop.apache.org/) and Spark (https://spark.apache.org/). Although many data science projects can be tried, or at least prototyped, on a PC or workstation, in reality most large-scale data analysis is done on advanced computing platforms such as distributed infrastructure or computer clusters. These platforms mostly deploy the Hadoop ecosystem.
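
As one small illustration of the statistics item in the list above, the following sketch computes maximum likelihood estimates for a normal model using only the Python standard library. The sample values are made up, and the assumption that they are normally distributed is exactly that, an assumption for the sake of the example.

```python
import math
import statistics

# Hypothetical sample, assumed to come from a normal distribution.
sample = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]

# For a normal model, the maximum likelihood estimate of the mean is the
# sample mean, and the estimate of the variance is the average squared
# deviation (dividing by n, not n - 1).
mu_hat = statistics.fmean(sample)
var_hat = statistics.pvariance(sample, mu=mu_hat)


def log_likelihood(data, mu, var):
    """Log-likelihood of the data under a normal model with the given mean and variance."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)


# The likelihood at the estimate should beat the likelihood at a shifted mean.
best = log_likelihood(sample, mu_hat, var_hat)
shifted = log_likelihood(sample, mu_hat + 0.2, var_hat)

print("MLE mean:", round(mu_hat, 4))
print("MLE variance:", round(var_hat, 4))
print("log-likelihood at the MLE vs. shifted mean:", round(best, 4), round(shifted, 4))
```

Note that the maximum likelihood variance is slightly smaller than the usual unbiased sample variance, which divides by n - 1. Knowing why the two differ is the kind of understanding the statistics item calls for.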

If you don’t want to learn these skills on your own, take an online course or enroll in a bootcamp, as you are doing now. It not only gives you the opportunity to gain knowledge quickly but also gives you the chance to network with other people in a similar situation. Connecting with other people can lead you into an online community, and all of this will help you gain fine-grained, insider knowledge of solving problems.