1.4 Tools Used in Doing a Data Science Project

Data Scientists rely on traditional statistical methods, which form the backbone of many Machine Learning algorithms. They also use Deep Learning algorithms to generate predictions. The tools and programming languages they most commonly use are described below.

R

R (https://www.r-project.org/) is a scripting language tailored for statistical computing and graphics. It is widely used for data analysis, statistical modeling, time-series forecasting, clustering, and similar tasks. Although R is mostly used for statistical operations, it also supports object-oriented programming. R is an interpreted language and is popular across many industries, particularly for Data Science projects.

Python

Like R, Python (https://www.python.org/) is an interpreted, high-level programming language. It is a versatile, general-purpose language used for both Data Science and software development, and it has gained popularity due to its ease of use and code readability. As a result, Python is widely used for data analysis, Natural Language Processing, and Computer Vision. Its ecosystem includes third-party packages for numerical computing, statistics, and plotting, such as NumPy, SciPy, and Matplotlib, as well as more advanced packages for Deep Learning such as TensorFlow, PyTorch, and Keras. We use Python for data mining, wrangling, visualization, and developing predictive models. This breadth makes Python a very flexible programming language.
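
As a small illustration of the readability mentioned above, the sketch below computes a few descriptive statistics for a list of numbers. It deliberately uses only Python's standard-library statistics module instead of NumPy or SciPy, so that it runs without extra installation. The function name summarize and the sample values are made up for this example.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Illustrative values only; not taken from any real dataset.
monthly_sales = [120, 135, 150, 110, 160]
summary = summarize(monthly_sales)
print(summary)
```

For larger datasets, the same kind of summary would more typically be computed with a package such as NumPy.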

SQL

SQL stands for Structured Query Language. Data Scientists use SQL to extract, manage, and manipulate data stored in relational databases, which organize data in tables. Being able to extract data from a database is the first step towards analyzing it. For example, a Data Scientist working in the banking industry might use SQL to extract customer information. While relational databases use SQL, NoSQL databases are a popular choice for non-relational or distributed data. NoSQL has been gaining popularity due to its scalability and flexible schema design, and many NoSQL databases are open source. MongoDB, Redis, and Cassandra are some of the popular NoSQL databases.
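
The banking example above can be sketched with a small query. The snippet below is a minimal illustration, not a production setup: it uses Python's built-in sqlite3 module with an in-memory database, and the customers table, its columns, the sample rows, and the helper high_balance_customers are all hypothetical.

```python
import sqlite3


def high_balance_customers(conn, min_balance):
    """Return (name, balance) rows for customers at or above min_balance."""
    cursor = conn.execute(
        "SELECT name, balance FROM customers "
        "WHERE balance >= ? ORDER BY balance DESC",
        (min_balance,),  # parameter binding avoids building SQL by hand
    )
    return cursor.fetchall()


# Build a tiny in-memory database with made-up rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, balance) VALUES (?, ?)",
    [("Asha", 1200.0), ("Ben", 300.0), ("Chen", 5400.0)],
)
conn.commit()

rows = high_balance_customers(conn, 1000)
print(rows)
```

The SELECT statement is the part that carries over to other relational databases; only the connection code would change.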

Hadoop

Big Data is another trending term. It refers to the management and storage of huge amounts of data, which may be structured or unstructured. A Data Scientist must be familiar with complex data and with tools that manage the storage of massive datasets. One such tool is Hadoop (https://hadoop.apache.org/). Hadoop is open-source software that combines a distributed storage system, the Hadoop Distributed File System (HDFS), with a processing model called MapReduce. Several other Apache projects, such as Pig, Hive, and HBase, build on Hadoop and together form its ecosystem. Due to its ability to process colossal amounts of data quickly, its scalable architecture, and its low-cost deployment, Hadoop has grown to become one of the most popular software platforms for Big Data.
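
To give a feel for the MapReduce model mentioned above, here is a single-machine sketch of the classic word-count example in plain Python. It does not use Hadoop or any Hadoop API; the function names map_phase, shuffle_phase, reduce_phase, and word_count are hypothetical and only mirror the conceptual steps that a real cluster would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data tools", "big data storage"])
print(counts)
```

In a real deployment the map and reduce steps would run on different nodes, with the framework handling the grouping and the movement of data between them.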

Tableau

Tableau (https://www.tableau.com/) is Data Visualization software specializing in graphical analysis of data. It allows users to create interactive visualizations and dashboards, which makes it a good choice for showing trends and insights in the data through interactive charts such as treemaps, histograms, and box plots. An important feature of Tableau is its ability to connect to spreadsheets, relational databases, and cloud platforms. This allows Tableau to process data directly, making it easier for users.

Weka

For Data Scientists looking to get familiar with Machine Learning in action, Weka (https://www.cs.waikato.ac.nz/ml/weka/) can be an ideal option. Weka is generally used for Data Mining but also includes various tools required for Machine Learning tasks. It is open-source software with a graphical user interface (GUI), which lets users interact with it without writing any code.