Chapter 7 Data Management
Most of us are data scientists, data engineers, data analysts, or something close to data. The R language is specifically tailored to people working with data, so I assume that guess is correct to some degree. You should therefore also understand certain principles that go with data management in general.
We won’t discuss the details here, only the little tricks that help you work effectively with data.
7.1 Keep a Copy of your Data
Always keep a copy of the original data. While working on the data you can mess it up to the point where you can’t get the old version back. So always keep a backup copy of the data in a folder of your choice.
This is the primary reason Excel has an undo button but Access doesn’t. R, for its part, has copy-on-modify semantics: a modification produces a new copy, and unless you explicitly assign the result back to the original name, R leaves the original object untouched. That is excellent for EDA and analysis in general. There is a huge section on copy-on-modify to come, so at this moment you just need to know that you should keep a copy of the original data somewhere safe.
You will mess up the data big time at some point, so make sure you have a backup.
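Here is a minimal sketch of both ideas in base R: a backup of the raw file on disk, and copy-on-modify protecting an object in memory. The folder layout and the file name `sales.csv` are made up for illustration, and a temporary directory stands in for a real project folder.

```r
# Hypothetical paths: a temporary folder stands in for your project directory.
project_dir <- file.path(tempdir(), "project")
raw_file    <- file.path(project_dir, "sales.csv")
backup_dir  <- file.path(project_dir, "backup")
dir.create(backup_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this is the original data you received.
write.csv(head(mtcars), raw_file)

# 1. Keep an untouched copy on disk.
#    overwrite = FALSE refuses to replace a backup that already exists.
backup_file <- file.path(backup_dir, basename(raw_file))
file.copy(raw_file, backup_file, overwrite = FALSE)

# 2. Copy-on-modify: changing `working` does not touch `original`.
original <- read.csv(raw_file, row.names = 1)
working  <- original
working$mpg <- working$mpg * 2

original$mpg[1]  # still the value that was read from the file
working$mpg[1]   # twice that value
```

The in-memory protection only lasts for the session, which is why the copy on disk still matters.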
7.2 Don’t use numbers for columns
In R you can use numbers to refer to columns, but just because you can doesn’t mean you should. I have seen many people refer to the columns of a data frame by number. It’s not okay to do that, because you lose the context when you read the code.
mtcars[1, c(1, 3:5, 8)]
##           mpg disp  hp drat vs
## Mazda RX4  21  160 110  3.9  0
Can you tell me precisely which columns I am working on? I have seen people write this type of code in production and get it rejected during code review. Now compare it with this one.
mtcars[1, c("mpg", "disp", "hp", "drat", "vs")]
##           mpg disp  hp drat vs
## Mazda RX4  21  160 110  3.9  0
Which one is more readable and clearer to a new user? It will also help you when you look back at the code after 6 months. Trust me, you will not remember the column numbers at all. It’s hard enough to remember column names.
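Readability is not the only problem. Numbers also break silently when the column order changes. A small sketch, where the reversed `mtcars` stands in for whatever an upstream change might hand you:

```r
# A reordered version of mtcars, as you might get after a change upstream.
reordered <- mtcars[, rev(names(mtcars))]

by_number <- reordered[1, c(1, 3:5, 8)]
by_name   <- reordered[1, c("mpg", "disp", "hp", "drat", "vs")]

names(by_number)  # different columns now, and R raised no error
names(by_name)    # the columns you asked for
```

The numeric version runs without complaint and returns the wrong columns, which is the worst kind of bug to find later.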
7.3 Keep Meaningful and proper column names
Most of the data we read comes from either an Excel file or a poorly designed database, so we see column names containing spaces, dots, and all sorts of funny things. Remember these rules; they will help you even when you are designing a database and naming its columns.
- Assume everything is case sensitive
- Use only lowercase letters, it will help you when you push it in DB
- Don’t use special characters except underscore
- Please don’t use spaces at all
There are functions in R like make.names, but it only makes names syntactically valid; it won’t give your columns meaningful names. When you have very little time, go ahead with make.names, but otherwise please spend at least 1 to 2 hours naming the data. This small exercise will help you for the entire project.
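To make the rules above concrete, here is a rough sketch of a helper that applies them. `clean_names()` is a name I made up for this example and is not part of base R. It handles the mechanical part only; choosing names that mean something is still your job.

```r
# clean_names() is a hypothetical helper, not a base R function.
clean_names <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of spaces, dots, symbols -> "_"
  x <- gsub("^_+|_+$", "", x)      # drop leading and trailing underscores
  make.unique(x, sep = "_")        # disambiguate duplicates
}

messy <- c("Customer Name", "Order.Date", "Total Amount ($)", "customer name")

make.names(messy)   # valid R names, but mixed case and dots remain
clean_names(messy)  # lowercase, underscores only
```

Note that this sketch does not handle names that start with a digit, which many databases also dislike.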
7.4 Use Databases
R is limited by RAM when handling data. Even though R has ways of dealing with larger-than-RAM data, I advise you to use a database instead. I would highly recommend the disk.frame package if you are working with larger-than-RAM data. It’s based on data.table and is pretty fast. But it too has limitations, and it’s absolutely not a replacement for a database.
There are tons of tools out there, from disk.frame to ff to Spark and what not, but they aren’t a replacement for a real database. Even files stored on your system aren’t a proper replacement for a database. The first and foremost requirement of being a data scientist is the ability to work with a database. Sooner or later you will need it.
Please store your data in a proper database and learn some basics of database architecture. You don’t need to learn everything, but a little understanding of normalization will take you a long way in your journey.
Use a database in all your projects and save the credentials in a config file so that you can change them in production. Even if it’s just a prototype, it’s okay to start with a database, so that you don’t have to change the code when you migrate. The sooner you move your files to a database, the better it is for you.
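One way to keep credentials out of the code is sketched below using only base R. The file name, the keys, and the `DB_PASSWORD` environment variable are all assumptions made for this example; dedicated packages exist for this too, and the point is only that the connection details live outside your scripts.

```r
# Hypothetical config file; in a real project keep it out of version control.
config_path <- file.path(tempdir(), "db_config.dcf")
writeLines(c(
  "host: localhost",
  "port: 5432",
  "dbname: analytics",
  "user: report_user"
), config_path)

# read.dcf() returns a character matrix with one column per key.
read_db_config <- function(path) {
  cfg <- read.dcf(path)
  as.list(cfg[1, ])
}

cfg <- read_db_config(config_path)
cfg$host
cfg$dbname

# Keep the password out of the file and read it from the environment instead.
# DB_PASSWORD is an assumed variable name; NA means it has not been set.
password <- Sys.getenv("DB_PASSWORD", unset = NA)
```

Moving to production then means swapping the config file, not editing the code.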
Using databases will bring advantages like:
- Early integration of DB in the project
- Understand the data types of your data as early as possible
- Call only the amount of data you need thus saving RAM
- Push computations to DB with packages like modeldb and dbplyr
- Basic computation like min, max and group by can be done more effectively
- Never worry about losing your data or corrupting it
There will always be a need for a database in any project whatsoever. Please use one as early as possible.
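As a small illustration of pulling only what you need and letting the database do basic computations, here is a sketch with the DBI and RSQLite packages. The in-memory SQLite database and the table name `cars` are stand-ins for your real server and tables.

```r
library(DBI)

# An in-memory SQLite database stands in for your real server (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# The database does the grouping; only the summary rows come back to R.
summary_by_cyl <- dbGetQuery(con, "
  SELECT cyl, MIN(mpg) AS min_mpg, MAX(mpg) AS max_mpg, COUNT(*) AS n
  FROM cars
  GROUP BY cyl
  ORDER BY cyl
")
summary_by_cyl

dbDisconnect(con)
```

With a table of a few hundred million rows the query stays the same, and R still only receives one row per group.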
7.5 Use Efficient Packages
Working with data is the most memory-consuming task. This is where you should understand the scale of your work. If you are developing something for only 5-10 people, you can get away with almost anything. People will still judge from your app, API, or product whether R is a good language and whether to use it again, but even then I would not advise you to worry much about such a small use case. It is different when you have 100 to 200 users active every second. In that case the difference between dplyr and data.table can actually save the day.
There are packages that can help you save memory and gain speed at the same time. I can’t write an exhaustive list of them, but I can point you to some packages and the situations where they will be helpful.
No list of efficient packages can start without data.table. It’s a package you must learn. We have built apps that were roughly 3-5 times faster than their Python counterparts, and the sole reason was that we had data.table and they didn’t.
Despite what people say, it’s easy to learn and easy to use in day-to-day analysis. It can modify an object in place instead of creating a copy, and every function in the package is optimized for speed. It’s faster than even Spark for in-memory datasets. Read that statement again if you didn’t get how huge it is.
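A short taste of what that looks like, as a sketch on `mtcars`. The column name `kpl` and the conversion factor are mine, chosen only for illustration.

```r
library(data.table)

dt <- as.data.table(mtcars, keep.rownames = "model")

# := adds the column to dt itself; no assignment back to dt is needed.
# 0.4251 is an approximate miles-per-gallon to km-per-litre factor.
dt[, kpl := mpg * 0.4251]

# Filter, group and aggregate in a single call.
by_cyl <- dt[hp > 100, .(mean_kpl = mean(kpl), n = .N), by = cyl]
by_cyl
```

Because `:=` changes `dt` directly, remember the earlier advice about backups: the usual copy-on-modify safety net does not apply here.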
I have a book in progress for people trying to learn data.table. You can read it at this link:
The Matrix package can create and work with sparse matrices. Sparse matrices save a lot of RAM and a lot of compute power too. They are useful whenever you have a matrix with many 0 or empty values. Learn it and use it if possible.
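To see how much memory is at stake, here is a sketch that assumes the package in question is Matrix, which ships with standard R installations. The matrix size and the share of non-zero cells are arbitrary numbers for the example.

```r
library(Matrix)

set.seed(1)
n <- 1000
dense <- matrix(0, nrow = n, ncol = n)
dense[sample(length(dense), 5000)] <- 1  # 0.5% of the cells are non-zero

sparse <- Matrix(dense, sparse = TRUE)

object.size(dense)   # pays for all one million cells
object.size(sparse)  # pays only for the non-zero cells and their positions

# Ordinary operations still work on the sparse version.
sum(sparse) == sum(dense)
```

The more zeros your matrix has, the bigger the gap between the two sizes becomes.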
When you have data that is bigger than RAM but still small enough to be stored on disk, disk.frame is a good choice. For example, a 15 GB file on an 8 GB laptop is an excellent candidate. I would still recommend using a database, or Spark for that matter, but disk.frame has its uses.
modeldb is a package that can compute linear regression, logistic regression and some other calculations directly in the database. It builds a long SQL statement that computes the result of the algorithm, sends it to the database, and gets the result back. It’s excellent when you have too much data to bring into R safely. In that case it’s better to let SQL do the computation than to pull the data into R.
dbplot is a package for creating plots directly from a SQL database. It sends the computation to the database and gets back only the results needed to draw the plot. Again, it’s useful when the data is more than R can handle in RAM. It’s an excellent package for EDA and should be used by any serious practitioner.
Lastly, R has an interface to Spark. The folks at RStudio have made it so easy to install and use that you would feel you are working on a different programming language. It’s definitely worth using when you have truly big data.
This is not an exhaustive list of the packages available in R for handling memory efficiently. R has many more packages that handle a particular type of data or situation very well. But these are the must-haves for anybody working with data.
This chapter focuses mainly on good practices for working with data in R and on handling it with efficient packages. To recap what we have learned so far:
- Keep a copy of your data
- Don’t use numbers to refer to columns
- Keep good column names that will help you throughout the project
- Use databases
- Use efficient packages that make proper use of the computer resources available to you