Chapter 2 Folder Structure

2.1 Organizing files

The best way to organize your code is to write a package.

Organizing your code is the first and foremost thing you should learn. Because as the project grows and multiple files are put into a folder it gets harder to navigate the code. A proper folder structure definitely helps in these times. I Just couldn’t emphasis it enough that best way to organize your code is to write a package. But even when you are not planning to write a package. There are best practices to make it readable and make a smooth navigation.

2.2 Create Projects

It’s such a minor thing to say but I still till date see code like this:

setwd("c://myproject_name/")

It was a good practice like 5 - 6 years ago. Now Rstudio has a feature to create project.

new project

Once you create a project it is easier to manage your files and folders and it’s easier to give it somebody as well. It has virtually the same effect but then you can use Rstudio a little better. It’s something I recommend to every user regardless of the skill level.

2.3 Naming files

I data science most common problem is that we don’t change the file names of excel or csv files provided by business people. And most of the time those file names are totally abbreviated with spaces in between and multiple cases like Total Sales Mike 202002-AZ1P2R.csv. This name is useful for the MIS or Business Analyst as they have a different way of organizing files then yours. They might do it because they have to keep a record of different people and have to provide it anytime asked. But as a Data Scientist your work is entirely different. You are not delivering files you are writing code. Let me reiterate this fact YOU ARE WRITING CODE. In most of the scenarios Data Science is more like programming less like science. Even though it has proportion of both of them. Using fundamentals of programming practices will help you out in long term. So change such file names to sales_data_mike_feb2020.csv or something similar. There are no right or wrong names just what makes more sense to a new user.

There is a trick about naming conventions:

  1. use all lower case or upper case ( helps you in never forgetting the cases )
  2. use underscore in between ( Because file names are mostly long Camel Or Pascal cases may confuse users)
  3. make the name as general as possible ( make sure a newcomer should be able to understand it without any problem)
  4. In choosing a name there are no wrong answers only confusing ones

2.4 Folders Based on File-Type

A Very common practice is to keep different file types in different folder. It’s based on a principle called Seperation of concern so that different individual can take care of different parts of a project without worrying about other parts of the project.

One of the main mistake I see people writing code like this.

DBI::dbGetQuery(conn,
"
select 
    count(*) as numbers,
    max(colname) as maxSome,
    min(colname) as minSome,
  from
    tablename
  group by
    col1,
    col2,
    col3
  order by 
    numbers
")

or codes like this.

shiny::HTML(
"
   <p>At Mozilla, we’re a global community of</p>

    <ul> <!-- changed to list in the tutorial -->
      <li>technologists</li>
      <li>thinkers</li>
      <li>builders</li>
    </ul>

    <p>working together to keep the Internet alive and accessible, so people worldwide can be informed contributors and creators of the Web. We believe this act of human collaboration across an open platform is essential to individual growth and our collective future.</p>

    <p>Read the <a href=\"https://www.mozilla.org/en-US/about/manifesto/\">Mozilla Manifesto</a> to learn even more about the values and principles that guide the pursuit of our mission.</p>
"
)

This is a bad coding style. Every time I see this type of code I realize that the person doesn’t believe that either the code will change or It will be extended. There is nothing permanent in the programming neither code, nor frameworks and not even languages. If you keep this type of code in separate SQL files or html files you can easily edit them later, code will be more easier to read and there will be a separation of concern. Tomorrow if you need help in SQL or HTML, a UI designer or a Database designer can look into your code without getting bogged down in R code. It makes bringing more people to the team easier.

2.5 Creating Sub-folders

On bigger projects simple folder structure tend to become more confusing. This is the main concern I have with the data folder every data scientist create and put all the files he has in that single folder. In these scenarios it’s better to have a sub-folder for different file types or may be different roles.

Like you can create sub-folders based on file-types like CSV’s, json, rds etc.. or you can even create sub-folders based on roles or needs… Like all the data related to one tab or one functionality goes in one folder and so on… There has to be a logical consistency in the folder structure. It’s primarily for you to not get lost in your own folders that you created and secondary for people working with you to understand your code and help in places you need help.

2.6 Conclusion

You have to create folders and everything has to be arranged in. Keep everything as organized as you keep your house. There are a certain principles that will help you in it.

  1. Create projects
  2. Name the files properly
  3. Create a file for different language
  4. create sub-folders wherever you fill necessary.