Chapter 3 The Text of Quran and Rules

I strongly advise readers to skip this preparatory chapter and start with the chapters that present the coding evidences of the 19 System, especially Chapter 4 and Chapter 5. This whole chapter covers technical details about the text used, the rules and so on, which you can revisit later when you have more time and energy.

3.1 About the Text of Quran

The Hafs mushaf, which is used by around 97% of Muslims around the world, was used for the text analysis of Quran in this book. In this chapter, I will discuss all the details of the text of Quran on which I performed the analysis. In 2019, I first started analyzing possible 19 based codings over the text of Quran to see if there is a system based on rules. I discovered many interesting codings that suggested a system. I finished the analysis, wrote the book and published it online on September 7, 2019. I downloaded the text of Quran from the server of the Tanzil project. From the Tanzil server at http://tanzil.net/download/, I downloaded and used the “Simple Clean” text type, which does not include haraka marks, pause marks or the other options available on the server. Those options are not part of the main text anyway. Tanzil is the common source for most online Quran related web sites. I did not cherry pick a version of my choice to download. The reason I downloaded the “Simple Clean” text is that, as I did not know Arabic, I searched for a tutorial that performs text analysis over the text of Quran with the R programming language and found this one (Sharaf 2019b), which was using that “Simple Clean” text type. So, the text type was not my choice; I followed the tutorial, which was already using the most suitable text version for the text analysis of Quran, as it contains only the letter characters of words and no other helper marks.

When I downloaded the text of Quran from http://tanzil.net/download/ in 2019, the released text was Version 1.0.2. That version was released on May 12, 2008 and stayed online, without any error, until I downloaded it and used it for the analysis of this book. Sometime after the publication of my book, which is based on that long available Version 1.0.2, they decided to make a few “optional” non-essential changes, based on their choice, and released the result as text Version 1.1 on February 12, 2021. You can see the details of the update at https://tanzil.net/updates/. They changed one particular word, “ba’dama”, so that it is written as two words, “ba’da ma”, instead of the previous one-word form. I researched that particular word for about a year and explain all the details I gathered and my conclusions in Chapter 8.1. Based on that research, I concluded that “ba’dama” should be written as one word and decided to continue using text Version 1.0.2 of Quran. You can currently download that version from the github account of this book at https://github.com/quran2019/Quran19/blob/master/quran-simple-clean.txt.

Let’s not digress further and continue with the text analysis and the topics around it. Interested people mostly ask about the text I used and why I did not use one of the classical printed texts with the haraka helper marks. First of all, I did not cherry pick this text. As I mentioned before, I found and followed this tutorial (textminingthequran.com/tutorial/quran.html), where the author used this simple text of Quran for his own text analysis. However, according to my investigations, this text turned out to be the most suitable one, because I wanted to count only the natural numbers of the text, which are the numbers of chapters, verses, words and letters. I did not need the helper marks added to the text, and this text appears to be just right for this. The author of the tutorial did his PhD on the text analysis of Quran, and it is reasonable to consider that he might have selected this text because it is the most suitable one for the text analysis of Quran.

Among almost all the old or modern written Hafs mushafs, the only difference should appear when we count the number of letters, as some words might be written slightly differently without any change in their meaning. Scribal errors are also known to exist in the earlier hand written manuscripts. When it comes to the earlier manuscripts, there is no single manuscript that serves, up to its last letter, as the reference for all the manuscripts that follow it. The famous early Uthmanic mushafs might be considered as the reference in general, but no original copy of the first written ones exists and the currently available ones are “at best, the copies of the copies of the copies of the original ones, written at least 4 generations later”, as Dr. Shehzad Saleem replied to my question during his interview with me about the history of Quran (Saleem and Altay 2020). The main medium of the transmission of Quran is oral, through the tradition of memorization by a large number of dedicated people, called Hafez, who memorize Quran in its entirety. Therefore, any written text that precisely follows the classic Arabic grammar rules and is approved by Muslim scholars as fully representing Quran might be used to see whether there is a systematic coding design over it. Basically, the hypothesis is that “there might be a rule based system in the text of Quran, if the text represents Quran precisely and fully with the condition that the text is written with respect to classic Arabic grammar rules”. This is because we learn from Quran that it is written in Arabic, and therefore we should have the text written in classic Arabic with its grammar rules.

Quran, 12.2: “Indeed, We have sent it down as an Arabic Quran so that you may understand.” (Translations are from quran.com, the Clear Quran).

Quran, 41.44: “Had We revealed it as a non-Arabic Quran, they would have certainly argued, “If only its verses were made clear ˹in our language˺. What! A non-Arabic revelation for an Arab audience!” Say, ˹O Prophet,˺ “It is a guide and a healing to the believers. As for those who disbelieve, there is deafness in their ears and blindness to it ˹in their hearts˺. It is as if they are being called from a faraway place.” ”

Quran, 13.37: “And so We have revealed it as an authority in Arabic. And if you were to follow their desires after ˹all˺ the knowledge that has come to you, there would be none to protect or shield you from Allah.”

Therefore, in my opinion, any text, old or modern, that is written according to the classic Arabic grammar and accepted by scholars as fully representing the Quran (Hafs recitation/mushaf) might be considered for text analysis, to see if there is a meaningful, rule based, systematic coding design over the text that suggests that the text is intact and unchanged. In the end, it is the recitation (the Zikr / the Reminder) that was declared to be protected by God in verse 15:9, not the written text of the book.

Quran, 15.9: “It is certainly We Who have revealed the Reminder, and it is certainly We Who will preserve it.”

Without being aware of much of the above information at the beginning of my study for this book in early 2019, I had hypothesized that the text of Quran could well be designed by God to its last letter, and I discovered codes that include information on all the letters, as can be seen in Chapter 4.2 and in some other chapters. I present them separately from the main evidences, which are presented in Chapter 4, Chapter 4.1.2 and Chapter 5. The main evidences do not include letter information but only word information, and thus might be considered universal, as they are applicable to all the common Hafs mushafs from history to the present day. Although the spelling of some words might differ slightly, the words themselves are the same, as the spelling differences do not change their meaning. Because of this, the codes that include letter information are always considered weaker than the codes that do not, since the data without letter information can be considered intact and reliable for all the common Hafs mushafs that have been used by the majority of Muslims until now.

The codes with letter information are considered valid only if there is no apparent error in any letter with respect to the classic Arabic language. We know that, in some cases, the same word can be written in various ways with slightly different letters while giving the same meaning in the Hafs recitation. People who recite Quran correctly overlook those letter differences among the texts, as they know how to recite the word anyway. If even a single letter is found to be clearly erroneous in the text I used, then, most probably, all the letter based codings that I present in this book would collapse. So, feel free to reach out to me if you are 100% sure that there is an error in any of the 332837 letters of the text of Quran I used. Nonetheless, make sure to consult an expert on the Arabic language before coming to a conclusion, as there are already known variations in the writing of some words. Because of these variations, I consider the codings in my book that include letter information as weaker evidences compared to the others. Since the first publication of the book, I have never received a solid mathematical objection to the codes I presented, but those who tried to falsify them always attacked the letter information first and ignored the main code evidences. I suggest that such readers ignore all the letter based evidences, which I have already designated as weaker, and first read the main evidences and the other evidences that do not include letter information.
If they see no issue with the main ones, then they have already witnessed the 19 system based on chapter, verse and word information, which indicates an intact design over the text and hence suggests that the text is intact and unchanged too. Certainly, these evidences have their own limits in proving that the text of Quran is intact and unchanged, as they can only reach the resolution of the number of words or letters per verse. However, for those without negative biases, they should be sufficient, as there is already one common Hafs manuscript used by Muslims. The question is not which manuscript to select among various text versions, but whether or not there is a clear rule based system over the full text of the common Hafs manuscript.

Since the letter based evidences that I present are not the main evidences of this book, anyone who has doubts about the accuracy of the letters might simply opt to ignore the related chapters. This will not change the fact that the text of Quran appears to have a design down to the number of words in each verse, along with its chapters and verses, when considering the main miraculous 19 based system I present in Chapter 4, Chapter 4.1.2 and Chapter 5.

However, those who would like to genuinely consider the letter based evidences of this book might well have some questions, judging by the feedback I received after the first publication of this book. A question might occur to some with extreme doubt: did the volunteers of tanzil.net deliberately create such a design, considering the letter based evidences? Well, I know the answer on my side for sure: I discovered all the evidences in this book myself. I cannot know the other side, the volunteers of tanzil.net, for sure, as I do not know them. However, I am personally 100% sure that they did not do such a thing and, even more, that they could not produce such a design, because, as I noted above, they wrote the text with respect to the classic Arabic grammar rules and the early Hafs manuscripts. The only way one could claim such an artificial design is to spot a rule in the text that is not meaningful, or is redundant, with respect to the classic Arabic grammar rules. In such a case, one might claim that the text contains artificial data, or letters that go against the classic Arabic grammar rules and might have been written deliberately to produce such a design. However, I could observe no such case. I initially discovered the evidences in 2019 and, as they state on their web site and as I quote in the chapter below, they did not receive any typo report since the release of the first version of the text in 2008. So, I see no reason for such extreme, paranoid doubt. I address those extreme thoughts anyway, in case some extremely skeptical people wonder about them. Another proof that Tanzil has nothing to do with such 19 based codings is that they changed the writing of one word in their latest updated text in February 2021. They had been writing it as a single word and changed it into two words, 13 years after their previous update in 2008.
If their change were correct with respect to the classic Arabic grammar rules, then the codes with word information in this book would collapse, as they mostly include word count information. But after my long research, it turned out that they changed it wrongly, after the publication of my book. I discuss this phenomenal event and my research on the topic in Chapter 8.1 and Chapter 8.1.1.

3.1.1 Fetching Quran’s Text into R

I will first show how to fetch the text of Quran into the R programming environment and then compute some of the important descriptive natural numbers of the text. I will not analyze those numbers in this chapter but leave that to the chapters about evidences, such as Chapter 4.1.2.

I downloaded the text of Quran from tanzil.net/download in early 2019 to start the analysis of the text of Quran. Arabic speakers do not need punctuation, namely helper marks, and thus the early texts have no punctuation of the kind we see today in most modern Quran texts. I am not able to go into more technical detail as I do not know Arabic. After searching the internet, I found this tutorial (textminingthequran.com/tutorial/quran.html) and followed it for some initial text analysis on the text of Quran. To download the text of Quran that I used in this book, as instructed in the mentioned tutorial, we go to tanzil.net/download and select the “Simple Clean” option without any pause marks or other options. For this, deselect the “Include pause marks” and “Include sajdah signs” options. These marks are not part of the simple and pure text but were added later to ease reading. Then select the “Text (with aya numbers)” option and click the download button. This way you will get a pure text of Quran on your computer with verse (“aya” in Arabic) numbers. In this file, each line has the “sura” number (sura is Arabic for chapter), followed by the verse number, followed by the actual text of the verse. You can open it in a text editor to see its structure better. The fields are separated by a bar “|” symbol, and there are also some copyright notes at the end of the text, which were added by tanzil.net as part of their work. Therefore, we remove them first, as you will see below in the part about fetching the downloaded text file into the R programming environment. They do not mention the text version on the download page, but when I downloaded the text in 2019, it was Version 1.0.2. All the logs of the updates of the text can be seen on this page: https://tanzil.net/updates.
That version was released on May 12, 2008. Around 18 months after the publication of my book, which is based on text Version 1.0.2, they updated the text on February 12, 2021 as Version 1.1, with a few “optional” changes as they describe on the updates page. In this book, I still use text Version 1.0.2; a long discussion of their update and its consequences can be read in Chapter 8.1 and Chapter 8.1.1. In the current chapter, I will continue with the technical details of how I analysed the text. So, if you use the tanzil.net/download page, you will get Version 1.1 and need to revert the few changes mentioned to obtain text Version 1.0.2. Or, you can use the text that I downloaded, used and backed up in the github account of this book at this url: https://github.com/quran2019/Quran19/blob/master/quran-simple-clean-v.1.0.2.txt. The text at that link is the Quran text Version 1.0.2 that I downloaded from tanzil.net/download in early 2019 and used for the text analysis of this book. You can download it and run the codes in this book over it.
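The file layout described above (chapter number, verse number and verse text separated by a bar, with comment lines starting with “#” at the end) can be illustrated with a minimal sketch. The three sample lines and the names sample_lines and sample_df are only illustrative assumptions; the two verses are copied from the beginning of the text and the “#” line stands in for the copyright notes.

```r
# Toy sample (assumption): two verse lines in "chapter|verse|text" layout,
# followed by one footer line of the kind appended to the file.
sample_lines <- c(
  "1|1|بسم الله الرحمن الرحيم",
  "1|2|الحمد لله رب العالمين",
  "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK"
)

# Parse the sample from memory instead of from a file on disk.
sample_df <- read.csv(text = paste(sample_lines, collapse = "\n"),
                      header = FALSE, sep = "|",
                      stringsAsFactors = FALSE, encoding = "UTF-8")

# Three columns are created; the footer line has no verse number,
# so its second field is read as NA.
str(sample_df)
```

Because the footer line has no bar separated fields, it ends up as a row whose verse number is missing, which is what makes such rows easy to spot and remove later.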

Everything is constructed from smaller components. When it comes to the text of a book, the main components are letters, words, verses and chapters. Therefore, I first searched these components for any 19 based coding design patterns. As you will see in the chapters of evidences, there is indeed a 19 based coding system over the text of Quran with rule based patterns. I think I might not have discovered the whole system yet, but the evidences I provide are sufficient to witness a beautiful and strong 19 based coding system in the text of Quran. This book contributes it to the literature for the first time, as further proof supporting the belief that Quran is intact and unchanged from the beginning.

In the rest of this chapter, you will see how to fetch the text of Quran into the R programming environment, perform some basic text processing and obtain the important descriptive numbers of Quran.

An important point to keep in mind is that the structure of the text of Quran is unique and unlike that of any other book we might come across. All the verses are numbered, but there are also 112 unnumbered, repeated Basmala verses, one in front of each of 112 of the 114 chapters. In addition, the first verse of the first chapter is a numbered Basmala verse. Moreover, Chapter 9 does not have any Basmala verse, in contrast to the other 113 chapters, but interestingly Chapter 27 has a Basmala within its content. So, the text appears to have a non-standard structure and a deliberate organization. This leaves the question of whether the total descriptive numbers (e.g. verses, words and letters) should refer only to the numbered verses or to the numbered and unnumbered Basmala verses together. In my analysis, I analyzed both at the same time and concluded that both types of descriptive numbers are valid, because the system I discovered suggests that they were designed together, as you will see in the chapters of evidences. So, I will always refer to both types of the text of Quran: first, the 6236 numbered verses only and, second, the numbered and unnumbered verses together, in other words all the 6348 verses.

Before going any further, one small point I want to mention about the text of Quran concerns punctuation. Arabic speakers do not need punctuation marks, and thus they are not part of the original text; they were added later as helper marks, especially for non-Arabic speakers such as Turkish speaking people like me. I would not normally even mention this point, but when I had a discussion with one of my non-Muslim friends, he made an argument based on those punctuation marks and claimed that the text of Quran has been changed like other earlier holy books. I was quite surprised that even such a simple fact was being used as an argument; as he found out when he later looked into it, he had got this misinformation from the internet. Anyway, in short, punctuation marks are not part of the original text of Quran and I did not include them in the text analysis. Anyone who has doubts about this should feel free to speak to an Arabic expert.

To be able to process and analyze the text of Quran, we need to fetch it into an R console. For that, I used this very useful tutorial (Sharaf 2019b). As described in that tutorial, in 2019 I downloaded the text file of Quran from http://tanzil.net/download/ with the “Simple Clean” option, without any pause marks or other options, and with the “Text (with aya numbers)” option selected. I then saved the text file in a folder named “data” under the current working directory. Since they updated the text with a tiny change on February 12, 2021, as explained above, you can now download text Version 1.0.2 only from the github link to the repository of this book that I provided above.

We can run the R script below to read the Quran text file into the R programming environment:

options(tinytex.verbose = TRUE)
# Read the bar-separated file: chapter number | verse number | verse text
tenzil <- read.csv("data/quran-simple-clean-v.1.0.2.txt", header = FALSE,
                   stringsAsFactors = FALSE, encoding = "UTF-8", sep = "|")

#head of the text
head(tenzil)
##   V1 V2                     V3
## 1  1  1 بسم الله الرحمن الرحيم
## 2  1  2  الحمد لله رب العالمين
## 3  1  3          الرحمن الرحيم
## 4  1  4         مالك يوم الدين
## 5  1  5 إياك نعبد وإياك نستعين
## 6  1  6  اهدنا الصراط المستقيم
#tail part of the text
tail(tenzil)
##                                                                         V1 V2
## 6259 #    of the text, and shall be reproduced appropriately in all files  NA
## 6260     #    derived from or containing substantial portion of this text. NA
## 6261                                                                     # NA
## 6262                #  Please check updates at: http://tanzil.net/updates/ NA
## 6263                                                                    #  NA
## 6264 #==================================================================== NA
##      V3
## 6259   
## 6260   
## 6261   
## 6262   
## 6263   
## 6264

The head of the data looks clean: it is a table with three columns (a data frame in R). Each row contains one verse, in order. The first column contains the chapter numbers (‘sura’ numbers in Arabic terms). The second column has the verse numbers (‘aya’ numbers in Arabic terms). The third column has the verses in Arabic.

However, tanzil.net appended some license related information to the tail of the text. Let’s find where it starts so that we can remove it.

tenzil[6234:6238,]
##                                                                         V1 V2
## 6234                                                                   114  4
## 6235                                                                   114  5
## 6236                                                                   114  6
## 6237                 # PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK NA
## 6238 #==================================================================== NA
##                            V3
## 6234     من شر الوسواس الخناس
## 6235 الذي يوسوس في صدور الناس
## 6236          من الجنة والناس
## 6237                         
## 6238

As we see, the last verse appears at row index 6236. The last chapter (sura) in Quran is Chapter Nas (sura-al-nas). Since it is very short, almost all Muslims have memorized it and sometimes recite it in their regular daily prayers. Although I am no expert in Arabic, I can easily recognize it from the Arabic writing, even without the helper punctuation that we non-Arabic speakers need to read the Arabic text of Quran with the correct pronunciation. So, even without Arabic knowledge, I can confidently identify the verse at index 6236 as the last verse of Quran. As instructed in the tutorial, I remove everything after the last verse to drop the additional information added by tanzil.net. I keep the resulting table as an R object named quran, as a reminder that this object holds all the words and letters of the Quran that we hold in our hands.

quran <- tenzil[1:6236,]
tail(quran)
##       V1 V2                                       V3
## 6231 114  1 بسم الله الرحمن الرحيم قل أعوذ برب الناس
## 6232 114  2                                ملك الناس
## 6233 114  3                                إله الناس
## 6234 114  4                     من شر الوسواس الخناس
## 6235 114  5                 الذي يوسوس في صدور الناس
## 6236 114  6                          من الجنة والناس
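The cut-off row can also be located programmatically instead of by eye. The following is only a sketch on a toy table, not the method used in the tutorial: the data frame toy and the names is_verse, last_verse_row and trimmed are illustrative assumptions, and the idea relies on the footer rows having NA in the verse number column, as seen in the output above.

```r
# Toy stand-in (assumption) for the table read from the file: two verse rows
# followed by two footer rows, in the same V1/V2/V3 layout as above.
toy <- data.frame(
  V1 = c("114", "114",
         "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK", "#===="),
  V2 = c(5L, 6L, NA, NA),
  V3 = c("الذي يوسوس في صدور الناس", "من الجنة والناس", "", ""),
  stringsAsFactors = FALSE
)

# Verse rows are the rows whose verse number was parsed successfully.
is_verse <- !is.na(toy$V2)
last_verse_row <- max(which(is_verse))

# Guard: no footer row may appear before the last verse row.
stopifnot(all(is_verse[seq_len(last_verse_row)]))

# Keep everything up to and including the last verse row.
trimmed <- toy[seq_len(last_verse_row), ]
print(nrow(trimmed))
```

Applied to the full table, the same idea would be expected to give 6236 as the last verse row, in agreement with the manual inspection.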

As we see, the first column of the table holds the chapter number, the second column the verse number and the third column the text of each verse. Basically, each row of this table holds the information about one verse, in order. Let’s first give the columns proper English names and view the table again.

colnames(quran) = c("chapter", "verse", "text")
head(quran)
##   chapter verse                   text
## 1       1     1 بسم الله الرحمن الرحيم
## 2       1     2  الحمد لله رب العالمين
## 3       1     3          الرحمن الرحيم
## 4       1     4         مالك يوم الدين
## 5       1     5 إياك نعبد وإياك نستعين
## 6       1     6  اهدنا الصراط المستقيم

Let’s also add the row names of the table as a column. The row names should keep the verse order from beginning to end, which we can also test later on. Since Quran has an order and the order is important, we give every verse an independent index number, from the first verse to the last, to be able to access the verses correctly later. I name the column of these independent verse indices “VerseI”. It is important to remember that this column is assigned by us independently and is not the per-chapter verse number that we use when we quote a specific verse in Quran (as in the formal notation 74:30).

quran <- cbind(as.numeric(rownames(quran)), quran)
colnames(quran)[1] = "VerseI"
quran$VerseI <- as.numeric(quran$VerseI)
quran$verse <- as.numeric(quran$verse)
quran$chapter <- as.numeric(quran$chapter)
head(quran)
##   VerseI chapter verse                   text
## 1      1       1     1 بسم الله الرحمن الرحيم
## 2      2       1     2  الحمد لله رب العالمين
## 3      3       1     3          الرحمن الرحيم
## 4      4       1     4         مالك يوم الدين
## 5      5       1     5 إياك نعبد وإياك نستعين
## 6      6       1     6  اهدنا الصراط المستقيم

3.1.2 The categories of the main descriptive numbers

It is helpful to define the categories of the main descriptive numbers of the text of Quran, because I will keep referring to these categories while defining the rules of the coding system in Chapter 3.2 and later when presenting evidences with them.

There are four main descriptive categories for Quran: the numbers of chapters, verses, words and letters. Every category except chapters has two versions, one counting the numbered verses only and one counting the numbered and unnumbered verses together, because of the unique structure and organization that we observe in the book of Quran in our hands today. Since chapters only have a numbered version, this category has a single number. The other three categories have two numbers each. Therefore, there are 1 + 3 × 2 = 7 main descriptive numbers of the text of Quran. In this chapter, I will compute them blindly, using only computer programming, and provide the code so that you can also reproduce and test those numbers from the text of Quran.

3.1.3 Number of chapters and verses, and sanity checks on the text

Let’s first check whether the indices of verses are in the correct order. I will perform two tests here: first the sum of all the index numbers, second a simple plot, to see whether we observe what we expect. We know that the sum of the integers from 1 to n is n × (n + 1) / 2 (Gauss’s formula). Therefore, the sum of the verse indices must be 6236 × 6237 / 2 = 19446966. Make sure the numeric precision on your computer is capable of dealing with large numbers. Alternatively, you can use a big number calculator such as this one (“Good Calculators” 2019).
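The expected value can be reproduced independently of the Quran text with a small sketch that compares the closed form with a direct summation. The variable names n, gauss_sum and brute_sum are illustrative assumptions and not part of the code of this book.

```r
n <- 6236

# Closed form for 1 + 2 + ... + n.
gauss_sum <- n * (n + 1) / 2

# Direct summation of the same integers.
brute_sum <- sum(seq_len(n))

# Both routes should agree.
print(gauss_sum == brute_sum)

# R stores numbers as doubles by default, which represent integers exactly
# up to 2^53, so a value of this size is far from any precision problem.
print(gauss_sum < 2^53)
```

This only confirms the arithmetic of the expected sum; the test on the actual index column follows.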

Now let’s write code to sum the index values in the table and see whether the result matches 19446966, as it must.

cat("the sum of verse index column VerseI is ",sum(quran$VerseI))
## the sum of verse index column VerseI is  19446966
if(sum(quran$VerseI) == 19446966) 
  print("The sum of the indices of verses are correct and passed this test.")
## [1] "The sum of the indices of verses are correct and passed this test."

The sum test passed for the indices of the verses. But the sum alone cannot rule out that somewhere in the column some indices are duplicated or exceed the maximum 6236. So let’s also look at the minimum, which must be 1, the maximum, which must be 6236, the median, which must be 3118.5 (the median of an even number of values) and, most importantly, the number of unique indices, which must be 6236 too.

print(paste("Minimum, Maximum, Median of VerseI is ",
    min(quran$VerseI),",", max(quran$VerseI),",",median(quran$VerseI)))
## [1] "Minimum, Maximum, Median of VerseI is  1 , 6236 , 3118.5"
print(paste("Number of unique values of VerseI is ",length(unique(quran$VerseI))))
## [1] "Number of unique values of VerseI is  6236"

As we see, all are as expected. Let’s plot the column and see whether it increases monotonically from 1 to 6236.

plot(quran$VerseI)

The plot is exactly as expected: the index increases monotonically from the first verse of chapter 1 to the last verse of chapter 114. Let’s now check in a similar way whether the chapter numbers are in order, and observe the maximum number.

plot(quran$chapter)

print("Unique chapter numbers regarding the order of text: ")
## [1] "Unique chapter numbers regarding the order of text: "
print(unique(quran$chapter))
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114

The plot is exactly as expected: it increases overall but has horizontal segments at some points because of the long chapters.
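The two visual checks can be complemented by checks that need no plot. The sketch below uses short toy vectors instead of the real columns, and the helper functions is_consecutive_index and is_stepwise_nondecreasing are hypothetical names introduced only for this illustration.

```r
# Toy stand-ins (assumption) for an index column and a chapter column.
verse_index <- 1:7
chapter_col <- c(1, 1, 1, 2, 2, 3, 3)

# Hypothetical helper: TRUE when x is exactly 1, 2, ..., length(x).
is_consecutive_index <- function(x) {
  length(x) > 0 && all(x == seq_along(x))
}

# Hypothetical helper: TRUE when x never decreases and never skips a value,
# i.e. every step between neighbouring entries is 0 or 1.
is_stepwise_nondecreasing <- function(x) {
  all(diff(x) %in% c(0, 1))
}

print(is_consecutive_index(verse_index))
print(is_stepwise_nondecreasing(chapter_col))

# A gap in the index or a chapter out of order is detected.
print(is_consecutive_index(c(1, 2, 4)))
print(is_stepwise_nondecreasing(c(1, 3, 2)))
```

On the real table, the first helper would be applied to the VerseI column and the second to the chapter column; a step of 0 corresponds to the horizontal segments of the plot and a step of 1 to the start of a new chapter.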

So, based on the mathematical and visual tests, we conclude that the text of Quran analysed in this book, which was downloaded from tanzil.net, has no apparent error regarding verse and chapter indices and is ready for further analysis.

These sanity checks are purely based on blind mathematical computations on the text of Quran. Therefore, they also blindly provide reproducible counts: the number of chapters is 114, which will be denoted by ‘c’, and the number of numbered verses is 6236, which will be denoted by ‘v’ in the R code for the rest of the book. If we include the unnumbered Basmala verses as well, we get the total number of verses, numbered and unnumbered (112 Basmalas) together, which is 6236 + 112 = 6348, and we denote this number by ‘V’ in the R code. We will demonstrate how those numbers are part of the 19 based coding system in Chapter 4.1.2.

c <-  114
v <- nrow(quran) #6236
V <- v+112       #6348
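The number 112 used above follows from the structure described earlier: every chapter has an unnumbered Basmala in front of it except chapter 1, where the Basmala is the numbered verse 1:1, and chapter 9, which has none. A minimal sketch of this arithmetic is given below; the longer variable names are illustrative assumptions, chosen so that they cannot be confused with the short names c, v and V used in the book.

```r
n_chapters <- 114
n_numbered_verses <- 6236

# Chapters preceded by an unnumbered Basmala: all chapters except
# chapter 1 (numbered Basmala, verse 1:1) and chapter 9 (no Basmala).
chapters_with_unnumbered_basmala <- setdiff(seq_len(n_chapters), c(1, 9))
n_unnumbered_basmalas <- length(chapters_with_unnumbered_basmala)

# Numbered and unnumbered verses together.
n_all_verses <- n_numbered_verses + n_unnumbered_basmalas

print(n_unnumbered_basmalas)
print(n_all_verses)
```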

3.1.4 Numbered and Unnumbered Verses of Quran

Because of its importance for the analysis, I will test the numbered verses alone and the numbered and unnumbered verses together, as they are two types of the text of Quran. Quran has an extraordinary structure compared to what we are used to seeing in other books, and in that sense it is also unique. There are 114 chapters, all numbered from 1 to 114 and deliberately ordered as they are in Quran. The verses are also numbered, from 1 to the end of each chapter. For example, the first and most famous chapter, al-Fatiha, has chapter number 1 and 7 verses, numbered from 1 to 7. There is no concept of paragraphs as in our usual books, but in a sense each sentence or group of sentences corresponds to a numbered verse. This is also very useful when we refer to a specific verse in Quran, as two numbers are enough to identify it precisely. For example, 19:38 refers to chapter 19, verse 38. Some verses are long and some are very short.

However, there is another interesting structural feature in the text of Quran. There is a special and repeated verse, Basmala (بسم الله الرحمن الرحيم), which is the first verse of the first chapter, namely 1:1 by its formal chapter and verse numbers. The translation of Basmala is “In the name of God, the merciful, the compassionate”. In Quran, this special verse is written before all the chapters except Chapter 9, and Muslims recite it before starting to recite any Quran verse. In a sense, it is like a key. Since the Basmala of Chapter 1 is the numbered verse 1:1 and Chapter 9 has no Basmala, the remaining 114-2=112 Basmalas are unnumbered. This makes Quran a book consisting of numbered and unnumbered verses. Therefore, in the text analysis of Quran, I will consider two categories that represent its two fundamental structures. The first represents the whole Quran, that is, all 6348 numbered and unnumbered verses. The second represents only the 6236 numbered verses, which means without the repeated unnumbered Basmalas in front of the chapters. Since both the numbered and the unnumbered verses are part of Quran, they both might have a role in the 19 based coding system of Quran. I will discuss this and show some evidence for it in Chapter 4.1.2.
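
The verse counts of the two categories can be recomputed with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names are hypothetical and chosen so that they do not overwrite the objects ‘c’, ‘v’ and ‘V’ used in the book.

```r
n_chapters <- 114
# Chapter 9 has no Basmala, and in Chapter 1 the Basmala is the numbered
# verse 1:1, so two chapters contribute no unnumbered Basmala.
n_unnumbered <- n_chapters - 2
n_numbered   <- 6236
n_total      <- n_numbered + n_unnumbered
cat("Unnumbered Basmalas:", n_unnumbered, "\n")
cat("Numbered and unnumbered verses:", n_total, "\n")
```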

Now, I will generate a second R object that keeps only the numbered verses of Quran for further analysis. Let’s first look at the first verses of some chapters in the main object, which keeps the whole Quran.

print(quran$text[quran$verse<=2 & quran$chapter==1])
## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"
quran$text[quran$verse<=2 & quran$chapter==2]
## [1] "بسم الله الرحمن الرحيم الم"        "ذلك الكتاب لا ريب فيه هدى للمتقين"
quran$text[quran$verse<=2 & quran$chapter==3]
## [1] "بسم الله الرحمن الرحيم الم"     "الله لا إله إلا هو الحي القيوم"

As we see, tanzil.net has included each Basmala inside the first verse of its chapter, contrary to the written hard copy texts of Quran that Muslims have had from the beginning until now. tanzil.net probably did so for computational reasons, to simplify the organization of the text. So, in the following, I will separate those unnumbered Basmalas and generate the second main table, which keeps only the numbered verses of Quran.

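require(data.table, quietly = TRUE)
quran <- data.table(quran)
nQuran <- quran
# Strip the Basmala prefix from first verses; the trailing space in the pattern
# leaves verse 1:1, which consists of the Basmala alone, unchanged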
nQuran$text <- gsub("^بسم الله الرحمن الرحيم ","",quran$text)
nQuran$text[nQuran$verse<=2 & nQuran$chapter==1]
## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"
nQuran$text[nQuran$verse<=2 & nQuran$chapter==2]
## [1] "الم"                               "ذلك الكتاب لا ريب فيه هدى للمتقين"
nQuran$text[nQuran$verse<=2 & nQuran$chapter==3]
## [1] "الم"                            "الله لا إله إلا هو الحي القيوم"
Basmala <-quran$text[1] #keep Basmala verse for future reference

As we see, in this second object only the first chapter has the Basmala as its numbered first verse; the other chapters do not have it, as this table contains only the numbered verses. I also assigned the special verse Basmala to the R object “Basmala” for future reference in this book. So this R object, denoted by nQuran, represents only the numbered verses of Quran. Let’s add one more column that keeps the chapter and verse numbers together for easy referencing in the data table that I present below. It is a dynamic table: one can either search it or download the analysis-ready text of the 6236 numbered verses.

require(data.table, quietly = T)
require(DT)
## Loading required package: DT
datatable(nQuran,
          caption = 'Table head of the 6236 numbered verses of Quran.',
          extensions = c('Buttons'),
          options = list(pageLength = 5, autoWidth = TRUE,
                           dom = 'Blfrtip',buttons = c('excel', 'csv')
                         ), rownames= FALSE)
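
As a side check on the Basmala removal step earlier in this section, the same idea can be tried on a small toy table. This is a minimal sketch under stated assumptions: the toy verses are invented, and an ASCII transliteration of the Basmala (basmala_toy) stands in for the Arabic text only to keep the example portable. It illustrates why the anchored pattern with a trailing space removes the prefix from first verses while leaving a verse that is exactly the Basmala untouched.

```r
# Toy stand-in for the Tanzil layout (invented data, ASCII transliteration).
basmala_toy <- "bsm allh alrhmn alrhym"
toy <- data.frame(
  chapter = c(1, 1, 2, 2, 9, 9),
  verse   = c(1, 2, 1, 2, 1, 2),
  text    = c(basmala_toy,                 # 1:1 is the Basmala itself
              "second verse of chapter one",
              paste(basmala_toy, "alm"),   # Basmala prepended to verse 2:1
              "second verse of chapter two",
              "first verse without a prefix",
              "second verse of chapter nine"),
  stringsAsFactors = FALSE
)

# Anchored pattern with a trailing space, as in the removal step above.
pattern   <- paste0("^", basmala_toy, " ")
stripped  <- sub(pattern, "", toy$text)
n_removed <- sum(stripped != toy$text)     # number of verses that changed

print(stripped[toy$chapter == 2 & toy$verse == 1])
print(n_removed)
```

On the real text, the analogous count of changed verses would be expected to equal the 112 unnumbered Basmalas stated above.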

For future reference and for chapter level analysis of the numbered verses, I will also generate another table from this one, which holds the chapter indices and the number of verses per chapter, as follows. I will assign this table to the R object dfVC for future reference in this book.

require(data.table, quietly = T)
require(DT)
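# For each chapter, record the number of its last verse,
# which equals the number of numbered verses in that chapter
versecomb <- c()
for(j in 1:114){
  i <- which(nQuran$chapter==j)   # row indices of chapter j
  versecomb <- c(versecomb, nQuran$verse[i[length(i)] ])
}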

dfVC <- data.table(cbind(c(1:114), versecomb))
colnames(dfVC) <- c("Chapter_index","Verse_sum")
datatable(dfVC,
          caption = 'Table: The chapter indices and corresponding sum of verses of numbered 6236 verses.',
          extensions = c('Buttons'),
          options = list(pageLength = 5, autoWidth = TRUE,
                           dom = 'Blfrtip', buttons = c('excel', 'csv')
                         ),rownames= FALSE)
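
The per-chapter verse counts can also be cross-checked in base R without a loop. The sketch below is not the book's method; it uses an invented toy table and compares two independent ways of counting: the last verse number in each chapter and the number of rows per chapter. The object name dfVC_toy is hypothetical.

```r
# Invented toy table: three chapters with 3, 2 and 1 verses.
toy <- data.frame(chapter = c(1, 1, 1, 2, 2, 3),
                  verse   = c(1, 2, 3, 1, 2, 1))

last_verse <- tapply(toy$verse, toy$chapter, max)  # last verse number per chapter
row_count  <- as.vector(table(toy$chapter))        # number of rows per chapter

dfVC_toy <- data.frame(Chapter_index = as.integer(names(last_verse)),
                       Verse_sum     = as.vector(last_verse))

# Both counts should agree, and their total should equal the number of rows.
stopifnot(all(dfVC_toy$Verse_sum == row_count))
stopifnot(sum(dfVC_toy$Verse_sum) == nrow(toy))
print(dfVC_toy)
```

Applied to nQuran, the same two checks would be expected to hold with a total of 6236.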

3.1.5 Some Text Mining and Tables of the Text

Let’s prepare a more comprehensive data table that keeps some further numerical information about the text of Quran, using the text mining tools of R. Following the tutorial in (Sharaf 2019a), I used the ‘tokenizers’ text mining R package (Mullen et al. 2018) to extract each word from the text of Quran and to count the words and their letters.

I prefer to keep this book accessible to all readers and thus will not explain each line of the code chunk below in detail. In short, it computes the numbers of words and letters in both types of the text of Quran. I will keep using these R objects in the coming chapters as needed.

require(tokenizers, quietly = T)
#All words in 6236 numbered verses
words <- unlist(tokenize_words(nQuran$text))
w <-  length(words) # should be 77797
cat("Number of words in 6236 numbered verses is ", w)
## Number of words in 6236 numbered verses is  77797
#number of letters in numbered verses
letters <-  sapply(words, nchar)
l <- sum(letters) #should be 330709
cat("Number of letters in 6236 numbered verses is ", l)
## Number of letters in 6236 numbered verses is  330709
#All words in numbered and unnumbered 6348 verses
Words <- unlist(tokenize_words(quran$text))
W <- length(Words) #should be 78245
cat("Number of words in numbered and unnumbered 6348 verses is ", W)
## Number of words in numbered and unnumbered 6348 verses is  78245
Letters <-  sapply(Words, nchar)
L <- sum(Letters) #should be 332837
cat("Number of letters in numbered and unnumbered 6348 verses is ", L)
## Number of letters in numbered and unnumbered 6348 verses is  332837
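
The four totals above can be cross-checked against each other. The two text types differ only by the 112 unnumbered Basmalas, and the Basmala consists of 4 words and 19 letters, so the differences in word and letter counts should be 112 times 4 and 112 times 19. The sketch below restates the printed totals under hypothetical names, so that the objects w, l, W and L used in the book are not overwritten.

```r
# Totals copied from the output above (hypothetical variable names).
w_numbered <- 77797;  l_numbered <- 330709   # 6236 numbered verses
w_all      <- 78245;  l_all      <- 332837   # all 6348 verses

n_basmala       <- 112  # unnumbered Basmalas
basmala_words   <- 4    # words in the Basmala
basmala_letters <- 19   # letters in the Basmala

stopifnot(w_all - w_numbered == n_basmala * basmala_words)
stopifnot(l_all - l_numbered == n_basmala * basmala_letters)
cat("Word difference:", w_all - w_numbered, "\n")
cat("Letter difference:", l_all - l_numbered, "\n")
```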

So far, we have obtained the total numbers of words and letters in both types of the text of Quran. Let’s now compute the number of words and letters per verse and add this information to the table of the numbered verses of Quran as follows.

vwords<- c()
vletters <- c()
for(i in 1:nrow(nQuran)){
  tmpw <- unlist(tokenize_words(nQuran$text[i]))
  vwords <- c(vwords,length(tmpw))
  vletters <- c(vletters,sum(nchar(tmpw)))
}
nQuran<- cbind(nQuran[,1:3],vwords, vletters,nQuran[,4])
colnames(nQuran)[6] <- "text"
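
The per-verse counting in the loop above can also be written in vectorized form. The following is a minimal base R sketch, not the book's method: it splits on whitespace with strsplit instead of calling tokenize_words, which is a simplifying assumption that is only reasonable for text containing letters and spaces alone. The toy strings, including an ASCII transliteration of the Basmala, are invented for illustration.

```r
# Invented toy verses (ASCII only, for portability).
toy_text <- c("bsm allh alrhmn alrhym", "alm", "one  two three")

# Split each verse on runs of whitespace; the result is one vector per verse.
tokens <- strsplit(trimws(toy_text), "[[:space:]]+")

vwords_toy   <- lengths(tokens)                                    # words per verse
vletters_toy <- vapply(tokens, function(tk) sum(nchar(tk)), integer(1))  # letters per verse

print(data.frame(text = toy_text, vwords = vwords_toy, vletters = vletters_toy))
```

Avoiding the repeated c() calls inside a loop mainly matters for speed on long tables; the resulting counts are intended to be the same kind of per-verse numbers as vwords and vletters above.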

require(data.table)
require(DT)

tmpN <- nQuran
tmpN$CV <- paste(nQuran$chapter,nQuran$verse, sep = ":")


datatable(tmpN,
          caption = 'Table of numbered 6236 verses of Quran',
          extensions = c('Buttons'),
          options = list(pageLength = 5, 
                            autoWidth = TRUE,
                           dom = 'Blfrtip',
                            buttons = c('excel', 'csv')
                         ), rownames= FALSE)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html