# Chapter 3 About the Text of Quran

In this chapter, I first show how to load the text of the Quran into the R programming environment and then compute some of its important descriptive numbers. I do not analyse those numbers here; that is left to the evidence chapters, such as Chapter 5.

Everything is made up of smaller components. For the text of a book, the main components are letters, words, verses and chapters. I therefore first searched these components for any 19-based coding pattern. As you will see in the evidence chapters, I find that there is indeed a 19-based coding system over the text of the Quran. I do not think I have uncovered the whole system yet, but I believe the evidence I provide is sufficient to show a beautiful and strong 19-based coding system in the text of the Quran, which this book offers as further support for the belief that the Quran is intact and unchanged from the beginning.

In the rest of this chapter, you will see how to load the text of the Quran into the R programming environment, perform some basic text processing, and obtain its important descriptive numbers.

An important point to keep in mind is that the structure of the text of the Quran is unique and unlike that of other texts we might come across. All the verses are numbered, but there are also 112 unnumbered, repeated Basmala verses, one in front of each of 112 of the 114 chapters. The first verse of the first chapter is itself the numbered Basmala verse. Chapter 9, in contrast to the other 113 chapters, has no Basmala in front of it; interestingly, a Basmala also appears within the text of a numbered verse, in Chapter 27 (27:30). So there appears to be a non-standard structure and a deliberate organization. This raises the question of whether the total descriptive numbers (e.g. verses, words and letters) should be computed over the numbered verses only or over the numbered and unnumbered Basmala verses together. In my analysis I examined both and found that both types of descriptive numbers are valid, because they are designed together, as you will see in the evidence chapters. I will therefore always report both types: first for the numbered verses only, and second for the numbered and unnumbered verses together.

## 3.1 Fetching Quran’s Text into R

To process and analyse the text of the Quran, we need to load it into an R session. For that, I followed the very useful tutorial in (Sharaf 2019b). As described in that tutorial, I downloaded the text file of the Quran from http://tanzil.net/download/ with the “Simple Clean” option, without pause marks or any other options, and with the “Text (with aya numbers)” output format selected. I then saved the downloaded text file in a folder named “data” under the current working directory.

We can run this to read Quran text file into R programming environment:

``tenzil <- read.csv("data/quran-simple-clean.txt", header=FALSE, stringsAsFactors=FALSE, encoding="UTF-8", sep="|")``

Let’s see the head and tail of this text file to see if there is anything added to describe the text.

``head(tenzil)``
``````##   V1 V2                     V3
## 1  1  1 بسم الله الرحمن الرحيم
## 2  1  2  الحمد لله رب العالمين
## 3  1  3          الرحمن الرحيم
## 4  1  4         مالك يوم الدين
## 5  1  5 إياك نعبد وإياك نستعين
## 6  1  6  اهدنا الصراط المستقيم``````
``tail(tenzil)``
``````##                                                                         V1
## 6259 #    of the text, and shall be reproduced appropriately in all files
## 6260     #    derived from or containing substantial portion of this text.
## 6261                                                                     #
## 6263                                                                    #
## 6264 #====================================================================
##      V2 V3
## 6259 NA
## 6260 NA
## 6261 NA
## 6262 NA
## 6263 NA
## 6264 NA``````

The head of the data looks clean: it is a table with three columns (a data frame in R), and each row contains one verse, in order. The first column contains the chapter numbers (‘sura’ numbers in the Arabic term). The second column contains the verse numbers (‘aya’ numbers in the Arabic term). The third column contains the verses in Arabic.

However, tanzil.net appends some license-related information to the end of the text file. Let’s find where the verses end so that we can remove it.

``tenzil[6234:6238,]``
``````##                                                                         V1
## 6234                                                                   114
## 6235                                                                   114
## 6236                                                                   114
## 6237                 # PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK
## 6238 #====================================================================
##      V2                       V3
## 6234  4     من شر الوسواس الخناس
## 6235  5 الذي يوسوس في صدور الناس
## 6236  6          من الجنة والناس
## 6237 NA
## 6238 NA``````
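The boundary between the verses and the license block can also be located programmatically rather than by eye. The following is a minimal sketch on a toy stand-in for the first column of `tenzil` (the vector `first_col` and its contents are made up for illustration); it assumes, as in the output above, that every appended line starts with `#`.

```r
# Toy stand-in for tenzil$V1: chapter numbers followed by "#" license lines.
first_col <- c("113", "114", "114",
               "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK",
               "#=====")

# Rows whose first field starts with "#" belong to the appended block.
is_comment <- startsWith(first_col, "#")

# The last verse row is the one just before the first comment row.
last_verse_row <- which(is_comment)[1] - 1
print(last_verse_row)
```

On the real data, the analogous expression would be `which(startsWith(tenzil$V1, "#"))[1] - 1`, which should agree with the row index read off the output above.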

As we see, the last verse appears at row index 6236. The last chapter (sura) of the Quran is Chapter Nas (sura al-Nas). Since it is very short, almost all Muslims have memorized it and often recite it in their regular daily prayers. Although I am no expert in Arabic, I can easily recognize it from the Arabic writing, even without the diacritical marks that we non-Arabic speakers need in order to read the Arabic text of the Quran with the correct pronunciation. So I can say with confidence that the verse at index 6236 is the last verse of the Quran. As instructed in the tutorial, I remove everything after the last verse to drop the additional license text added by tanzil.net. I keep the resulting table as an R object named quran, to remember that this object holds all the words and letters of the Quran that we hold in our hands.

``````quran <- tenzil[1:6236,]
tail(quran)``````
``````##       V1 V2                                       V3
## 6231 114  1 بسم الله الرحمن الرحيم قل أعوذ برب الناس
## 6232 114  2                                ملك الناس
## 6233 114  3                                إله الناس
## 6234 114  4                     من شر الوسواس الخناس
## 6235 114  5                 الذي يوسوس في صدور الناس
## 6236 114  6                          من الجنة والناس``````
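As an alternative to trimming by row index, `read.csv()` can be asked to skip the `#` lines while reading, through its `comment.char` argument. The sketch below demonstrates this on a small temporary file that mimics the layout of the Tanzil download. The file contents are placeholders, and it is an assumption that no verse text in the real file contains a `#` character.

```r
# Toy file mimicking the layout "chapter|verse|text", followed by a
# "#"-prefixed license block (the verse texts are placeholders).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("1|1|aaa bbb",
             "1|2|ccc",
             "2|1|ddd eee fff",
             "# license line",
             "#====="),
           toy_path)

# comment.char = "#" makes the reader ignore everything from "#" onwards,
# so lines consisting only of a comment are dropped.
toy <- read.csv(toy_path, header = FALSE, sep = "|", quote = "",
                stringsAsFactors = FALSE, encoding = "UTF-8",
                comment.char = "#",
                col.names = c("chapter", "verse", "text"))
print(toy)
```

With this approach the column names are set at read time, and no row index needs to be hard-coded.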

As we see, the first column of the table keeps the chapter number, the second column keeps the verse number and the third column keeps the text of each verse, so each row holds the information about one verse, in order. Let’s first give the columns descriptive English names and look at the table again.

``````colnames(quran) <- c("chapter", "verse", "text")
head(quran)``````
``````##   chapter verse                   text
## 1       1     1 بسم الله الرحمن الرحيم
## 2       1     2  الحمد لله رب العالمين
## 3       1     3          الرحمن الرحيم
## 4       1     4         مالك يوم الدين
## 5       1     5 إياك نعبد وإياك نستعين
## 6       1     6  اهدنا الصراط المستقيم``````

Let’s also add the row names of the table as a column. The row names should follow the verse order from beginning to end, which we will test later on. Since the Quran has an order and the order is important, we give each verse a running index number, from the first verse to the last, so that we can access the verses correctly later. I name this column of running verse indices “VerseI”. It is important to remember that this column is assigned by us and is distinct from the per-chapter verse numbers that we use when we quote a specific verse of the Quran (as in the formal notation 74:30).

``````quran <- cbind(as.numeric(rownames(quran)), quran)
colnames(quran)[1] <- "VerseI" # name only the new first column
quran$VerseI <- as.numeric(quran$VerseI)
quran$verse <- as.numeric(quran$verse)
quran$chapter <- as.numeric(quran$chapter)
head(quran)``````
``````##   VerseI chapter verse                   text
## 1      1       1     1 بسم الله الرحمن الرحيم
## 2      2       1     2  الحمد لله رب العالمين
## 3      3       1     3          الرحمن الرحيم
## 4      4       1     4         مالك يوم الدين
## 5      5       1     5 إياك نعبد وإياك نستعين
## 6      6       1     6  اهدنا الصراط المستقيم``````
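To make the distinction between the two kinds of index concrete, here is a small sketch on a toy table (the data frame `toy` and its contents are invented for illustration). It looks up the same row once by its running index and once by its chapter:verse reference.

```r
# Toy table: two chapters with 3 and 2 verses (texts are placeholders).
toy <- data.frame(chapter = c(1, 1, 1, 2, 2),
                  verse   = c(1, 2, 3, 1, 2),
                  text    = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)

# Running index over all rows, analogous to VerseI.
toy <- cbind(VerseI = seq_len(nrow(toy)), toy)

# Lookup by running index ...
by_index <- toy[toy$VerseI == 4, ]

# ... and by chapter:verse reference (here 2:1).
by_ref <- toy[toy$chapter == 2 & toy$verse == 1, ]

print(by_index)
print(identical(by_index, by_ref))
```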

## 3.2 The categories of the main descriptive numbers

It is helpful to define the categories of the main descriptive numbers of the text of the Quran here, because I will keep referring to these categories when defining the rules of the coding system in Chapter 4.1.2 and when presenting the evidence later.

There are four main descriptive numbers of the Quran: the numbers of chapters, verses, words and letters. Because of the unique structure and organization of the Quran as we have it in our hands today, each of these categories except chapters comes in two types: one counted over the numbered verses only, and one counted over the numbered and unnumbered verses together. Since the number of chapters has only a numbered version, that category has a single number. There are therefore 3 × 2 + 1 = 7 main descriptive numbers of the text of the Quran. In this chapter, I compute them blindly, by computer programming alone, and provide the code so that you can reproduce and test those numbers from the text of the Quran yourself.

## 3.3 Number of chapters and verses, and sanity checks on the text

Let’s first check whether the verse indices are in the correct order. I will perform two tests: first the sum of all the index numbers, and second a simple plot to see whether we observe what we expect. We know that the sum of the integers from 1 to n is n(n+1)/2 (the Gauss formula). Therefore, the sum of the verse indices must be 6236 × 6237 / 2 = 19446966. On your computer, make sure the numeric precision is capable of handling numbers of this size. Alternatively, you can use a big number calculator such as (“Big Number Calculator” 2019).
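The closed form can be checked independently of the data. The following short sketch compares the Gauss formula with a direct summation for n = 6236; the variable names are chosen for illustration only.

```r
n <- 6236

# Closed form for 1 + 2 + ... + n.
closed_form <- n * (n + 1) / 2

# Direct summation for comparison.
direct_sum <- sum(seq_len(n))

print(closed_form)
print(closed_form == direct_sum)

# R doubles represent integers exactly up to 2^53, far above this value.
print(closed_form < 2^53)
```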

Now let’s write some code to sum the index values in the table and see whether the result matches 19446966, as it must.

``cat("the sum of verse index column VerseI is ",sum(quran$VerseI))``
``## the sum of verse index column VerseI is  19446966``
``````if(sum(quran$VerseI) == 19446966)
print("The sum of the indices of verses are correct and passed this test.")``````
``## [1] "The sum of the indices of verses are correct and passed this test."``

The sum test passed for the verse indices. However, a correct sum alone does not rule out duplicated indices somewhere in the middle, or indices larger than the maximum of 6236. Let’s therefore look at the minimum, which must be 1, the maximum, which must be 6236, and the median, which must be 3118.5 (the median of an even number of values), and, most importantly, the number of unique indices, which must also be 6236.

``````print(paste("Minimum, Maximum, Median of VerseI is ",
min(quran$VerseI),",", max(quran$VerseI),",",median(quran$VerseI)))``````
``## [1] "Minimum, Maximum, Median of VerseI is  1 , 6236 , 3118.5"``
``print(paste("Number of unique values of VerseI is ",length(unique(quran$VerseI))))``
``## [1] "Number of unique values of VerseI is  6236"``

As we see all are correct as expected. Let’s plot it and see if it is monotonically increasing from 1 to 6236.

``plot(quran$VerseI)``

The plot is exactly as expected: the index increases monotonically from 1 to 6236. Let’s now check in a similar way whether the chapter numbers are in order, and then observe the maximum number.

``plot(quran$chapter)``
``print("Unique chapter numbers regarding the order of text: ")``
``## [1] "Unique chapter numbers regarding the order of text: "``
``print(unique(quran$chapter))``
``````##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114``````

The plot of the chapter numbers is also as expected: it increases overall, with horizontal segments where long chapters span many verses.
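The visual checks can be complemented with numeric ones. The sketch below uses two short toy vectors as stand-ins for the `VerseI` and `chapter` columns (the values are invented for illustration) and tests that the running index increases by exactly one at each step and that the chapter numbers never decrease or skip.

```r
# Toy stand-ins for quran$VerseI and quran$chapter.
verse_index <- 1:6
chapter_num <- c(1, 1, 1, 2, 2, 3)

# A running index should increase by exactly 1 at every step.
index_ok <- all(diff(verse_index) == 1)

# Chapter numbers may repeat but should never decrease, and every step
# should be 0 (same chapter) or 1 (next chapter).
chapter_ok <- !is.unsorted(chapter_num) && all(diff(chapter_num) %in% c(0, 1))

print(c(index_ok = index_ok, chapter_ok = chapter_ok))
```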

So, based on the mathematical and visual tests, we conclude that the text of the Quran analysed in this book, which was downloaded from tanzil.net, has no apparent error in its verse and chapter indices and is ready for further analysis.

These sanity checks are based purely on blind mathematical computations over the text of the Quran. They therefore also give, blindly, the number of chapters as 114 and the number of verses as 6236. In other words, the checks provide reproducible tests for counting the chapters and verses of the Quran, and they confirm that the number of chapters is 114, denoted by ‘c’ in the R programming environment for the rest of the book, and that the number of verses is 6236, denoted by ‘v’. If we include the unnumbered Basmala verses as well, we get an optional total number of verses that covers all the numbered verses and all the unnumbered verses (112 Basmalas) together. It is equal to 6236 + 112 = 6348, and we denote it by ‘V’ for future reference. We will demonstrate how those numbers are part of the 19-based coding system in Chapter 5.

``````c <-  114
v <- nrow(quran) #6236
V <- v+112       #6348``````
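The figure of 112 follows from the structure described at the start of this chapter. As a small worked check (variable names are illustrative, and the values are the ones stated in the text):

```r
n_chapters <- 114

# Chapter 1 opens with a numbered Basmala and Chapter 9 has none in front
# of it, so two chapters contribute no unnumbered Basmala.
n_unnumbered <- n_chapters - 2

n_numbered <- 6236
n_total    <- n_numbered + n_unnumbered

print(c(unnumbered = n_unnumbered, total = n_total))
```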

## 3.4 Numbered and Unnumbered Verses of Quran

Because of its importance for the analysis, I will treat the numbered verses alone, and the numbered and unnumbered verses together, as two separate types of the text of the Quran. The Quran has an unusual structure compared with the books we are used to, and in that sense it is also unique. There are 114 chapters, numbered from 1 to 114 and deliberately ordered. The verses are numbered too, from 1 to the last verse of each chapter. For example, the first and most famous chapter, al-Fatiha, has chapter number 1 and 7 verses, numbered from 1 to 7. There is no concept of paragraphs as in ordinary books; instead, each sentence or group of sentences corresponds to a numbered verse. This is very useful when we refer to a specific verse, since two numbers identify it precisely. For example, 19:38 refers to chapter 19, verse 38. Some verses are long and some are very short.

However, there is another interesting structural feature in the text of the Quran. There is a special verse, the Basmala (بسم الله الرحمن الرحيم), which is the first verse of the first chapter, namely 1:1 in its formal chapter and verse numbers. The translation of the Basmala is “In the name of God, the merciful, the compassionate”. In the Quran, this special verse is written before every chapter except Chapter 9, and Muslims recite it before they start reciting any verse of the Quran. In a sense, it is like a key. This makes the Quran a book consisting of numbered and unnumbered verses. Therefore, in the text analysis of the Quran, I will consider two categories that represent its two fundamental structures. The first represents the whole Quran, including numbered and unnumbered verses. The second represents only the numbered verses, that is, without the repeated unnumbered Basmalas in front of the chapters. Since both the numbered and the unnumbered verses are part of the Quran, they might have a role in its 19-based coding system. I discuss this and show some evidence for it in Chapter 5. Below, I will generate a second object that keeps only the numbered verses of the Quran for further analysis. Let’s first look at the first verses of some of the chapters in the main object, which keeps the whole Quran.

``print(quran$text[quran$verse<=2 & quran$chapter==1])``
``## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"``
``quran$text[quran$verse<=2 & quran$chapter==2]``
``## [1] "بسم الله الرحمن الرحيم الم"        "ذلك الكتاب لا ريب فيه هدى للمتقين"``
``quran$text[quran$verse<=2 & quran$chapter==3]``
``## [1] "بسم الله الرحمن الرحيم الم"     "الله لا إله إلا هو الحي القيوم"``

As we see, tanzil.net has included each Basmala inside the first verse of its chapter, unlike the printed text of the Quran that Muslims have had from the beginning until now. tanzil.net has probably done so for computational reasons, to simplify the organization of the text. I analysed both structure types of the text and showed computationally (in Chapter 5) that the numerical codings also support the structure of the authentic printed copy as it is. So, in the following, I generate the second main table by separating out those unnumbered Basmalas, which gives a table that keeps only the numbered verses of the Quran.

``````require(data.table, quietly = TRUE)
quran <- data.table(quran)
nQuran <- quran
# Remove a leading Basmala (followed by a space) from the start of each verse
nQuran$text <- gsub("^بسم الله الرحمن الرحيم ","",quran$text)
nQuran$text[nQuran$verse<=2 & nQuran$chapter==1]``````
``## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"``
``nQuran$text[nQuran$verse<=2 & nQuran$chapter==2]``
``## [1] "الم"                               "ذلك الكتاب لا ريب فيه هدى للمتقين"``
``nQuran$text[nQuran$verse<=2 & nQuran$chapter==3]``
``## [1] "الم"                            "الله لا إله إلا هو الحي القيوم"``
``Basmala <- quran$text[1] #keep the Basmala verse (1:1) for future reference``

As we see, in this second object only the first chapter has the Basmala as its first, numbered verse; the other chapters do not have it, because this table contains only the numbered verses. I also assigned the special Basmala verse to the R object “Basmala” for future reference in this book. So the R object nQuran represents only the numbered verses of the Quran. The head of this table is shown below; a combined chapter:verse column for easy referencing is added to the data tables presented later in this chapter.

``````require(DT, quietly = TRUE)
datatable(head(nQuran), rownames = FALSE,
caption = 'Table head of the numbered verses of Quran.')``````
Table 3.1: Table head of the numbered verses of Quran.

| VerseI | chapter | verse | text |
|---|---|---|---|
| 1 | 1 | 1 | بسم الله الرحمن الرحيم |
| 2 | 1 | 2 | الحمد لله رب العالمين |
| 3 | 1 | 3 | الرحمن الرحيم |
| 4 | 1 | 4 | مالك يوم الدين |
| 5 | 1 | 5 | إياك نعبد وإياك نستعين |
| 6 | 1 | 6 | اهدنا الصراط المستقيم |
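As a sanity check on the separation step that produced nQuran, it can help to count how many verses actually carried the Basmala prefix. The sketch below tries the idea on a toy table built from the first verses of chapters 1 and 2 shown earlier; the object names `toy`, `prefix`, `has_prefix` and `stripped` are introduced here for illustration.

```r
basmala <- "بسم الله الرحمن الرحيم"

# Toy table built from the first verses shown earlier in this section.
toy <- data.frame(
  chapter = c(1, 1, 2, 2),
  verse   = c(1, 2, 1, 2),
  text    = c(basmala,
              "الحمد لله رب العالمين",
              paste(basmala, "الم"),
              "ذلك الكتاب لا ريب فيه هدى للمتقين"),
  stringsAsFactors = FALSE
)

# Rows that start with the Basmala followed by a space carry the prefix.
# Verse 1:1 is the Basmala itself (no trailing space), so it is not flagged.
prefix     <- paste0(basmala, " ")
has_prefix <- startsWith(toy$text, prefix)

# Strip the prefix with an anchored pattern, as in the main code.
stripped <- sub(paste0("^", prefix), "", toy$text)

print(sum(has_prefix))
print(stripped)
```

On the full tables, `sum(nQuran$text != quran$text)` would be expected to equal 112 if the separation worked as described, one for each unnumbered Basmala.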

For future reference and analysis, I also generate another table from the nQuran table, which holds the chapter indices and the number of verses per chapter, as follows. I assign this table to the R object dfVC for future reference in this book.

``````require(data.table, quietly = TRUE)
versecomb <- c()
for(j in 1:114){
  i <- which(nQuran$chapter==j)                         # rows of chapter j
  versecomb <- c(versecomb, nQuran$verse[i[length(i)]]) # last verse number = number of verses
}

dfVC <- data.table(cbind(c(1:114), versecomb))
colnames(dfVC) <- c("Chapter_index","Verse_sum")
datatable(dfVC,
          caption = 'Table: The chapter indices and corresponding sum of verses of numbered verses.',
          options = list(pageLength = 5,
                         autoWidth = TRUE),
          rownames= FALSE)``````
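The same per-chapter counts can be obtained without an explicit loop. The following base-R sketch uses `tapply()` on a toy verse table (the data frame `toy` is invented for illustration) and adds a consistency check: the per-chapter counts should add up to the total number of rows.

```r
# Toy verse table: chapters with 3, 2 and 4 verses.
toy <- data.frame(chapter = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                  verse   = c(1, 2, 3, 1, 2, 1, 2, 3, 4))

# Largest verse number per chapter = number of verses in that chapter.
verse_count <- tapply(toy$verse, toy$chapter, max)

toy_vc <- data.frame(Chapter_index = as.numeric(names(verse_count)),
                     Verse_sum     = as.vector(verse_count))
print(toy_vc)

# The per-chapter counts should add up to the number of rows.
print(sum(toy_vc$Verse_sum) == nrow(toy))
```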

## 3.5 Some text mining

Let’s prepare a more comprehensive data table that keeps some further information about the numbers of the text of the Quran, using the text mining tools of R. I followed the tutorial in (Sharaf 2019a) and used the ‘tokenizers’ R package (Mullen et al. 2018) to extract each word from the text of the Quran, and from the words to obtain the word and letter counts.

I want this book to remain accessible to all readers, so I will not explain each line of the code chunk below in detail. In short, it computes the numbers of words and letters in both types of the text of the Quran. I will keep using these R objects as needed in the coming chapters.

``````require(tokenizers, quietly = TRUE)
#All words in numbered verses
words <- unlist(tokenize_words(nQuran$text))
w <-  length(words) # should be 77797
cat("Number of words in numbered verses is ", w)``````
``## Number of words in numbered verses is  77797``
``````#number of letters in numbered verses
letters <-  sapply(words, nchar)
l <- sum(letters) #should be 330709
cat("Number of letters in numbered verses is ", l)``````
``## Number of letters in numbered verses is  330709``
``````#All words in numbered and unnumbered verses
Words <- unlist(tokenize_words(quran$text))
W <- length(Words) #should be 78245
cat("Number of words in numbered and unnumbered verses is ", W)``````
``## Number of words in numbered and unnumbered verses is  78245``
``````Letters <-  sapply(Words, nchar)
L <- sum(Letters) #should be 332837
cat("Number of letters in numbered and unnumbered verses is ", L)``````
``## Number of letters in numbered and unnumbered verses is  332837``
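To make the counting concrete, the sketch below counts the words and letters of a single verse, the Basmala, using base R only. Splitting on spaces with `strsplit()` is used here as a simple stand-in for `tokenize_words()`; that substitution, and the names `basmala_words`, `n_words` and `n_letters`, are assumptions made for illustration.

```r
# The Basmala, written with \u escapes so the snippet does not depend on
# the encoding of the script file: بسم الله الرحمن الرحيم
basmala <- paste("\u0628\u0633\u0645",
                 "\u0627\u0644\u0644\u0647",
                 "\u0627\u0644\u0631\u062d\u0645\u0646",
                 "\u0627\u0644\u0631\u062d\u064a\u0645")

# Base-R stand-in for tokenize_words(): split on single spaces.
basmala_words <- strsplit(basmala, " ", fixed = TRUE)[[1]]

# Words = number of pieces; letters = total characters over the pieces.
n_words   <- length(basmala_words)
n_letters <- sum(nchar(basmala_words, type = "chars"))

print(c(words = n_words, letters = n_letters))
```

These are the per-verse values (4 words, 19 letters) that the chapter later assigns to each unnumbered Basmala row.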

We have now obtained the total numbers of words and letters in both types of the text of the Quran. Let’s also compute the number of words and letters per verse and add this information to the table of the numbered verses, as follows.

``````vwords<- c()
vletters <- c()
for(i in 1:nrow(nQuran)){
  tmpw <- unlist(tokenize_words(nQuran$text[i])) # words of verse i
  vwords <- c(vwords,length(tmpw))               # word count of verse i
  vletters <- c(vletters,sum(nchar(tmpw)))       # letter count of verse i
}
# Insert the two new columns before the text column (column 4)
nQuran<- cbind(nQuran[,1:3],vwords, vletters,nQuran[,4])
colnames(nQuran)[6] <- "text" # name only the last column

require(data.table)
require(DT)

# Display copy with a combined chapter:verse reference column
tmpN <- nQuran
tmpN$CV <- paste(nQuran$chapter,nQuran$verse, sep = ":")

datatable(tmpN,
          caption = 'Table of numbered verses of Quran',
          options = list(pageLength = 5,
                         autoWidth = TRUE),
          rownames= FALSE)``````
``````## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html``````
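The per-verse loop can also be written in vectorized form. The following is a base-R sketch on toy strings (placeholders, not verse text); it again uses `strsplit()` on single spaces as an assumed stand-in for `tokenize_words()`.

```r
# Toy verse texts (placeholders, single-space separated).
toy_text <- c("aa bbb", "c", "dd ee ffff")

# Words per verse: number of space-separated pieces.
pieces     <- strsplit(toy_text, " ", fixed = TRUE)
toy_words  <- lengths(pieces)

# Letters per verse: total characters over the pieces of each verse.
toy_letters <- vapply(pieces, function(p) sum(nchar(p)), integer(1))

print(data.frame(toy_text, toy_words, toy_letters))
```

Growing a vector with `c()` inside a loop re-allocates it on every iteration; computing all counts at once avoids that, which can matter for several thousand verses.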

From this data we can also get the number of unique words of the Quran and their numbers of letters, as presented in the next chapter. Before that, let’s also generate a similar table for the whole text of the Quran, including the unnumbered Basmala verses, as follows. As you can see, we store this table in the R object unQuran so that we can use it in the rest of the book.

``````unQuran <- c()
for(i in 1:114){
  if(!(i %in% c(1,9))) {
    # Unnumbered Basmala row: verse number 0, 4 words, 19 letters
    tmp <- data.frame(1,i,0,4,19,as.character(nQuran$text[1]))
    colnames(tmp) <- colnames(nQuran)
    unQuran <- rbind(unQuran,tmp,
                     nQuran[nQuran$chapter==i,])
  }else{
    # Chapters 1 and 9 get no unnumbered Basmala in front
    unQuran <- rbind(unQuran,
                     nQuran[nQuran$chapter==i,])
  }
}
unQuran$VerseI <- c(1:nrow(unQuran)) # renumber the running index

tmpUN <- unQuran
tmpUN$CV <- paste(unQuran$chapter,unQuran$verse, sep = ":")

datatable(tmpUN,
          caption = 'Table of all verses of Quran',
          options = list(pageLength = 5,
                         autoWidth = TRUE),
          rownames= FALSE)``````
``````## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html``````
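The interleaving of unnumbered rows can also be sketched with `split()` and a single `rbind` at the end. The example below runs on a toy table with three chapters; the data, the placeholder text `"B"` and the vector `no_opening` are invented for illustration (in the real text the chapters without an unnumbered opening row would be 1 and 9).

```r
# Toy numbered-verse table: three chapters (texts are placeholders).
toy <- data.frame(chapter = c(1, 1, 2, 2, 3),
                  verse   = c(1, 2, 1, 2, 1),
                  text    = c("B", "a", "b", "c", "d"),
                  stringsAsFactors = FALSE)

# Chapters that do NOT get an unnumbered opening row.
no_opening <- c(1, 3)

# Build each chapter's block, prepending a verse-0 row where needed.
blocks <- lapply(split(toy, toy$chapter), function(ch) {
  if (ch$chapter[1] %in% no_opening) return(ch)
  opening <- data.frame(chapter = ch$chapter[1], verse = 0, text = "B",
                        stringsAsFactors = FALSE)
  rbind(opening, ch)
})

# Stack the blocks once, then renumber the running index.
toy_all <- do.call(rbind, blocks)
rownames(toy_all) <- NULL
toy_all$VerseI <- seq_len(nrow(toy_all))
print(toy_all)
```

A quick consistency check on the real table would be that `nrow(unQuran)` equals 6236 + 112 = 6348, the value denoted by ‘V’ earlier in this chapter.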