Chapter 3 About the Text of Quran

In this chapter, I first show how to fetch the text of Quran into the R programming environment and then compute some of its important descriptive numbers. I will not analyse those numbers here; that is left to the chapters on the evidences, such as Chapter 5.

I downloaded the text of Quran from tanzil.net/download. Arabic speakers do not need punctuation, and thus there is no punctuation in the original text. I am not able to go into more technical detail as I do not know Arabic. After searching the internet, I found this tutorial (textminingthequran.com/tutorial/quran.html) and followed it. As instructed there, we go to tanzil.net/download and select the "Simple Clean" option without any pause marks or other options; that is, deselect the "Include pause marks" and "Include sajdah signs" options. These marks are not part of the simple and pure text but were added later to ease reading. Then select the "Text (with aya numbers)" option and click the download button. This way you get a pure text of Quran on your computer with verse ("aya" in Arabic) numbers. In this file, each line holds the "sura" number (sura is Arabic for chapter), followed by the verse number, followed by the actual text of the verse. You can open it in a text editor to see its structure better. The fields are separated by a bar "|" symbol, and there are also some copyright notes at the end of the text, which were added by tanzil.net. Therefore, as you will see below, we remove them when reading the downloaded text file into the R programming environment.

Everything is made up of smaller components. For the text of a book, the main components are letters, words, verses and chapters. Therefore, I first searched these components for any 19 based coding design pattern. As you will see in the evidences chapters, there is indeed a 19 based coding system over the text of Quran. I do not think I have uncovered the whole system yet, but the evidences I provide are sufficient to witness a beautiful and strong 19 based coding system of the text of Quran. This book contributes them as further proofs to support the belief that Quran is intact and unchanged from the beginning.

In the rest of this chapter, you will see how to fetch the text of Quran into the R programming environment, perform some basic text processing and obtain the important descriptive numbers of Quran.

An important point to keep in mind is that the structure of the text of Quran is unique and unlike any other text we might come across. All the verses are numbered, but there are also 112 unnumbered, repeated Basmala verses placed in front of 112 of the 114 chapters. The first verse of the first chapter is itself the numbered Basmala verse. Moreover, Chapter 9 does not have any Basmala, contrary to the other 113 chapters, but, interestingly enough, Chapter 27 has a Basmala inside the chapter, within the context. So, there appears to be a non-standard structure and a deliberate organization. This leaves the question of whether to report the total descriptive numbers (e.g. verses, words and letters) with respect to the numbered verses only or with respect to the numbered and unnumbered Basmala verses together. I analysed both and witnessed that both types of descriptive numbers are valid because they are designed together, as you will see in the evidences. So, I will always refer to both types of the text of Quran: first, the numbered verses only and, second, the numbered and unnumbered verses together.

Before going any further, one small point I want to mention about the text of Quran is the punctuation. Arabic speakers do not need it, and thus it is not part of the original text; it was added later as a helper for non-Arabic speakers, such as Turkish speaking people like me. I would not normally even mention this point, but when I had a discussion with one of my non-Muslim friends, he made an argument based on the punctuation and claimed that the text of Quran has been changed like other holy books. I was quite surprised that even such a simple fact was being used as an argument; as I found out when I later searched about it, he got this misinformation from the internet. In short, punctuation is not part of the text of Quran and is not included in the text processing analysis. Anyone who has doubts about this should feel free to speak to an Arabic expert.

3.1 About the manuscript I obtained from the Tanzil server

I added this Chapter 3.1 when I upgraded this book around a year after its first publication in 2019. I deemed it necessary following the feedback I received, as people mostly focus on and ask about the text I use and why I did not use one of the classical printed ones.

First of all, I did not cherry pick this manuscript. As I mentioned before, I found and followed this tutorial (textminingthequran.com/tutorial/quran.html), whose author used this text of Quran for his own text analysis. However, according to my investigations, this manuscript turned out to be the most accurate one for two reasons. First, I have many evidences that suggest a clear 19 based design that incorporates all the letters of this manuscript in their current verse locations. Second, it is a known fact that even the most respected authentic Uthmanic mushaf contains some writing errors, which have been conveyed to today's modern texts as is, without being deliberately corrected. The reasons are, firstly, that scholars considered that they already knew exactly how each word is pronounced, even though there are some slight scribal-type errors in the writing of some words in the early manuscripts. Secondly, they preferred not to touch or change any historically conveyed material, as they have deep respect for the early Muslims who first conveyed the message. Since the primary medium of transmission of Quran has always been oral and writing has always been a secondary medium, they did not consider the slight errors in the writing of some letters a big deal. I direct readers interested in this topic to my interview with Dr. Shehzad Saleem for further details (youtube.com/watch?v=1Y64epJ9_h0&t=2804s). Considering those known but ignored minor scribal errors, it turned out that the only version of the Quran text that is, at least reportedly, error free is the one from the Tanzil Project, a collective effort that resulted in an online text version of Quran. It can be easily downloaded by anyone from tanzil.net/download and is used by the most common Quran server, quran.com.

On their web site (http://tanzil.net/docs/tanzil_project) and their FAQ page, they answer some very relevant questions; I quote their answers below as is:

"Tanzil is a Quranic project launched in early 2007 to produce a highly verified Unicode Quran text to be used in Quranic websites and applications. Our mission in the Tanzil project is to produce a standard Unicode Quran text and serve as a reliable source for this standard text on the web."

They also state that they have checked the text against a set of grammatical and recitation rules, which, I think, may be the step that corrected the known errors in the letters of some words.

"Rule-Based Verification: In this step, a program was developed to verify the core Quran text against a set of grammatical and recitation rules..."

"We extensively used the help of a group of experts including Quran specialists and Hafizes in order to assure the correctness and preciseness of the obtained Quran text. Our Quran text is now used in major Quranic websites and projects, with several millions active users per month. Despite this high volume of users, we have received no typo report since the release of the first version of the text in 2008. The accuracy of the text has been also reconfirmed by several projects that are actively using Tanzil Quran text."

" 'How accurate is Tanzil Quran text?' Tanzil Quran text is carefully derived from Medina Mushaf, which is currently the most authentic copy of the holy Quran (narration of Hafs)."

Without being aware of much of the above information at the beginning of my study for this book in early 2019, I had hypothesized that the text of Quran may well have been designed by God down to its last letter, and I discovered evidences that include the information of all the letters, as can be seen in Chapter 7. I present them separately from the main evidences in Chapter 5 and Chapter 6, which do not include letter information but only word information regarding the content. The main evidences are thus more general, as they are applicable to all the common Hafs mushafs, such as the one used in quran.com.

If I had used the classical Uthman manuscript while counting the letters, the counts would not be correct, as there are already known errors in some letters of some of its words. Although people recite Quran correctly, they did not change those letters and just ignored them. If even a single letter is found to be erroneous in the text I used, then most probably all the letter incorporating evidences that I present in Chapter 7 would collapse. So, feel free to reach out to me if you are 100% sure that there is an error in any of the 332837 total letters of the Quran manuscript I used. Nonetheless, make sure to consult an expert on Arabic grammar before coming to a conclusion on this.

Since the letter incorporating evidences that I present in Chapter 7 are not the main evidences of this book, anyone who has doubts about the accuracy of the letters may simply ignore that chapter. It will not change the fact that the text of Quran has been designed down to its words, along with its chapters and verses, as is, when considering the main miraculous 19 based system I present in Chapter 5. However, there currently seems to be no other choice than the text I used, because all the classical texts based on the Uthman manuscript are known to have some errors in some letters of some of their words. There is no point in counting the letters of a manuscript that has known letter errors in the first place. As I quoted above from http://tanzil.net/docs/tanzil_project, they have not received any typo report since the release of the first version of the text in 2008. So, it appears to be the only error free version of the text of Quran as far as letter counts are concerned.

However, those who would like to genuinely consider the letter based evidences of Chapter 7 might well have some questions, judging by the feedback I received after the first publication of this book.

They might ask, for instance, "this text was written in 2008 at the latest, so how can God have designed it?" I answer this question with a counter question: "Can God not know how people were going to write down the online version of the text of Quran that is accessible to all humankind with just one click? Or, can God not manipulate the minds of the people who would write down the text of Quran for the online version as is?" If the answer to any of these questions is "yes", then there is no point in arguing further about the text I used.

Another question that might come to mind is whether the volunteers of tanzil.net deliberately made such a design, considering the letter based evidences of Chapter 7. Well, I know the answer on my side for sure: I discovered all the evidences in this book myself. I cannot know for sure about the other side, the volunteers of tanzil.net, as I do not know them at all. However, in my opinion, I am 100% sure that they did not do such a thing and could not even do such a thing, because, as I quoted above, they wrote the text with respect to Arabic grammar rules and the early Hafs manuscripts. The only way one could claim such an artificial design is by finding a rule in the text that is artificial, not meaningful or redundant. In such a case, one might claim that these artificial or redundant letters were written to make such a design. I discovered the evidences in 2019 and, as they mention in the quotes above, they have not received any typo report since the release of the first version of the text in 2008. So, I see no reason to entertain that extreme, paranoid doubt. I addressed those points anyway in case some very skeptical people wonder about them.

Let's now see how the text is fetched and analyzed in the following sections.

3.2 Fetching Quran's Text into R

In order to process and analyze the text of Quran, we need to fetch it into an R console. For that, I utilized the very useful tutorial in (Sharaf 2019b). As mentioned in that tutorial, I downloaded the text file of Quran from http://tanzil.net/download/ with the "Simple Clean" option, without any pause marks or other options, and with the "Text (with aya numbers)" option selected. I then saved the downloaded text file in a folder named "data" under the current working directory.

We can run this to read the Quran text file into the R programming environment:

options(tinytex.verbose = TRUE)
# read the "|"-separated file; the three fields are chapter, verse and text
tenzil <- read.csv("data/quran-simple-clean.txt", header = FALSE, stringsAsFactors = FALSE, encoding = "UTF-8", sep = "|")

Let's see the head and tail of this text file to see if there is anything added to describe the text.

head(tenzil)
##   V1 V2                     V3
## 1  1  1 بسم الله الرحمن الرحيم
## 2  1  2  الحمد لله رب العالمين
## 3  1  3          الرحمن الرحيم
## 4  1  4         مالك يوم الدين
## 5  1  5 إياك نعبد وإياك نستعين
## 6  1  6  اهدنا الصراط المستقيم
tail(tenzil)
##                                                                         V1 V2
## 6259 #    of the text, and shall be reproduced appropriately in all files  NA
## 6260     #    derived from or containing substantial portion of this text. NA
## 6261                                                                     # NA
## 6262                #  Please check updates at: http://tanzil.net/updates/ NA
## 6263                                                                    #  NA
## 6264 #==================================================================== NA
##      V3
## 6259   
## 6260   
## 6261   
## 6262   
## 6263   
## 6264

The head of the data looks clean: it is a table with three columns (a data frame in R), and each row contains one verse, in order. The first column contains the chapter numbers ('sura' numbers in Arabic terms). The second column has the verse numbers ('aya' numbers in Arabic terms). The third column has the verses in Arabic.

However, there is some license related information appended to the tail of the text by tanzil.net. Let's find where to clean it at the tail of the text file.

tenzil[6234:6238,]
##                                                                         V1 V2
## 6234                                                                   114  4
## 6235                                                                   114  5
## 6236                                                                   114  6
## 6237                 # PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK NA
## 6238 #==================================================================== NA
##                            V3
## 6234     من شر الوسواس الخناس
## 6235 الذي يوسوس في صدور الناس
## 6236          من الجنة والناس
## 6237                         
## 6238

As we see, the last verse appears at row index 6236. The last chapter (sura) of Quran is the Chapter Nas (sura-al-nas). Since it is very short, almost all Muslims have memorized it and sometimes recite it in their regular daily prayers. Although I am no expert in Arabic, I can recognize it easily from the Arabic writing, even without the helper punctuation that we non-Arabic speakers need in order to read the Arabic text of Quran with the correct pronunciation. So, even without Arabic knowledge, I can confidently recognize the last verse of Quran in row 6236. As instructed in the tutorial, I therefore remove everything after the last verse to clean out the additional license related information added by tanzil.net. I keep this table as an R object named quran, to remember that this object keeps all the words and letters of the Quran that we hold in our hands.

quran <- tenzil[1:6236,]
tail(quran)
##       V1 V2                                       V3
## 6231 114  1 بسم الله الرحمن الرحيم قل أعوذ برب الناس
## 6232 114  2                                ملك الناس
## 6233 114  3                                إله الناس
## 6234 114  4                     من شر الوسواس الخناس
## 6235 114  5                 الذي يوسوس في صدور الناس
## 6236 114  6                          من الجنة والناس
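
The cut-off row 6236 was found above by inspecting the tail of the table. As a rough cross-check, the same boundary could also be located programmatically, since the rows of the license block have no verse number and their second column is therefore NA. The sketch below illustrates the idea on a made-up five-line sample in the same sura|aya|text layout; the object names (sample_txt, toy, toy_clean) and the sample text are assumptions for illustration only and are not taken from the downloaded file.

```r
# Made-up sample: three verse lines followed by two comment lines,
# mimicking the layout of the downloaded file (not real Quran text).
sample_txt <- paste(
  "1|1|aaa bbb",
  "1|2|ccc",
  "2|1|ddd eee fff",
  "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK",
  "#====================",
  sep = "\n"
)

# Same reading options as for the real file, but from an in-memory string
toy <- read.csv(text = sample_txt, header = FALSE, sep = "|",
                stringsAsFactors = FALSE)

# Verse rows are those that carry a verse number in the second column
is_verse <- !is.na(toy$V2)
last_verse_row <- max(which(is_verse))

# Keep everything up to and including the last verse row
toy_clean <- toy[seq_len(last_verse_row), ]
print(last_verse_row)
```

Applied to the real table, the same idea would be max(which(!is.na(tenzil$V2))), which can be compared with the row index found by eye.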

In this table, the first column keeps the chapter number, the second column keeps the verse number and the third column keeps the text of each verse, so that each row holds the information about one verse, in order. Let's first give the columns proper English names and look at the table again.

colnames(quran) = c("chapter", "verse", "text")
head(quran)
##   chapter verse                   text
## 1       1     1 بسم الله الرحمن الرحيم
## 2       1     2  الحمد لله رب العالمين
## 3       1     3          الرحمن الرحيم
## 4       1     4         مالك يوم الدين
## 5       1     5 إياك نعبد وإياك نستعين
## 6       1     6  اهدنا الصراط المستقيم

Let's also add the row names of the table into the table as a column. The row names should keep the verse order from beginning to end, which we can test later on. Since Quran has an order and the order is important, we give independent verse index numbers, from the first verse to the last, to be able to access the verses correctly later. I name the column of these independent verse indices "VerseI". It is important to remember that this column is assigned by us and is different from the per-chapter verse numbers that we use when we quote a specific verse of Quran (as in the formal notation 74:30).

quran <- cbind(as.numeric(rownames(quran)), quran)
colnames(quran)[1] = "VerseI"
quran$VerseI <- as.numeric(quran$VerseI)
quran$verse <- as.numeric(quran$verse)
quran$chapter <- as.numeric(quran$chapter)
head(quran)
##   VerseI chapter verse                   text
## 1      1       1     1 بسم الله الرحمن الرحيم
## 2      2       1     2  الحمد لله رب العالمين
## 3      3       1     3          الرحمن الرحيم
## 4      4       1     4         مالك يوم الدين
## 5      5       1     5 إياك نعبد وإياك نستعين
## 6      6       1     6  اهدنا الصراط المستقيم
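
To make the difference between the running index VerseI and the chapter:verse notation concrete, here is a minimal sketch on a made-up five-row table. The helper verse_index is hypothetical (it is not part of the tutorial or of any package); it simply looks up the running index of a verse given its chapter and verse numbers.

```r
# Toy table with the same column names as the quran object (made-up sizes)
toy <- data.frame(chapter = c(1, 1, 1, 2, 2),
                  verse   = c(1, 2, 3, 1, 2))
toy$VerseI <- seq_len(nrow(toy))   # running index over all rows

# Hypothetical helper: running index of the verse written as "ch:v"
verse_index <- function(df, ch, v) {
  df$VerseI[df$chapter == ch & df$verse == v]
}

idx <- verse_index(toy, 2, 1)      # verse 2:1 is the 4th row of the toy table
print(idx)
```

With the real table, verse_index(quran, 74, 30) would return the position of verse 74:30 counted from the beginning of the text.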

3.3 The categories of the main descriptive numbers

It is helpful to define the categories of the main descriptive numbers of the text of Quran, because I will keep mentioning these categories while defining the rules of the coding system in Chapter 4 and later while presenting the evidences.

There are four main descriptive numbers of Quran: the numbers of chapters, verses, words and letters. Because of the unique structure and organization of the Quran that we have in our hands today, every category except chapters has two versions: one for the numbered verses only, and one for the numbered and unnumbered verses together. Since the number of chapters has only the numbered version, that category has a single number. Therefore, there are 3 x 2 + 1 = 7 main descriptive numbers of the text of Quran. In this chapter, I compute them blindly by computer programming alone and provide the code so that you can also reproduce and test those numbers from the text of Quran.

3.4 Number of chapters and verses, and sanity checks on the text

Let's first check whether the indices of the verses are in the correct order. I will perform two tests here: first by the sum of all the index numbers, and second by a simple plot, to see whether we observe what we expect. We know that the sum of the integers from 1 to n is n(n+1)/2 (Gauss's formula). Therefore, the sum of the indices of the verses must be 6236 x 6237 / 2 = 19446966. On your computer, make sure that the numeric precision is sufficient for such large numbers. Alternatively, you can use a big number calculator such as this one (“Good Calculators” 2019).
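
The expected value can also be derived in R itself rather than typed in by hand. The short sketch below evaluates Gauss's formula for n = 6236 and compares it with a brute-force sum; the variable names are only illustrative. R stores such numbers as double precision values, which represent integers exactly up to 2^53, so precision is not a concern at this size.

```r
n <- 6236

# Closed form for 1 + 2 + ... + n
expected <- n * (n + 1) / 2

# Brute-force sum of the same sequence, for comparison
brute <- sum(seq_len(n))

print(expected)
print(expected == brute)
```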

Now let's write some code to sum the index values in the table and check that the result is 19446966, as it must be.

cat("the sum of verse index column VerseI is ",sum(quran$VerseI))
## the sum of verse index column VerseI is  19446966
if(sum(quran$VerseI) == 19446966) 
  print("The sum of the indices of verses are correct and passed this test.")
## [1] "The sum of the indices of verses are correct and passed this test."

The sum test passed for the indices of the verses. Still, some indices in the middle, or in any other part, might be duplicated or larger than the maximum, 6236. Let's look at the minimum, which must be 1, the maximum, which must be 6236, and the median, which must be 3118.5 (the median of an even count of values), and, most importantly, the number of unique indices, which must be 6236 too.

print(paste("Minimum, Maximum, Median of VerseI is ",
    min(quran$VerseI),",", max(quran$VerseI),",",median(quran$VerseI)))
## [1] "Minimum, Maximum, Median of VerseI is  1 , 6236 , 3118.5"
print(paste("Number of unique values of VerseI is ",length(unique(quran$VerseI))))
## [1] "Number of unique values of VerseI is  6236"

As we see, all are as expected. Let's plot the indices and see whether they increase monotonically from 1 to 6236.

plot(quran$VerseI)

The plot is exactly as expected: the indices increase monotonically from 1 to 6236. Let's now check, in a similar way, whether the chapter numbers are in order, and then observe the maximum number.

plot(quran$chapter)

print("Unique chapter numbers regarding the order of text: ")
## [1] "Unique chapter numbers regarding the order of text: "
print(unique(quran$chapter))
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114

The plot is exactly as expected: it increases overall but has horizontal segments at some points because of the long chapters.

So, based on the mathematical and visual tests, we conclude that the text of Quran analysed in this book, which was downloaded from tanzil.net, has no apparent error regarding verse and chapter indices and is ready for further analysis.
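
The visual checks can be complemented by exact tests. The sketch below shows one possible way, on made-up vectors standing in for quran$VerseI and quran$chapter: the running index should start at 1 and grow by exactly 1 at every step, and the chapter column should never decrease and never skip a chapter. The vector names are assumptions for illustration.

```r
# Made-up stand-ins for quran$VerseI and quran$chapter
toy_index   <- 1:8
toy_chapter <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Index starts at 1 and increases by exactly 1 each row
index_ok <- toy_index[1] == 1 && all(diff(toy_index) == 1)

# Chapter numbers never decrease and move forward one chapter at a time
chapter_ok <- !is.unsorted(toy_chapter) && all(diff(toy_chapter) %in% c(0, 1))

# Number of chapters, counted from the data
n_chapters <- length(unique(toy_chapter))

# A deliberately broken index (duplicate 2, missing 3) fails the same test
bad_index <- c(1, 2, 2, 4)
bad_ok <- bad_index[1] == 1 && all(diff(bad_index) == 1)

print(c(index_ok, chapter_ok, bad_ok))
```

Unlike the sum test, the step-by-step test cannot be fooled by two errors that happen to cancel each other out.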

These sanity checks are based purely on blind mathematical computations on the text of Quran. They therefore also provide a reproducible way to count the chapters and verses of Quran, and they confirm that the number of chapters is 114, denoted by 'c' for the rest of the book, and that the number of verses is 6236, denoted by 'v' for the rest of the book in the R programming environment. If we include the unnumbered Basmala verses as well, we get an optional total number of verses that counts all the numbered verses and all the unnumbered verses (112 Basmalas) together, which is equal to 6236+112=6348; we denote this number by 'V' for future reference. We will demonstrate how those numbers are part of the 19 based coding system in Chapter 5.

c <-  114
v <- nrow(quran) #6236
V <- v+112       #6348

3.5 Numbered and Unnumbered Verses of Quran

Because of its importance for the analysis, I address separately the two types of the text of Quran: the numbered verses alone, and the numbered and unnumbered verses together. Quran has a structure that differs from what we are used to seeing in ordinary books, and in that sense it also stands as unique. There are 114 chapters, all numbered from 1 to 114 and ordered deliberately as they are in Quran. The verses are also numbered, from 1 to the last verse of each chapter. For example, the first chapter, also the most famous one, al-Fatiha, has chapter number 1 and 7 verses, numbered from 1 to 7. There is no concept of paragraphs as we are used to reading in our books; instead, each sentence or group of sentences corresponds to a verse and is numbered. This is very useful when we refer to a specific verse in Quran, as we can quote two numbers to refer to it precisely. As an example, 19:38 refers to verse 38 of chapter 19. Some verses are long and some are very short.

However, there is another interesting structural feature in the text of Quran. There is a special verse, Basmala (بسم الله الرحمن الرحيم), which is the first verse of the first chapter, namely 1:1 in the formal chapter and verse notation. The translation of Basmala is "In the name of God, the merciful, the compassionate". In Quran, this special verse is written before all the chapters except Chapter 9, and Muslims recite it before starting to recite any verse of Quran. In a sense, it is like a key. This makes Quran a book consisting of numbered and unnumbered verses. Therefore, in the text analysis of Quran, I consider two categories that represent its two fundamental structures. The first represents the whole Quran, including numbered and unnumbered verses. The second represents only the numbered verses, that is, without the repeated unnumbered Basmalas in front of the chapters. Since both the numbered and unnumbered verses are part of Quran, they might have a role in its 19 based coding system. I discuss this and show some evidences for it in Chapter 5. Now, I will generate a second object that keeps only the numbered verses of Quran for further analysis. Let's first look at the first verses of some of the chapters in the main object, which keeps the whole Quran.

print(quran$text[quran$verse<=2 & quran$chapter==1])
## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"
quran$text[quran$verse<=2 & quran$chapter==2]
## [1] "بسم الله الرحمن الرحيم الم"        "ذلك الكتاب لا ريب فيه هدى للمتقين"
quran$text[quran$verse<=2 & quran$chapter==3]
## [1] "بسم الله الرحمن الرحيم الم"     "الله لا إله إلا هو الحي القيوم"

As we see, tanzil.net has included the Basmalas inside the first verse of each chapter, contrary to the written hard copy text of Quran that Muslims have had from the beginning until now. tanzil.net has probably done so for computational reasons, to simplify the organization of the text. I analysed both of these structure types of the text and showed computationally (in Chapter 5) that the numerical codings also support the structure of the authentic printed copy as is. So, in the following, I generate the second main table by separating out those unnumbered Basmalas, to get the table that keeps only the numbered verses of Quran.

require(data.table, quietly = T)
## 
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
## 
##     .N
quran <- data.table(quran)
nQuran <- quran
nQuran$text <- gsub("^بسم الله الرحمن الرحيم ","",quran$text)
nQuran$text[nQuran$verse<=2 & nQuran$chapter==1]
## [1] "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين"
nQuran$text[nQuran$verse<=2 & nQuran$chapter==2]
## [1] "الم"                               "ذلك الكتاب لا ريب فيه هدى للمتقين"
nQuran$text[nQuran$verse<=2 & nQuran$chapter==3]
## [1] "الم"                            "الله لا إله إلا هو الحي القيوم"
Basmala <-quran$text[1] #keep for future reference

As we see, in this second object only the first chapter has the Basmala as its first, numbered verse; the other chapters do not have it, as this table contains only the numbered verses. I also assigned the special verse Basmala to the R object "Basmala" for future reference in this book. So this R object, nQuran, represents only the numbered verses of Quran. The head of the table is presented below; a column that keeps the chapter and verse numbers together, for easy referencing, is added to the data table later in this chapter.

require(DT, quietly = T)
knitr::kable(head(nQuran), booktabs = TRUE,
  caption = 'Table head of the numbered verses of Quran.')
Table 3.1: Table head of the numbered verses of Quran.

| VerseI | chapter | verse | text |
|---|---|---|---|
| 1 | 1 | 1 | بسم الله الرحمن الرحيم |
| 2 | 1 | 2 | الحمد لله رب العالمين |
| 3 | 1 | 3 | الرحمن الرحيم |
| 4 | 1 | 4 | مالك يوم الدين |
| 5 | 1 | 5 | إياك نعبد وإياك نستعين |
| 6 | 1 | 6 | اهدنا الصراط المستقيم |
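
The separation of the unnumbered Basmalas can be cross-checked by counting how many verse texts start with the Basmala followed by a space. The sketch below does this on four verse texts copied from the outputs above; the object names (toy_text, prefix, has_prefix, stripped) are assumptions for illustration. On the real quran table one would expect this count to be 112, the number of unnumbered Basmalas stated earlier.

```r
basmala <- "بسم الله الرحمن الرحيم"

# Verses 1:1, 1:2, 2:1 and 2:2 as stored in the downloaded text
toy_text <- c(
  basmala,
  "الحمد لله رب العالمين",
  paste(basmala, "الم"),
  "ذلك الكتاب لا ريب فيه هدى للمتقين"
)

# A verse carries an unnumbered Basmala if it starts with "Basmala + space"
prefix <- paste0(basmala, " ")
has_prefix <- startsWith(toy_text, prefix)
n_unnumbered <- sum(has_prefix)

# Drop the prefix only where it is present; 1:1 itself stays untouched
stripped <- ifelse(has_prefix,
                   substring(toy_text, nchar(prefix) + 1),
                   toy_text)
print(n_unnumbered)
```

Because the prefix includes the trailing space, the numbered Basmala of verse 1:1, which has nothing after it, is not counted and not removed.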

For future reference and analysis, I will also generate another table from this table, which holds the chapter indices and the number of verses per chapter as follows. I will assign this table into the R object dfVC for future reference in this book.

require(data.table, quietly = T)
require(DT)
versecomb <- c()
for(j in 1:114){
  i <- which(nQuran$chapter==j)
  versecomb <- c(versecomb, nQuran$verse[i[length(i)] ])
}

dfVC <- data.table(cbind(c(1:114), versecomb))
colnames(dfVC) <- c("Chapter_index","Verse_sum")
datatable(dfVC,
          caption = 'Table: The chapter indices and corresponding sum of verses of numbered verses.',
          options = list(pageLength = 5, 
                              autoWidth = TRUE),
          rownames= FALSE)
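
The loop above takes the last verse number of each chapter. As an alternative sketch, not the code used in this book, the same per-chapter numbers can be obtained in one step with base R's tapply, shown here on a made-up table with three chapters. A useful side check is that the per-chapter numbers must add up to the number of rows.

```r
# Made-up table: chapter 1 has 3 verses, chapter 2 has 2, chapter 3 has 1
toy <- data.frame(chapter = c(1, 1, 1, 2, 2, 3),
                  verse   = c(1, 2, 3, 1, 2, 1))

# Largest verse number within each chapter = number of verses in it
verses_per_chapter <- tapply(toy$verse, toy$chapter, max)

toy_VC <- data.frame(Chapter_index = as.integer(names(verses_per_chapter)),
                     Verse_sum     = as.vector(verses_per_chapter))

# The per-chapter counts must add up to the total number of rows
total_ok <- sum(toy_VC$Verse_sum) == nrow(toy)
print(toy_VC)
```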

3.6 Some text mining

Let's prepare a more comprehensive data table that keeps some further information about the numbers of the text of Quran, using the text mining tools of R. I utilized the tutorial in (Sharaf 2019a) and the 'tokenizers' text mining R package (Mullen et al. 2018) to get each word of the text of Quran and its frequency, and even the frequencies of its letters.

I prefer to keep this book accessible to all readers and thus will not explain each line of the code chunk below. In short, it computes the numbers of words and letters in both types of the text of Quran. I will keep using these R objects in the coming chapters as needed.

require(tokenizers, quietly = T)
#All words in numbered verses
words <- unlist(tokenize_words(nQuran$text))
w <-  length(words) # should be 77797
cat("Number of words in numbered verses is ", w)
## Number of words in numbered verses is  77797
#number of letters in numbered verses
letters <-  sapply(words, nchar)
l <- sum(letters) #should be 330709
cat("Number of letters in numbered verses is ", l)
## Number of letters in numbered verses is  330709
#All words in numbered and unnumbered verses
Words <- unlist(tokenize_words(quran$text))
W <- length(Words) #should be 78245
cat("Number of words in numbered and unnumbered verses is ", W)
## Number of words in numbered and unnumbered verses is  78245
Letters <-  sapply(Words, nchar)
L <- sum(Letters) #should be 332837
cat("Number of letters in numbered and unnumbered verses is ", L)
## Number of letters in numbered and unnumbered verses is  332837
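
The four totals above can be checked against each other. The two types of the text differ only by the 112 unnumbered Basmalas, so the differences W - w and L - l should equal 112 times the number of words and letters of a single Basmala. The sketch below counts these with base R only, assuming that words are separated by single spaces (the totals above were obtained with the tokenizers package instead); the names b_words and b_letters are illustrative.

```r
basmala <- "بسم الله الرحمن الرحيم"

# Split on single spaces and count words and letters of one Basmala
toks <- strsplit(basmala, " ", fixed = TRUE)[[1]]
b_words   <- length(toks)
b_letters <- sum(nchar(toks))

# Totals reported above for the two types of the text
w <- 77797;  W <- 78245
l <- 330709; L <- 332837
n_unnumbered <- 112

words_ok   <- (W - w) == n_unnumbered * b_words
letters_ok <- (L - l) == n_unnumbered * b_letters
print(c(b_words, b_letters, words_ok, letters_ok))
```

The 4 words and 19 letters counted here are the same values that are entered for the unnumbered Basmala rows later in this chapter.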

We have now obtained the total numbers of words and letters in both types of the text of Quran. Let's also compute the numbers of words and letters per verse and add this information to the table of the numbered verses of Quran as follows.

vwords<- c()
vletters <- c()
for(i in 1:nrow(nQuran)){
  tmpw <- unlist(tokenize_words(nQuran$text[i]))
  vwords <- c(vwords,length(tmpw))
  vletters <- c(vletters,sum(nchar(tmpw)))
}
nQuran<- cbind(nQuran[,1:3],vwords, vletters,nQuran[,4])
colnames(nQuran)[6] <- "text"

require(data.table)
require(DT)

tmpN <- nQuran
tmpN$CV <- paste(nQuran$chapter,nQuran$verse, sep = ":")


datatable(tmpN,
          caption = 'Table of numbered verses of Quran',
          options = list(pageLength = 5, 
                              autoWidth = TRUE),
          rownames= FALSE)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
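
The per-verse loop above grows two vectors one element at a time. For readers who prefer a vectorised form, the sketch below shows one possible equivalent in base R on made-up strings. It assumes that splitting on single spaces is an acceptable stand-in for tokenize_words, which may not hold for text with punctuation; the names ending in _toy are illustrative.

```r
# Made-up verse texts (Latin letters only, to keep the counts easy to follow)
toy_text <- c("aa bbb", "c", "dd ee ffff")

# One character vector of words per verse
tokens <- strsplit(toy_text, " ", fixed = TRUE)

# Words per verse and letters per verse, without an explicit loop
vwords_toy   <- lengths(tokens)
vletters_toy <- vapply(tokens, function(tk) sum(nchar(tk)), integer(1))

print(vwords_toy)
print(vletters_toy)
```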

From this data we can also get the unique words of Quran and their numbers of letters, as presented in the next chapter. But before that, let's also generate a similar table for the whole Quran text, including the unnumbered Basmala verses, as follows. As you will see, we store this table in the unQuran R object to be able to use it in the rest of the book.

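unQuran <- c()
for(i in 1:114){
  if(!(i %in% c(1,9))) {
    # unnumbered Basmala row (verse 0, 4 words, 19 letters) placed before the chapter
    tmp <- data.frame(1,i,0,4,19,as.character(nQuran$text[1]))
    colnames(tmp) <- colnames(nQuran)
    unQuran <- rbind(unQuran, tmp, nQuran[nQuran$chapter==i,])
  } else {
    # chapters 1 and 9 get no unnumbered Basmala row
    unQuran <- rbind(unQuran, nQuran[nQuran$chapter==i,])
  }
}
# renumber the running index over all rows, including the added ones
unQuran$VerseI <- c(1:nrow(unQuran))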

tmpUN <- unQuran
tmpUN$CV <- paste(unQuran$chapter,unQuran$verse, sep = ":")


datatable(tmpUN,
          caption = 'Table of all verses of Quran',
          options = list(pageLength = 5, 
                              autoWidth = TRUE),
          rownames= FALSE)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
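
The construction of unQuran can be illustrated, and its result checked, on a small scale. The sketch below rebuilds the same idea on a made-up table with three chapters, where only chapter 1 is exempt from the extra row (in the real text the exempt chapters are 1 and 9). The helper add_prefix_row and all object names are hypothetical. For the real table, the analogous checks would be that nrow(unQuran) equals 6236 + 112 = 6348 and that exactly 112 rows have verse number 0.

```r
# Made-up table: "B" stands in for the Basmala text
toy <- data.frame(chapter = c(1, 1, 2, 2, 3),
                  verse   = c(1, 2, 1, 2, 1),
                  text    = c("B", "x", "y", "z", "q"),
                  stringsAsFactors = FALSE)
no_prefix <- c(1)   # chapters that get no extra row in this toy example

# Hypothetical helper: put a verse-0 row in front of one chapter's rows
add_prefix_row <- function(ch) {
  if (ch$chapter[1] %in% no_prefix) return(ch)
  extra <- data.frame(chapter = ch$chapter[1], verse = 0, text = "B",
                      stringsAsFactors = FALSE)
  rbind(extra, ch)
}

# Process chapter by chapter and stack the pieces in chapter order
toy_all <- do.call(rbind, lapply(split(toy, toy$chapter), add_prefix_row))
toy_all$index <- seq_len(nrow(toy_all))

n_added <- sum(toy_all$verse == 0)
print(toy_all)
```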