Chapter 17 Analyzing Literature Abstracts with Text Mining

library(tidyverse)
library(here)
library(fs)
library(purrr)
library(tidytext)
library(widyr)
library(tidygraph)
library(ggraph)

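This chapter uses text mining in R to analyze literature abstracts. Specifically, we use the R packages tidyverse, tidytext, widyr, tidygraph, and ggraph to analyze the abstract text of our university's publications indexed in Web of Science.⁶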

17.1 Data import

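To keep the study reproducible, here are the steps I used to obtain the data:

- Open https://www.webofknowledge.com/ and enter the Core Collection
- Enter the full name of the university, for example Sichuan Normal University
- Choose the "Organization-Enhanced" search
- Select the time span: 2009-2018
- Select "SCI/SSCI/A&HCI"
- Click Search
- Refine by document type: "Article + Review"
- At most 50 records are displayed per page, and at most 500 can be downloaded at a time
- Choose "Other File Formats" + "Full Record and Cited References" + "win UTF"
- Download and save each batch in turn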

In total we obtained 1,988 bibliographic records.

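read_plus <- function(flnm) {
  # Read one tab-delimited Web of Science export; quote = "" keeps quotation
  # marks inside abstracts from being treated as field delimiters
  read_tsv(flnm, quote = "", col_names = TRUE) %>% 
    # or:
    # read_delim(flnm, delim = "\t", quote = "", col_names = TRUE) %>% 
    # select(AU, AF, SO, DE, C1, RP, FU, CR, TC, SN, PY, UT) %>% 
    # Keep the abstract (AB), the ISSN (SN), and the accession number (UT)
    select(AB, SN, UT) # %>% 
    # mutate(University = flnm %>%                # add the university name
    #                     str_split("/", 7, simplify = TRUE) %>%
    #                     .[, 6] %>% 
    #                     str_sub(start = 4)
    #        ) 
}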

tbl <- here("data", "newdata") %>% 
  # glob matches file names ending in .txt; recurse also searches subfolders
  dir_ls(glob = "*.txt", recurse = TRUE) %>%  
  map_dfr(read_plus)
tbl

AB <chr>
Four antimony fluoride sulfates named A(2)SO(4)SbF(3) (A = Na+, NH4+, K+, Rb+) have been successfully synthesized using a hydrothermal method by introducing Sb3+ cations with a stereochemically active lone pair in sulfates and subsequently tuning the structure through the second monovalent cations. All of the title compounds are stoichiometrically equivalent materials which share a common structural motif composed of a distorted SO4 tetrahedron and an asymmetric SbF3 polyhedron. However, the macroscopic centricities of these four compounds are significantly influenced by the size and coordination environment of cations; Na2SO4SbF3 crystallizes in centrosymmetric space groups Cmca and (NH4)(2)SO4SbF3 in Pbca, while K2SO4SbF3 and Rb2SO4SbF3 crystallizes in noncentrosymmetric space group P2(1)2(1)2(1). Complete characterization including thermal analyses, infrared and UV-vis spectroscopy, and theoretical calculations is also reported. Powder second harmonic generation measurement for noncentrosymmetric K2SO4SbF3 and Rb2SO4SbF3 indicated that both of them are type I phase-matchable.
A new species of Yunnanilus is described from Tuojiang River, Sichuan, China. The new species, Yunnanilus jiuchiensis, can be distinguished from other species of Yunnanilus by the following combination of characters: processus dentiformis absent; body covered with scarce scales; lateral-line incomplete, as long as half the length of the pectoral-fin length, with 6-11 pores; eye diameter larger than interorbital width; and caudal-peduncle length less than its depth.
Outside soil spray seeding (OSSS) is used widely for road cut revegetation, and the artificial soil used in OSSS can improve slope soil conditions and nutrients, and help promote plant growth and succession. Three different slopes was investigated to evaluate the effectiveness of OSSS for restoration, including a natural slope (NS), a cut slope without any artificial recovery treatment (CSW) and a cut slope treated with OSSS (CSO). The recovery of cut slopes was determined by evaluating a number of factors, including indices associated with plants on the slopes, soil enzyme activities (urease and sucrase), and soil nutrient content (soil organic matter (SOM), total phosphorous (TP), total potassium (TK), available nitrogen (AN), available phosphorous (AP), available potassium (AK), potassium (K+), calcium (Ca2+), magnesium (Mg2+), and sulphate (SO42-)). The results indicated that the vegetation and soil conditions differed between the three slopes. The Shannon-Wiener index (H), the Simpson index (D), and the Margalef index (R) values from the CSO and NS were lower than those of the CSW, whilst the Pielou index (E) value and vegetation canopy cover were higher for the CSO and NS than for the CSW. The content of SOM and AN in soil from the CSO was lower than in soil from the NS and CSW, and content of many nutrients were higher in soil from the CSO than in soil from the NS and CSW. This suggests that the restoration of vegetation and soil nutrients on the CSO was relatively successful. Our results indicated that the use of OSSS to restore cut slopes is effective in plateau areas. However, despite improvements in soil nutrient levels, there were still nutritional imbalances. Therefore, more attention should be paid to balancing nutrients in the later stage of OSSS implementation for the recovery of cut slopes at high altitudes.
To study the treatment effect and mechanism of a combined microwave (MW)-Fe-0/H2O2 Fenton-like process on concentrated landfill leachate, the effects of initial pH, Fe-0 dosage, H2O2 dosage, MW power and reaction time on the removal of organic substances were investigated. The phase change of Fe-0 before and after reaction and its catalytic mechanism were investigated using multiple analytical techniques. Results showed that the removal efficiencies of chemical oxygen demand, UV254 and color number were 58.70%, 85.69% and 88.30%, respectively, at initial pH of 2.0, Fe-0 dosage of 0.5 g/L, H2O2 dosage of 20 mL/L, MW power of 400 W and reaction time of 14 min. Comparison of different Fenton-like processes indicated that the MW-Fe-0/H2O2 Fenton-like process was the most efficient and significantly decreased the aromaticity degree, molecular weight and condensation degree of organic substances in the leachate. The fluorescence peak of concentrated leachate exhibited a blue-shift in the MW-Fe-0/H2O2 process, further indicating that the condensation degree of humic substances declined and molecular weight remarkably decreased. The mechanism exhibited an advanced oxidation effect of a heterogeneous Fenton reaction between iron oxide and H2O2, as well as of adsorption and precipitation effects of iron-based colloids, on organic substances. Moreover, thermal and non-thermal effects of MW accelerated these reactions, achieving fast removal of organic pollutants in the concentrated landfill leachate. Overall, the results of this study showe that the MW-Fe-0/H2O2 process is an effective and promising method to handle concentrated landfill leachate.
We performed the first-principles calculations on the elastic and thermal properties for chalcopyrite ZnSnX2 (X = P, As, Sb), employing the ultrasoft pseudo-potentials and generalized gradient approximation (GGA) under the frame of density functional theory. The equilibrium structural lattice constants are in good agreement with reported data. The elastic characteristics were evaluated under high-pressure condition (0-20 GPa), such as the elastic constants, bulk modulus, shear modulus, Poisson's ratio, Zener anisotropy and compressibility index. Combining with quasi-harmonic Debye model, the thermal properties were confirmed at different temperatures (0-1200 K) and pressures (0-20 GPa), including the heat capacity, thermal expansion, Debye temperature, entropy, and Gruneisen parameter. Based on the semi-empirical relation, the hardness of materials was determined at various temperatures and pressures. Finally, the phonon spectrum curves and vibration frequencies of phonon were evaluated to confirm the thermodynamic stability of ZnSnX2. The Raman scattering spectrum and infrared absorption spectrum were simulated for chalcopyrite ZnSnX2.
We report the preparation of an ammonia borane hydrolysis catalyst for use in hydrogen production by dispersing Rh nanoparticles on a nitrogen-doped carbon (NPC) support. The resulting Rh/NPC catalyst had a measured turnover frequency of 473.5 min(-1), higher than that of many previously reported Rh-based catalysts. This catalyst could also be reused eight times. The large surface area and abundant nitrogen-functional species of NPCs facilitate dispersion of Rh nanoparticles on their surface, providing numerous catalytically active sites for ammonia borane hydrolysis, thereby leading to high catalytic activity. This study demonstrates that NPC support can be used to prepare highly active catalysts. (C) 2018 Hydrogen Energy Publications LLC. Published by Elsevier Ltd. All rights reserved.
The Himalayan Monal Lophophorus impejanus is listed as National First Class Protected Animal in China, and also listed as Near Threatened species recently by the red list of China's vertebrates. In this study, the complete mitogenome sequence of the Himalayan Monal was determined for the first time. The mitogenome is a circular molecule of 16,709 bp in length, containing 13 protein-coding genes, two ribosome RNA genes, 22 transfer RNA genes and one non-coding regions. We also examine its phylogenetic position with respect to other eight Galliformes species. Tree constructed using Bayesian phylogenetic methods demonstrated L. impejanus as a sister to Lophophorus lhuysii. Our data would provide useful information for application in conservation genetics and evolution for the threatened species.
It is well known in [Absolutely pure modules, Proc. Amer. Math. Soc. 26 (1970) 561-566, Theorem 6] that a domain R is a Prufer domain if and only if every divisible R-module is absolutely pure, where an R-module A is called absolutely pure if Ext(R)(1) (N, A) = 0 for every finitely presented R-module N. In this paper, we extend this result to Prufer v-multiplication domains (PvMDs). To do this, comparing with ]An Introduction to Homological Algebra, 2nd edn. (Springer, Science+Business Media, LLC, New York, 2009), Theorem 3.69], we firstly give homological characterizations of w-purity, and we introduce the concept of absolutely w-pure modules over commutative rings with zero divisors. Finally, we prove that a domain R is a PvMD if and only if every divisible R-module is absolutely w-pure, and compare absolutely w-purity with absolutely purity by giving an example.
In this paper, we expand the Hamy mean (HM) operator and Dombi operations with interval-valued intuitionistic fuzzy numbers (IVIFNs) to propose the interval-valued intuitionistic fuzzy Dombi Hamy mean (IVIFDHM) operator, interval-valued intuitionistic fuzzy weighted Dombi Hamy mean (IVIFWDHM) operator, interval-valued intuitionistic fuzzy dual Dombi Hamy mean (IVIFDDHM) operator, and interval-valued intuitionistic fuzzy weighted dual Dombi Hamy mean (IVIFWDDHM) operator. Then the MADM models are designed with IVIFWDHM and IVIFWDDHM operators. Finally, we gave an example for evaluating the elderly tourism service quality in tourism destination to show the proposed models.
Carbon-based supercapacitor is one of the most promising energy conversion devices due to its ultrahigh power density and superior cycling durability, but most of carbon materials for high performance supercapacitor may involve high cost, sophisticated chemical procedures or tedious fabrication processes. Herein, a reproducible biomass-derived porous carbon with efficient ion-accessible surface and high content of heteroatoms has been successfully prepared by a simple high-temperature pyrolysis process. The facile chemical activation enables the as-synthesized materials own a hierarchical porous structure with an ideal pore size distribution and high contents of nitrogen (0.99 at%) and oxygen (8.99 at%), which is conducive to the high-efficiency transfer of electrolyte ions and enhancement in electrical conductivity of the materials. The as-fabricated hierarchical porous carbon materials deliver excellent specific capacitance of 287.1 F g(-1) at 1 A g(-1) and admirable cycling durability of 99.0% at current density of 1 A g(-1) after 10,000 cycles in 6.0 M KOH electrolyte. Remarkably, the assembled symmetric supercapacitor exhibits an excellent energy density of 43.0 Wh kg(-1) at power density of 875.0 W kg(-1) in ionic liquid electrolyte. This study shows that low cost porous carbon materials derived from biomass source by a facile pyrolysis might be a great option to fabricate high performance energy conversion device.
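
The import pattern above (list the export files, read each one, stack the results) can be tried on a small scale before pointing it at the real downloads. The sketch below is illustrative only: the folder `wos_demo`, the two files, their contents, and the helper `read_one()` are made up for the example.

```r
library(readr)
library(purrr)
library(fs)

# Hypothetical toy data: two small tab-delimited exports in a temporary folder
tmp <- path(tempdir(), "wos_demo")
dir_create(tmp)
write_lines(c("AB\tSN\tUT", "First abstract\t0000-0001\tWOS:001"), path(tmp, "a.txt"))
write_lines(c("AB\tSN\tUT", "Second abstract\t0000-0002\tWOS:002"), path(tmp, "b.txt"))

# Hypothetical helper: read one export, keeping every column as character
read_one <- function(file) {
  read_tsv(file, quote = "", col_types = cols(.default = col_character()))
}

# List the .txt files, read each one, and bind the rows into a single tibble
demo_tbl <- dir_ls(tmp, glob = "*.txt") %>% 
  map_dfr(read_one)
demo_tbl
```

`demo_tbl` has one row per record and the columns AB, SN, and UT, the same three fields that `read_plus()` keeps.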

esi_plus_cas_IF_set <-  read_rds(here("data", "esiJournalsList",  "esi_plus_cas_IF_set.rds"))

complete_set <- tbl %>% 
  left_join(esi_plus_cas_IF_set, by = c("SN" = "ISSN")) %>% 
  select(category = Category_ESI_cn, abstract = AB, pubs = UT)

17.2 Data tidying

complete_set

category <chr>
化学
植物学与动物学
环境科学与生态学
工程学
物理学
工程学
环境科学与生态学
数学
NA
化学
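
The NA in the category column most likely comes from the join: `left_join()` keeps every row of `tbl` and fills the category with NA when a publication's ISSN has no match in the journal list. A minimal sketch of that behavior, with made-up ISSNs, identifiers, and category label:

```r
library(dplyr)

# Hypothetical publications: the second ISSN is absent from the journal list
toy_pubs <- tibble(
  UT = c("WOS:001", "WOS:002"),
  SN = c("0000-0001", "9999-9999")
)

# Hypothetical journal list with the same key and category columns as above
toy_journals <- tibble(ISSN = "0000-0001", Category_ESI_cn = "chemistry")

# Unmatched rows are kept, with NA in the columns coming from the journal list
toy_joined <- toy_pubs %>% 
  left_join(toy_journals, by = c("SN" = "ISSN")) %>% 
  select(category = Category_ESI_cn, pubs = UT)
toy_joined
```

Such rows stay in `complete_set`; the later steps only drop records whose abstract is missing.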

Tidying the data and tokenizing the text

text_df <- complete_set %>% 
  filter(!is.na(abstract)) %>% 
  unnest_tokens(output = grams, input = abstract, token = "ngrams", n = 2)
text_df

category <chr>	pubs <chr>	grams <chr>
化学	WOS:000453550900030	four antimony
化学	WOS:000453550900030	antimony fluoride
化学	WOS:000453550900030	fluoride sulfates
化学	WOS:000453550900030	sulfates named
化学	WOS:000453550900030	named a
化学	WOS:000453550900030	a 2
化学	WOS:000453550900030	2 so
化学	WOS:000453550900030	so 4
化学	WOS:000453550900030	4 sbf
化学	WOS:000453550900030	sbf 3
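
To see what the bigram tokenizer does to a single string, the sketch below runs `unnest_tokens()` on the opening words of the first abstract. The one-row tibble and the identifier `doc1` are made up for the example.

```r
library(dplyr)
library(tidytext)

# One-row toy input: the opening words of the first abstract
toy <- tibble(
  pubs     = "doc1",
  abstract = "Four antimony fluoride sulfates named A(2)SO(4)SbF(3)"
)

# token = "ngrams" with n = 2 yields overlapping two-word sequences;
# text is lower-cased and punctuation is dropped along the way
toy_bigrams <- toy %>% 
  unnest_tokens(output = grams, input = abstract, token = "ngrams", n = 2)
toy_bigrams
```

The formula `A(2)SO(4)SbF(3)` is split at its parentheses, which is why bigrams such as "a 2" and "sbf 3" appear in the output above.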

Filtering out stop words

text_filted <- text_df %>% 
  separate(grams, into = c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word) 
text_filted

category <chr>	pubs <chr>	word1 <chr>	word2 <chr>
化学	WOS:000453550900030	antimony	fluoride
化学	WOS:000453550900030	fluoride	sulfates
化学	WOS:000453550900030	sulfates	named
化学	WOS:000453550900030	4	sbf
化学	WOS:000453550900030	sbf	3
化学	WOS:000453550900030	na	nh4
化学	WOS:000453550900030	successfully	synthesized
化学	WOS:000453550900030	hydrothermal	method
化学	WOS:000453550900030	introducing	sb3
化学	WOS:000453550900030	sb3	cations

text_unite <- text_filted %>% 
  unite(grams, word1, word2, sep = " ")
text_unite

category <chr>	pubs <chr>	grams <chr>
化学	WOS:000453550900030	antimony fluoride
化学	WOS:000453550900030	fluoride sulfates
化学	WOS:000453550900030	sulfates named
化学	WOS:000453550900030	4 sbf
化学	WOS:000453550900030	sbf 3
化学	WOS:000453550900030	na nh4
化学	WOS:000453550900030	successfully synthesized
化学	WOS:000453550900030	hydrothermal method
化学	WOS:000453550900030	introducing sb3
化学	WOS:000453550900030	sb3 cations
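
The separate, filter, unite sequence above can be checked on a handful of bigrams. In this sketch the four bigrams are picked from the first abstract, and the identifier `doc1` is made up.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy bigrams: two of them contain the stop word "a"
toy_grams <- tibble(
  pubs  = "doc1",
  grams = c("antimony fluoride", "named a", "a hydrothermal", "hydrothermal method")
)

toy_kept <- toy_grams %>% 
  # Split each bigram into its two words
  separate(grams, into = c("word1", "word2"), sep = " ") %>% 
  # Drop the bigram if either word is in tidytext's stop_words list
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>% 
  # Glue the surviving pairs back together
  unite(grams, word1, word2, sep = " ")
toy_kept
```

Only "antimony fluoride" and "hydrothermal method" remain. Tokens made of digits are not stop words, which is why pairs such as "4 sbf" pass the filter in the output above.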

17.3 Computing tf-idf

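# Count each bigram within each publication, then add tf, idf, and tf_idf.
# bind_tf_idf() takes (term, document, n). The call below passes pubs as the
# term and grams as the document, which is what produced the values printed
# here; use bind_tf_idf(grams, pubs, n) to weight bigrams within publications.
text_tf_idf <- text_unite %>% 
  count(pubs, grams) %>% 
  bind_tf_idf(pubs, grams, n) %>% 
  arrange(desc(tf_idf))
text_tf_idf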

pubs <chr>	grams <chr>	n <int>	tf <dbl>	idf <dbl>	tf_idf <dbl>
WOS:000306005400005	matijevic's result	1	1	10.078	10.078
WOS:000335829800004	generalized krull	2	1	10.078	10.078
WOS:000303447200007	coherent ring	1	1	9.673	9.673
WOS:000303447200007	yoke module	2	1	9.673	9.673
WOS:000287359600007	semigroup ring	1	1	9.385	9.385
WOS:000332072200002	variational characteristic	1	1	9.385	9.385
WOS:000363286700011	coherent integrally	1	1	9.385	9.385
WOS:000363286700011	fp id	1	1	9.385	9.385
WOS:000397073100009	finite set	1	1	9.385	9.385
WOS:000397073100009	tan 2016	1	1	9.385	9.385
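
As a reference point, `bind_tf_idf()` computes, for a term $t$ in a document $d$, the term frequency $\text{tf}(t, d) = n_{t,d} / \sum_{t'} n_{t',d}$ and the inverse document frequency $\text{idf}(t) = \ln(N / N_t)$, where $N$ is the number of documents and $N_t$ the number of documents containing $t$, and then multiplies the two. The toy counts below are invented so that the numbers can be checked by hand; the arguments are named to make the roles of term and document explicit.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: two publications (A, B) and three bigrams
toy_counts <- tibble(
  pubs  = c("A", "A", "B", "B"),
  grams = c("phase transition", "solid solution", "phase transition", "ray diffraction"),
  n     = c(2, 2, 1, 3)
)

# Bigrams are the terms, publications are the documents
toy_tf_idf <- toy_counts %>% 
  bind_tf_idf(term = grams, document = pubs, n = n)
toy_tf_idf

# By hand:
#   tf     = 2/4, 2/4, 1/4, 3/4
#   idf    = ln(2/2) = 0 for "phase transition", ln(2/1) for the other two
#   tf_idf = 0, 0.5 * ln(2), 0, 0.75 * ln(2)
```

A bigram that occurs in every document, like "phase transition" here, gets an idf of 0 and therefore a tf_idf of 0, however often it appears.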

# Keep the publications whose highest tf_idf exceeds 9.1, then plot all
# bigrams of those publications, one panel per publication
text_tf_idf %>% 
  group_by(pubs) %>% 
  filter(max(tf_idf) > 9.1) %>% 
  ggplot(aes(x = fct_reorder(grams, tf_idf), y = tf_idf, fill = pubs)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(vars(pubs), ncol = 2, scales = "free")

17.4 Text similarity

$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$

text_tf_idf %>% 
  pairwise_similarity(pubs, grams, tf_idf, upper = FALSE, sort = TRUE)

item1 <chr>	item2 <chr>	similarity <dbl>
WOS:000305196500008	WOS:000304137900014	0.879913
WOS:000304578000008	WOS:000307572600001	0.603016
WOS:000406149900040	WOS:000423704100022	0.566003
WOS:000281250400006	WOS:000300573100007	0.561753
WOS:000411449700011	WOS:000428369700003	0.519942
WOS:000335200600020	WOS:000313207400037	0.386541
WOS:000367544600012	WOS:000341472000006	0.360928
WOS:000319082900005	WOS:000331805400008	0.351754
WOS:000303900900050	WOS:000287717100007	0.350625
WOS:000348055700016	WOS:000346545700024	0.340893
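
The cosine formula can be verified on two tiny vectors. The sketch below computes the similarity once by hand and once with `pairwise_similarity()`; the items, features, weights, and the helper `cosine_sim()` are made up for the example.

```r
library(dplyr)
library(widyr)

# Hypothetical weights: items A and B share only the feature "x"
toy_weights <- tibble(
  item    = c("A", "A", "B", "B"),
  feature = c("x", "y", "x", "z"),
  value   = c(1, 1, 1, 1)
)

# Cosine similarity written out from the formula
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# A = (1, 1, 0) and B = (1, 0, 1) over the features (x, y, z)
by_hand <- cosine_sim(c(1, 1, 0), c(1, 0, 1))

# The same pair through widyr; upper = FALSE keeps each pair only once
toy_sim <- toy_weights %>% 
  pairwise_similarity(item, feature, value, upper = FALSE)
toy_sim
```

Both routes give 1 / (sqrt(2) * sqrt(2)) = 0.5. A feature missing for an item counts as a weight of zero, so two publications with no bigram in common have a similarity of 0.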

17.5 Word associations

Earlier we built the stop-word-filtered word pairs in text_filted; we now look at how these words are associated with one another.

text_count <- text_filted %>% 
  count(category, word1, word2, sort = TRUE) %>% 
  select(word1, word2, n, category)  %>% 
  arrange(category)
text_count

word1 <chr>	word2 <chr>	n <int>	category <chr>
piezoelectric	properties	65	材料科学
rights	reserved	58	材料科学
lead	free	45	材料科学
elsevier	b.v	36	材料科学
sintering	temperature	33	材料科学
phase	transition	32	材料科学
perovskite	structure	31	材料科学
cm	2	30	材料科学
ray	diffraction	30	材料科学
solid	solution	28	材料科学

text_count %>% 
  filter(n  > 20) %>% 
  as_tbl_graph()

## # A tbl_graph: 124 nodes and 94 edges
## #
## # A directed multigraph with 46 components
## #
## # Node Data: 124 x 1 (active)
##   name         
##   <chr>        
## 1 piezoelectric
## 2 rights       
## 3 lead         
## 4 elsevier     
## 5 sintering    
## 6 phase        
## # ... with 118 more rows
## #
## # Edge Data: 94 x 4
##    from    to     n category
##   <int> <int> <int> <chr>   
## 1     1    74    65 材料科学
## 2     2    75    58 材料科学
## 3     3    52    45 材料科学
## # ... with 91 more rows
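
`as_tbl_graph()` treats the first two columns of a data frame as the start and end of each edge and carries the remaining columns along as edge attributes. A small sketch with invented counts and a made-up category label shows how the node and edge tables arise:

```r
library(dplyr)
library(tidygraph)

# Hypothetical word-pair counts in the same column layout as text_count
toy_count <- tibble(
  word1    = c("phase", "solid", "phase"),
  word2    = c("transition", "solution", "diagram"),
  n        = c(32, 28, 5),
  category = "materials science"
)

# Keep the frequent pairs, then turn each remaining row into an edge
toy_graph <- toy_count %>% 
  filter(n > 20) %>% 
  as_tbl_graph()

# Pull the node and edge tables out of the graph
nodes <- toy_graph %>% activate(nodes) %>% as_tibble()
edges <- toy_graph %>% activate(edges) %>% as_tibble()
```

Here `nodes` holds the four distinct words and `edges` the two pairs with n above 20, each with its `n` and `category`. Because `text_count` is counted per category, the same word pair can occur in more than one category, which would explain why the real graph is reported as a multigraph.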

text_count %>% 
  filter(n  > 20) %>% 
  as_tbl_graph() %>% 
  ggraph(layout = "fr") +
  geom_node_point() +
  geom_edge_link(aes(color = category, edge_width = n)) +
  geom_node_text(aes(label = name), repel = TRUE) +
  #facet_wrap(vars(category), ncol = 3, scales = "free")
  facet_edges(vars(category), ncol = 3, scales = "free")

17.6 Next steps

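The data set is large and needs to be refined so that the useful key information can be extracted.
The remaining steps are still to be decided.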
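
One possible refinement, offered as an assumption rather than as the plan for this chapter, is to remove publisher boilerplate such as "rights reserved" and "elsevier b.v", which ranks among the most frequent pairs in text_count. The word list `boilerplate` and the toy pairs below are hypothetical.

```r
library(dplyr)

# Hypothetical list of copyright-notice words to drop
boilerplate <- c("rights", "reserved", "elsevier", "b.v")

# Toy word pairs taken from the top of the text_count output
toy_pairs <- tibble(
  word1 = c("piezoelectric", "rights", "elsevier", "lead"),
  word2 = c("properties", "reserved", "b.v", "free")
)

# Keep a pair only if neither word is on the boilerplate list
toy_refined <- toy_pairs %>% 
  filter(!word1 %in% boilerplate, !word2 %in% boilerplate)
toy_refined
```

The same filter could be applied to text_filted before counting, so that the graph shows subject terms only.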

⁶ http://tidytextmining.com/ngrams.html
