Chapter 17 Analyzing Publication Abstracts with Text Mining
library(tidyverse)
library(here)
library(fs)
library(purrr)
library(tidytext)
library(widyr)
library(tidygraph)
library(ggraph)
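This chapter uses text mining in R to analyze publication abstracts. Specifically, we use the R packages tidyverse, tidytext, widyr, tidygraph, and ggraph to analyze the abstract text of our university's publications indexed in Web of Science.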
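17.1 Data import
To keep the study reproducible, here are the steps used to obtain the data:
- Open https://www.webofknowledge.com/ and go to the Core Collection
- Enter the full name of the institution, for example Sichuan Normal University
- Choose the "Organization-Enhanced" search
- Select the time span 2009-2018
- Select SCI/SSCI/A&HCI
- Click Search
- Refine by document type: Article + Review
- At most 50 records are displayed per page, and at most 500 can be downloaded at a time
- Choose "Other File Formats" + "Full Record and Cited References" + "win UTF"
- Download and save each batch in turn
In total we obtained 1988 bibliographic records.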
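# Read one Web of Science export file (tab-separated) and keep only
# the abstract (AB), ISSN (SN), and accession number (UT) columns.
read_plus <- function(flnm) {
  read_tsv(flnm, quote = "", col_names = TRUE) %>%
    # or
    # read_delim(flnm, delim = "\t", quote = "", col_names = TRUE) %>%
    # select(AU, AF, SO, DE, C1, RP, FU, CR, TC, SN, PY, UT) %>%
    select(AB, SN, UT) # %>%
  # mutate(University = flnm %>% # adds the university name
  #   str_split("/", 7, simplify = TRUE) %>%
  #   .[, 6] %>%
  #   str_sub(start = 4)
  # )
}
# "*.txt" is a glob pattern, so it is passed as `glob` rather than `regexp`;
# `recurse` is the current name of the deprecated `recursive` argument.
tbl <- here("data", "newdata") %>%
  dir_ls(glob = "*.txt", recurse = TRUE) %>%
  map_dfr(~ read_plus(.))
tbl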
AB <chr> | |
---|---|
Four antimony fluoride sulfates named A(2)SO(4)SbF(3) (A = Na+, NH4+, K+, Rb+) have been successfully synthesized using a hydrothermal method by introducing Sb3+ cations with a stereochemically active lone pair in sulfates and subsequently tuning the structure through the second monovalent cations. All of the title compounds are stoichiometrically equivalent materials which share a common structural motif composed of a distorted SO4 tetrahedron and an asymmetric SbF3 polyhedron. However, the macroscopic centricities of these four compounds are significantly influenced by the size and coordination environment of cations; Na2SO4SbF3 crystallizes in centrosymmetric space groups Cmca and (NH4)(2)SO4SbF3 in Pbca, while K2SO4SbF3 and Rb2SO4SbF3 crystallizes in noncentrosymmetric space group P2(1)2(1)2(1). Complete characterization including thermal analyses, infrared and UV-vis spectroscopy, and theoretical calculations is also reported. Powder second harmonic generation measurement for noncentrosymmetric K2SO4SbF3 and Rb2SO4SbF3 indicated that both of them are type I phase-matchable. | |
A new species of Yunnanilus is described from Tuojiang River, Sichuan, China. The new species, Yunnanilus jiuchiensis, can be distinguished from other species of Yunnanilus by the following combination of characters: processus dentiformis absent; body covered with scarce scales; lateral-line incomplete, as long as half the length of the pectoral-fin length, with 6-11 pores; eye diameter larger than interorbital width; and caudal-peduncle length less than its depth. | |
Outside soil spray seeding (OSSS) is used widely for road cut revegetation, and the artificial soil used in OSSS can improve slope soil conditions and nutrients, and help promote plant growth and succession. Three different slopes was investigated to evaluate the effectiveness of OSSS for restoration, including a natural slope (NS), a cut slope without any artificial recovery treatment (CSW) and a cut slope treated with OSSS (CSO). The recovery of cut slopes was determined by evaluating a number of factors, including indices associated with plants on the slopes, soil enzyme activities (urease and sucrase), and soil nutrient content (soil organic matter (SOM), total phosphorous (TP), total potassium (TK), available nitrogen (AN), available phosphorous (AP), available potassium (AK), potassium (K+), calcium (Ca2+), magnesium (Mg2+), and sulphate (SO42-)). The results indicated that the vegetation and soil conditions differed between the three slopes. The Shannon-Wiener index (H), the Simpson index (D), and the Margalef index (R) values from the CSO and NS were lower than those of the CSW, whilst the Pielou index (E) value and vegetation canopy cover were higher for the CSO and NS than for the CSW. The content of SOM and AN in soil from the CSO was lower than in soil from the NS and CSW, and content of many nutrients were higher in soil from the CSO than in soil from the NS and CSW. This suggests that the restoration of vegetation and soil nutrients on the CSO was relatively successful. Our results indicated that the use of OSSS to restore cut slopes is effective in plateau areas. However, despite improvements in soil nutrient levels, there were still nutritional imbalances. Therefore, more attention should be paid to balancing nutrients in the later stage of OSSS implementation for the recovery of cut slopes at high altitudes. | |
To study the treatment effect and mechanism of a combined microwave (MW)-Fe-0/H2O2 Fenton-like process on concentrated landfill leachate, the effects of initial pH, Fe-0 dosage, H2O2 dosage, MW power and reaction time on the removal of organic substances were investigated. The phase change of Fe-0 before and after reaction and its catalytic mechanism were investigated using multiple analytical techniques. Results showed that the removal efficiencies of chemical oxygen demand, UV254 and color number were 58.70%, 85.69% and 88.30%, respectively, at initial pH of 2.0, Fe-0 dosage of 0.5 g/L, H2O2 dosage of 20 mL/L, MW power of 400 W and reaction time of 14 min. Comparison of different Fenton-like processes indicated that the MW-Fe-0/H2O2 Fenton-like process was the most efficient and significantly decreased the aromaticity degree, molecular weight and condensation degree of organic substances in the leachate. The fluorescence peak of concentrated leachate exhibited a blue-shift in the MW-Fe-0/H2O2 process, further indicating that the condensation degree of humic substances declined and molecular weight remarkably decreased. The mechanism exhibited an advanced oxidation effect of a heterogeneous Fenton reaction between iron oxide and H2O2, as well as of adsorption and precipitation effects of iron-based colloids, on organic substances. Moreover, thermal and non-thermal effects of MW accelerated these reactions, achieving fast removal of organic pollutants in the concentrated landfill leachate. Overall, the results of this study showe that the MW-Fe-0/H2O2 process is an effective and promising method to handle concentrated landfill leachate. | |
We performed the first-principles calculations on the elastic and thermal properties for chalcopyrite ZnSnX2 (X = P, As, Sb), employing the ultrasoft pseudo-potentials and generalized gradient approximation (GGA) under the frame of density functional theory. The equilibrium structural lattice constants are in good agreement with reported data. The elastic characteristics were evaluated under high-pressure condition (0-20 GPa), such as the elastic constants, bulk modulus, shear modulus, Poisson's ratio, Zener anisotropy and compressibility index. Combining with quasi-harmonic Debye model, the thermal properties were confirmed at different temperatures (0-1200 K) and pressures (0-20 GPa), including the heat capacity, thermal expansion, Debye temperature, entropy, and Gruneisen parameter. Based on the semi-empirical relation, the hardness of materials was determined at various temperatures and pressures. Finally, the phonon spectrum curves and vibration frequencies of phonon were evaluated to confirm the thermodynamic stability of ZnSnX2. The Raman scattering spectrum and infrared absorption spectrum were simulated for chalcopyrite ZnSnX2. | |
We report the preparation of an ammonia borane hydrolysis catalyst for use in hydrogen production by dispersing Rh nanoparticles on a nitrogen-doped carbon (NPC) support. The resulting Rh/NPC catalyst had a measured turnover frequency of 473.5 min(-1), higher than that of many previously reported Rh-based catalysts. This catalyst could also be reused eight times. The large surface area and abundant nitrogen-functional species of NPCs facilitate dispersion of Rh nanoparticles on their surface, providing numerous catalytically active sites for ammonia borane hydrolysis, thereby leading to high catalytic activity. This study demonstrates that NPC support can be used to prepare highly active catalysts. (C) 2018 Hydrogen Energy Publications LLC. Published by Elsevier Ltd. All rights reserved. | |
The Himalayan Monal Lophophorus impejanus is listed as National First Class Protected Animal in China, and also listed as Near Threatened species recently by the red list of China's vertebrates. In this study, the complete mitogenome sequence of the Himalayan Monal was determined for the first time. The mitogenome is a circular molecule of 16,709 bp in length, containing 13 protein-coding genes, two ribosome RNA genes, 22 transfer RNA genes and one non-coding regions. We also examine its phylogenetic position with respect to other eight Galliformes species. Tree constructed using Bayesian phylogenetic methods demonstrated L. impejanus as a sister to Lophophorus lhuysii. Our data would provide useful information for application in conservation genetics and evolution for the threatened species. | |
It is well known in [Absolutely pure modules, Proc. Amer. Math. Soc. 26 (1970) 561-566, Theorem 6] that a domain R is a Prufer domain if and only if every divisible R-module is absolutely pure, where an R-module A is called absolutely pure if Ext(R)(1) (N, A) = 0 for every finitely presented R-module N. In this paper, we extend this result to Prufer v-multiplication domains (PvMDs). To do this, comparing with ]An Introduction to Homological Algebra, 2nd edn. (Springer, Science+Business Media, LLC, New York, 2009), Theorem 3.69], we firstly give homological characterizations of w-purity, and we introduce the concept of absolutely w-pure modules over commutative rings with zero divisors. Finally, we prove that a domain R is a PvMD if and only if every divisible R-module is absolutely w-pure, and compare absolutely w-purity with absolutely purity by giving an example. | |
In this paper, we expand the Hamy mean (HM) operator and Dombi operations with interval-valued intuitionistic fuzzy numbers (IVIFNs) to propose the interval-valued intuitionistic fuzzy Dombi Hamy mean (IVIFDHM) operator, interval-valued intuitionistic fuzzy weighted Dombi Hamy mean (IVIFWDHM) operator, interval-valued intuitionistic fuzzy dual Dombi Hamy mean (IVIFDDHM) operator, and interval-valued intuitionistic fuzzy weighted dual Dombi Hamy mean (IVIFWDDHM) operator. Then the MADM models are designed with IVIFWDHM and IVIFWDDHM operators. Finally, we gave an example for evaluating the elderly tourism service quality in tourism destination to show the proposed models. | |
Carbon-based supercapacitor is one of the most promising energy conversion devices due to its ultrahigh power density and superior cycling durability, but most of carbon materials for high performance supercapacitor may involve high cost, sophisticated chemical procedures or tedious fabrication processes. Herein, a reproducible biomass-derived porous carbon with efficient ion-accessible surface and high content of heteroatoms has been successfully prepared by a simple high-temperature pyrolysis process. The facile chemical activation enables the as-synthesized materials own a hierarchical porous structure with an ideal pore size distribution and high contents of nitrogen (0.99 at%) and oxygen (8.99 at%), which is conducive to the high-efficiency transfer of electrolyte ions and enhancement in electrical conductivity of the materials. The as-fabricated hierarchical porous carbon materials deliver excellent specific capacitance of 287.1 F g(-1) at 1 A g(-1) and admirable cycling durability of 99.0% at current density of 1 A g(-1) after 10,000 cycles in 6.0 M KOH electrolyte. Remarkably, the assembled symmetric supercapacitor exhibits an excellent energy density of 43.0 Wh kg(-1) at power density of 875.0 W kg(-1) in ionic liquid electrolyte. This study shows that low cost porous carbon materials derived from biomass source by a facile pyrolysis might be a great option to fabricate high performance energy conversion device. |
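The import pattern above (list the files, read each one, stack the rows) can be tried without the real exports. The following is a minimal sketch: the two files, their contents, the folder name `wos_demo`, and the helper `read_one` are made up for illustration, and `map_dfr()` needs dplyr installed to bind the rows.

```r
library(readr)
library(purrr)
library(fs)

# Hypothetical stand-ins for two downloaded export batches
tmp <- file.path(tempdir(), "wos_demo")
dir.create(tmp, showWarnings = FALSE)
write_tsv(
  data.frame(
    AB = c("first abstract", "second abstract"),
    SN = c("0000-0001", "0000-0002"),
    UT = c("WOS:1", "WOS:2")
  ),
  file.path(tmp, "batch1.txt")
)
write_tsv(
  data.frame(AB = "third abstract", SN = "0000-0003", UT = "WOS:3"),
  file.path(tmp, "batch2.txt")
)

# Read one file; quote = "" keeps stray quotation marks as literal text
read_one <- function(path) {
  read_tsv(path, quote = "", col_types = cols(.default = col_character()))
}

# List every .txt file in the folder and stack the results row-wise
demo_tbl <- map_dfr(dir_ls(tmp, glob = "*.txt"), read_one)
demo_tbl
```

Reading every column as character is a choice made for this sketch; it avoids type-guessing differences between batches when the rows are bound together.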
esi_plus_cas_IF_set <- read_rds(here("data", "esiJournalsList", "esi_plus_cas_IF_set.rds"))
# Attach the ESI subject category to each record by matching on ISSN
complete_set <- tbl %>%
  left_join(esi_plus_cas_IF_set, by = c("SN" = "ISSN")) %>%
  select(category = Category_ESI_cn, abstract = AB, pubs = UT) # %>%
# rename(ISSN = SN) %>%
17.2 Data tidying
complete_set
category <chr> | |
---|---|
化学 | |
植物学与动物学 | |
环境科学与生态学 | |
工程学 | |
物理学 | |
工程学 | |
环境科学与生态学 | |
数学 | |
NA | |
化学 |
Tidying and tokenizing the text
# Drop records without an abstract, then split each abstract into bigrams
text_df <- complete_set %>%
  filter(!is.na(abstract)) %>%
  unnest_tokens(output = grams, input = abstract, token = "ngrams", n = 2)
text_df
category <chr> | pubs <chr> | grams <chr> | ||
---|---|---|---|---|
化学 | WOS:000453550900030 | four antimony | ||
化学 | WOS:000453550900030 | antimony fluoride | ||
化学 | WOS:000453550900030 | fluoride sulfates | ||
化学 | WOS:000453550900030 | sulfates named | ||
化学 | WOS:000453550900030 | named a | ||
化学 | WOS:000453550900030 | a 2 | ||
化学 | WOS:000453550900030 | 2 so | ||
化学 | WOS:000453550900030 | so 4 | ||
化学 | WOS:000453550900030 | 4 sbf | ||
化学 | WOS:000453550900030 | sbf 3 |
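To see what bigram tokenization does on a small input, here is a minimal sketch that runs `unnest_tokens()` on the first four words of the first abstract. The one-row table `toy` and the label `doc1` are illustrative only.

```r
library(tibble)
library(tidytext)

# One hypothetical document holding the opening words of the first abstract
toy <- tibble(pubs = "doc1", abstract = "Four antimony fluoride sulfates")

# token = "ngrams" with n = 2 slides a two-word window over the text;
# unnest_tokens() also lowercases the tokens by default
toy_bigrams <- unnest_tokens(
  toy,
  output = grams, input = abstract,
  token = "ngrams", n = 2
)
toy_bigrams$grams
```

Four words yield three overlapping bigrams: "four antimony", "antimony fluoride", and "fluoride sulfates", which match the first three rows of the table above.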
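Filtering out stop words
# Split each bigram into its two words and drop any pair that contains a stop word
text_filted <- text_df %>%
  separate(grams, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
text_filted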
category <chr> | pubs <chr> | word1 <chr> | word2 <chr> | |
---|---|---|---|---|
化学 | WOS:000453550900030 | antimony | fluoride | |
化学 | WOS:000453550900030 | fluoride | sulfates | |
化学 | WOS:000453550900030 | sulfates | named | |
化学 | WOS:000453550900030 | 4 | sbf | |
化学 | WOS:000453550900030 | sbf | 3 | |
化学 | WOS:000453550900030 | na | nh4 | |
化学 | WOS:000453550900030 | successfully | synthesized | |
化学 | WOS:000453550900030 | hydrothermal | method | |
化学 | WOS:000453550900030 | introducing | sb3 | |
化学 | WOS:000453550900030 | sb3 | cations |
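The filtering step can be checked on a few of the bigrams printed earlier. This is a small sketch under the assumption that the `stop_words` table shipped with tidytext is the one in use; the four-row input `toy` is hand-picked from the bigram table.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Four bigrams taken from the tokenized output of the first abstract
toy <- tibble(grams = c("four antimony", "antimony fluoride", "named a", "a 2"))

toy_kept <- toy %>%
  separate(grams, into = c("word1", "word2"), sep = " ") %>%
  # keep a pair only when neither word is in the stop-word list
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
toy_kept
```

Only "antimony fluoride" survives, because "four" and "a" are stop words. A bigram is dropped as soon as either of its words is a stop word, which is why the filtered table is much shorter than the tokenized one.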
text_unite <- text_filted %>%
  unite(grams, word1, word2, sep = " ")
text_unite
category <chr> | pubs <chr> | grams <chr> | ||
---|---|---|---|---|
化学 | WOS:000453550900030 | antimony fluoride | ||
化学 | WOS:000453550900030 | fluoride sulfates | ||
化学 | WOS:000453550900030 | sulfates named | ||
化学 | WOS:000453550900030 | 4 sbf | ||
化学 | WOS:000453550900030 | sbf 3 | ||
化学 | WOS:000453550900030 | na nh4 | ||
化学 | WOS:000453550900030 | successfully synthesized | ||
化学 | WOS:000453550900030 | hydrothermal method | ||
化学 | WOS:000453550900030 | introducing sb3 | ||
化学 | WOS:000453550900030 | sb3 cations |
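17.3 Computing tf-idf
text_tf_idf <- text_unite %>%
  count(pubs, grams) %>%
  # Note: bind_tf_idf() takes (term, document, n), so this call treats pubs
  # as the term and grams as the document; bind_tf_idf(grams, pubs, n)
  # would instead score each bigram within each publication.
  bind_tf_idf(pubs, grams, n) %>%
  arrange(desc(tf_idf))
text_tf_idf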
pubs <chr> | grams <chr> | n <int> | tf <dbl> | idf <dbl> | tf_idf <dbl> |
---|---|---|---|---|---|
WOS:000306005400005 | matijevic's result | 1 | 1 | 10.078 | 10.078 |
WOS:000335829800004 | generalized krull | 2 | 1 | 10.078 | 10.078 |
WOS:000303447200007 | coherent ring | 1 | 1 | 9.673 | 9.673 |
WOS:000303447200007 | yoke module | 2 | 1 | 9.673 | 9.673 |
WOS:000287359600007 | semigroup ring | 1 | 1 | 9.385 | 9.385 |
WOS:000332072200002 | variational characteristic | 1 | 1 | 9.385 | 9.385 |
WOS:000363286700011 | coherent integrally | 1 | 1 | 9.385 | 9.385 |
WOS:000363286700011 | fp id | 1 | 1 | 9.385 | 9.385 |
WOS:000397073100009 | finite set | 1 | 1 | 9.385 | 9.385 |
WOS:000397073100009 | tan 2016 | 1 | 1 | 9.385 | 9.385 |
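To make the tf, idf, and tf_idf columns concrete, the sketch below computes them by hand in base R on a made-up count table. It assumes the usual definitions, which are also the ones documented for `bind_tf_idf()`: tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. The documents `d1` and `d2` and their counts are hypothetical.

```r
# Hypothetical term counts: one row per (document, term) pair
counts <- data.frame(
  doc  = c("d1", "d1", "d2", "d2"),
  term = c("phase transition", "solid solution", "phase transition", "coherent ring"),
  n    = c(2, 2, 1, 3)
)

n_docs <- length(unique(counts$doc))

# tf: share of the document's total count taken by this term
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# Number of documents containing each term (rows are unique doc-term pairs)
doc_freq <- ave(counts$n, counts$term, FUN = length)

# idf: natural log of (number of documents / documents containing the term)
counts$idf <- log(n_docs / doc_freq)

counts$tf_idf <- counts$tf * counts$idf
counts
```

"phase transition" appears in both documents, so its idf, and therefore its tf_idf, is zero. The terms that occur in only one document get a positive score that grows with their share of that document.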
text_tf_idf %>%
  group_by(pubs) %>%
  filter(max(tf_idf) > 9.1) %>%
  # dplyr::distinct(pubs)
  ggplot(aes(x = fct_reorder(grams, tf_idf), y = tf_idf, fill = pubs)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(vars(pubs), ncol = 2, scales = "free")
17.4 Text similarity
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
text_tf_idf %>%
  pairwise_similarity(pubs, grams, tf_idf, upper = FALSE, sort = TRUE)
item1 <chr> | item2 <chr> | similarity <dbl> | ||
---|---|---|---|---|
WOS:000305196500008 | WOS:000304137900014 | 0.879913 | ||
WOS:000304578000008 | WOS:000307572600001 | 0.603016 | ||
WOS:000406149900040 | WOS:000423704100022 | 0.566003 | ||
WOS:000281250400006 | WOS:000300573100007 | 0.561753 | ||
WOS:000411449700011 | WOS:000428369700003 | 0.519942 | ||
WOS:000335200600020 | WOS:000313207400037 | 0.386541 | ||
WOS:000367544600012 | WOS:000341472000006 | 0.360928 | ||
WOS:000319082900005 | WOS:000331805400008 | 0.351754 | ||
WOS:000303900900050 | WOS:000287717100007 | 0.350625 | ||
WOS:000348055700016 | WOS:000346545700024 | 0.340893 |
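Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that definition in base R on three small hypothetical vectors; the helper name `cosine_similarity` is illustrative and is not part of widyr.

```r
# Cosine of the angle between two numeric vectors of equal length
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical weight vectors over the same three terms
doc_a <- c(1, 2, 0)
doc_b <- c(2, 4, 0) # same direction as doc_a, twice the length
doc_c <- c(0, 0, 3) # shares no term with doc_a

cosine_similarity(doc_a, doc_b)
cosine_similarity(doc_a, doc_c)
```

The first pair scores 1 because the vectors point in the same direction regardless of length, and the second scores 0 because the documents share no weighted term. This is why the measure suits abstracts of different lengths.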
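17.5 Word associations
Earlier we built the filtered word pairs text_filted; we now examine how these words are associated with one another.
# Count how often each word pair occurs within each category
text_count <- text_filted %>%
  count(category, word1, word2, sort = TRUE) %>%
  select(word1, word2, n, category) %>%
  arrange(category)
text_count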
word1 <chr> | word2 <chr> | n <int> | category <chr> | |
---|---|---|---|---|
piezoelectric | properties | 65 | 材料科学 | |
rights | reserved | 58 | 材料科学 | |
lead | free | 45 | 材料科学 | |
elsevier | b.v | 36 | 材料科学 | |
sintering | temperature | 33 | 材料科学 | |
phase | transition | 32 | 材料科学 | |
perovskite | structure | 31 | 材料科学 | |
cm | 2 | 30 | 材料科学 | |
ray | diffraction | 30 | 材料科学 | |
solid | solution | 28 | 材料科学 |
text_count %>%
  filter(n > 20) %>%
  as_tbl_graph()
## # A tbl_graph: 124 nodes and 94 edges
## #
## # A directed multigraph with 46 components
## #
## # Node Data: 124 x 1 (active)
## name
## <chr>
## 1 piezoelectric
## 2 rights
## 3 lead
## 4 elsevier
## 5 sintering
## 6 phase
## # ... with 118 more rows
## #
## # Edge Data: 94 x 4
## from to n category
## <int> <int> <int> <chr>
## 1 1 74 65 材料科学
## 2 2 75 58 材料科学
## 3 3 52 45 材料科学
## # ... with 91 more rows
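The conversion from a word-pair table to a graph can be seen on a three-row edge list. This is a minimal sketch using three pairs and counts from the table above. It relies on `as_tbl_graph()` treating the first two columns of a data frame as the start and end of each edge, and keeping the remaining columns as edge attributes.

```r
library(tidygraph)

# Three word pairs with their counts, taken from the printed table
edges <- data.frame(
  word1 = c("lead", "phase", "solid"),
  word2 = c("free", "transition", "solution"),
  n     = c(45, 32, 28)
)

# Each distinct word becomes a node; each row becomes a directed edge
g <- as_tbl_graph(edges)
g
```

Six distinct words give six nodes and three rows give three edges. The same logic produces the 124 nodes and 94 edges reported for the full filtered table.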
text_count %>%
  filter(n > 20) %>%
  as_tbl_graph() %>%
  ggraph(layout = "fr") +
  geom_node_point() +
  geom_edge_link(aes(color = category, edge_width = n)) +
  geom_node_text(aes(label = name), repel = TRUE) +
  # facet_wrap(vars(category), ncol = 3, scales = "free")
  facet_edges(vars(category), ncol = 3, scales = "free")