Example workflow and report

This section shows how to load Web of Science data with the nails package functions and then create an example report with ggplot2-based visualizations.

Loading data

Below is an example of how data exported from Web of Science can be loaded and parsed using the nails package functions.

# Setup

# Load packages
devtools::load_all()
library(ggplot2)

# Set ggplot theme
theme_set(theme_minimal(12))

# Load data
literature <- read_wos_data("../tests/testthat/test_data")

# Clean data
literature <- clean_wos_data(literature)

Generating visualizations with knitr

Below we show how to generate an example report using nails calls, with ggplot2 and knitr producing the visual output.

This report provides an analysis of the records downloaded from Web of Science. The analysis identifies the important authors, journals, and keywords in the dataset based on the number of occurrences and citation counts. A citation network of the provided records is created and used to identify the important papers according to their in-degree, total citation count and PageRank scores. The analysis also finds often-cited references that were not included in the original dataset downloaded from the Web of Science.

Reports can also be generated with the online analysis service, and the source code is available on GitHub. Instructions and links to tutorial videos can be found on the project page. Please consider citing our research paper on bibliometrics if you publish the analysis results.

# Setup

# Load packages
devtools::load_all()
library(ggplot2)

# Set ggplot theme
theme_set(theme_minimal(12))

The analysed dataset, loaded in the section “Loading data”, consists of 600 records with 69 variables. More information about the variables can be found in the Web of Science documentation.

Publication years

ggplot(literature, aes(YearPublished)) + geom_histogram(binwidth = 1, fill = "darkgreen") + 
    ggtitle("Year published") + xlab("Year") + ylab("Article count")

# Calculate relative publication counts
# yearDF <- as.data.frame(table(literature$YearPublished))
# names(yearDF) <- c('Year', 'Freq')  # Fix column names

# Merge to dataframe of total publication numbers (years)
# yearDF <- merge(yearDF, years, by.x = 'Year', by.y = 'Year', all.x = TRUE)
# yearDF$Year <- as.numeric(as.character(yearDF$Year))  # factor to numeric
# Calculate published articles per total articles by year
# yearDF$Fraction <- yearDF$Freq / yearDF$Records

Relative publication volume

# ADD PLOT HERE!
print("Placeholder")

## [1] "Placeholder"
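The relative publication volume left as a placeholder above could be computed roughly as follows. This is only a sketch on made-up data: the `years` data frame, with its `Year` and `Records` columns (total records per year), is a hypothetical stand-in for the table that the commented-out code refers to, and none of the numbers come from the analysed dataset.

```r
# Toy stand-in for literature$YearPublished (hypothetical values)
year_published <- c(2015, 2016, 2016, 2017, 2017, 2017)

# Hypothetical totals of all records per year (assumed to come from an
# external source in a real analysis)
years <- data.frame(Year = c(2015, 2016, 2017), Records = c(100, 100, 100))

# Count the articles in the dataset per year
yearDF <- as.data.frame(table(year_published), stringsAsFactors = FALSE)
names(yearDF) <- c("Year", "Freq")
yearDF$Year <- as.numeric(yearDF$Year)

# Attach the yearly totals and compute the dataset's share of each year
yearDF <- merge(yearDF, years, by = "Year", all.x = TRUE)
yearDF$Fraction <- yearDF$Freq / yearDF$Records

# Draw the plot only when ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
    p <- ggplot(yearDF, aes(Year, Fraction)) + geom_line() + 
        ggtitle("Relative publication volume") + xlab("Year") + 
        ylab("Fraction of all records")
    print(p)
}
```

Using `all.x = TRUE` keeps years that have no matching total; their `Fraction` is then `NA` rather than silently dropped.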

Important authors

Sorted by the number of articles published and by the total number of citations.

# Get author network nodes, which contain the required information
author_network <- get_author_network(literature)
author_nodes <- author_network$author_nodes
# Change Id to AuthorFullName
names(author_nodes)[names(author_nodes) == "Id"] <- "AuthorFullName"

# Sort by number of articles by author
author_nodes <- author_nodes[with(author_nodes, order(-Freq)), ]
# Re-order factor levels
author_nodes <- transform(author_nodes, AuthorFullName = reorder(AuthorFullName, 
    Freq))

ggplot(head(author_nodes, 25), aes(AuthorFullName, Freq)) + geom_bar(stat = "identity", 
    fill = "blue") + coord_flip() + ggtitle("Productive authors") + xlab("Author") + 
    ylab("Number of articles")

# Reorder AuthorFullName factor according to TotalTimesCited (decreasing
# order)
author_nodes <- transform(author_nodes, AuthorFullName = reorder(AuthorFullName, 
    TotalTimesCited))

# Sort by total number of citations per author
author_nodes <- author_nodes[with(author_nodes, order(-TotalTimesCited)), ]

ggplot(head(author_nodes, 25), aes(AuthorFullName, TotalTimesCited)) + geom_bar(stat = "identity", 
    fill = "blue") + coord_flip() + ggtitle("Most cited authors") + xlab("Author") + 
    ylab("Total times cited")
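The sort-then-reorder pattern used in the author plots matters because ggplot2 orders bars by factor levels, not by row order. A minimal sketch on made-up data (the author names and counts are hypothetical) illustrates the two steps:

```r
# Hypothetical author table; stands in for author_nodes
authors <- data.frame(AuthorFullName = c("Smith, A", "Lee, B", "Kumar, C"), 
    Freq = c(2, 7, 4), stringsAsFactors = TRUE)

# Sorting the rows decides which authors head() would keep
authors <- authors[order(-authors$Freq), ]

# reorder() sets the factor levels that ggplot2 uses for the bar order
authors$AuthorFullName <- reorder(authors$AuthorFullName, authors$Freq)

# Levels run from the smallest to the largest Freq, so after coord_flip()
# the most productive author is drawn at the top of the chart
levels(authors$AuthorFullName)
```

Without the `reorder()` call the bars would follow the alphabetical level order, regardless of how the rows are sorted.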

Important publications

Sorted by number of published articles in the dataset and by the total number of citations.

# Calculate publication occurrences
publications <- as.data.frame(table(literature$PublicationName))

# Fix names
names(publications) <- c("Publication", "Count")

# Trim publication name to maximum of 50 characters for displaying in plot
publications$Publication <- strtrim(publications$Publication, 50)

# Sort descending
publications <- publications[with(publications, order(-Count)), ]

# Reorder factor levels
publications <- transform(publications, Publication = reorder(Publication, Count))


# WHY???
# literature <- merge(literature, citation_sums, by = 'PublicationName')

ggplot(head(publications, 25), aes(Publication, Count)) + geom_bar(stat = "identity", 
    fill = "orange") + coord_flip() + theme(legend.position = "none") + ggtitle("Most popular publications") + 
    xlab("Publication") + ylab("Article count")

# Calculating total citations for each publication.
citation_sums <- aggregate(literature$TimesCited, by = list(PublicationName = literature$PublicationName), 
    FUN = sum, na.rm = TRUE)

# Fix column names
names(citation_sums) <- c("PublicationName", "PublicationTotalCitations")

# Trim publication name to maximum of 50 characters for displaying in plot
citation_sums$PublicationName <- strtrim(citation_sums$PublicationName, 50)

# Sort descending and reorder factor levels accordingly
citation_sums <- citation_sums[with(citation_sums, order(-PublicationTotalCitations)), 
    ]
citation_sums <- transform(citation_sums, PublicationName = reorder(PublicationName, 
    PublicationTotalCitations))
ggplot(head(citation_sums, 25), aes(PublicationName, PublicationTotalCitations)) + 
    geom_bar(stat = "identity", fill = "orange") + coord_flip() + theme(legend.position = "none") + 
    ggtitle("Most cited publications") + xlab("Publication") + ylab("Total times cited")
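The per-publication citation totals rely on `aggregate()`. A minimal sketch on made-up records (the journal names and counts are hypothetical) shows the grouping step and how `na.rm = TRUE` treats a missing citation count:

```r
# Hypothetical records; stands in for the literature data frame
toy <- data.frame(PublicationName = c("J A", "J B", "J A", "J B", "J C"), 
    TimesCited = c(3, 10, NA, 5, 1), stringsAsFactors = FALSE)

# Sum citations per publication; na.rm = TRUE is passed on to sum()
citation_sums <- aggregate(toy$TimesCited, 
    by = list(PublicationName = toy$PublicationName), FUN = sum, na.rm = TRUE)
names(citation_sums) <- c("PublicationName", "PublicationTotalCitations")

# Sort descending by total citations
citation_sums <- citation_sums[order(-citation_sums$PublicationTotalCitations), ]
citation_sums
```

With `na.rm = TRUE`, a record with a missing count contributes nothing to its journal's total instead of turning the whole sum into `NA`.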

Important keywords

Sorted by the number of articles where the keyword is mentioned and by the total number of citations for the keyword.

# Calculating total citations for each keyword

literature_by_keywords <- arrange_by(literature, "AuthorKeywords")



# Sometimes the AuthorKeywords column is empty. The following if-else check
# prevents a crash in that situation by skipping the keyword analysis.
if (nrow(literature_by_keywords) == 0) {
    cat("No keywords.")
} else {
    keyword_citation_sum <- aggregate(literature_by_keywords$TimesCited, by = list(AuthorKeywords = literature_by_keywords$AuthorKeywords), 
        FUN = sum, na.rm = TRUE)
    names(keyword_citation_sum) <- c("AuthorKeywords", "TotalTimesCited")
    
    keywords <- unlist(strsplit(literature$AuthorKeywords, ";"))
    keywords <- trim(keywords)
    keywords <- as.data.frame(table(keywords))
    names(keywords) <- c("AuthorKeywords", "Freq")
    
    keywords <- merge(keywords, keyword_citation_sum, by = "AuthorKeywords")
    keywords <- keywords[with(keywords, order(-Freq)), ]
    keywords <- transform(keywords, AuthorKeywords = reorder(AuthorKeywords, 
        Freq))
    
    ggplot(head(keywords, 25), aes(AuthorKeywords, Freq)) + geom_bar(stat = "identity", 
        fill = "purple") + coord_flip() + ggtitle("Popular keywords") + xlab("Keyword") + 
        ylab("Number of occurrences")
}

if (nrow(literature_by_keywords) > 0) {
    keywords <- keywords[with(keywords, order(-TotalTimesCited)), ]
    keywords <- transform(keywords, AuthorKeywords = reorder(AuthorKeywords, 
        TotalTimesCited))
    ggplot(head(keywords, 25), aes(AuthorKeywords, TotalTimesCited)) + geom_bar(stat = "identity", 
        fill = "purple") + coord_flip() + ggtitle("Most cited keywords") + xlab("Keyword") + 
        ylab("Total times cited")
}
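The keyword tally boils down to splitting the semicolon-separated keyword strings and counting the pieces. The sketch below uses made-up keyword strings and base R's `trimws()` in place of the `trim()` helper called above. The case folding and the removal of missing values are additions for illustration, not steps taken from the original code.

```r
# Hypothetical AuthorKeywords values, one string per article
author_keywords <- c("deep learning; radiomics", "radiomics; PCA", NA, 
    "Deep Learning")

# Split on ';', strip surrounding whitespace, drop missing and empty entries
keywords <- unlist(strsplit(author_keywords, ";"))
keywords <- trimws(keywords)
keywords <- keywords[!is.na(keywords) & keywords != ""]

# Assumption: fold case so that 'Deep Learning' and 'deep learning' are
# counted as the same keyword
keywords <- tolower(keywords)

# Count occurrences and sort descending
keyword_counts <- as.data.frame(table(keywords), stringsAsFactors = FALSE)
names(keyword_counts) <- c("AuthorKeywords", "Freq")
keyword_counts <- keyword_counts[order(-keyword_counts$Freq), ]
keyword_counts
```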

Important papers

The most important papers and other sources are identified below using three importance measures: 1) in-degree in the citation network, 2) citation count provided by Web of Science (only for papers included in the dataset), and 3) PageRank score in the citation network. The 25 highest-scoring papers are identified separately for each measure. The results are then combined and duplicates are removed. Results are sorted by in-degree, and ties are broken first by citation count and then by PageRank.
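To make the network measures concrete, the sketch below computes in-degree and PageRank for a small made-up citation network in base R. It is an illustration under stated assumptions, not the computation nails performs: the edge list is hypothetical, the damping factor 0.85 is the conventional default, and papers that cite nothing are assumed to spread their score evenly over all nodes.

```r
# Hypothetical citation edges: 'from' cites 'to'
edges <- data.frame(from = c("A", "B", "C", "C", "D"), 
    to = c("C", "C", "E", "D", "E"), stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$from, edges$to)))
n <- length(nodes)

# Adjacency matrix with adj[i, j] = 1 when node i cites node j
adj <- matrix(0, n, n, dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1

# In-degree: how often each node is cited within the network
in_degree <- colSums(adj)

# PageRank by power iteration with damping factor d
d <- 0.85
out_degree <- rowSums(adj)
pr <- rep(1 / n, n)
for (i in seq_len(200)) {
    # Score held by nodes without outgoing citations
    dangling <- sum(pr[out_degree == 0])
    # Each citing node splits its score equally among the papers it cites
    shares <- ifelse(out_degree > 0, pr / out_degree, 0)
    new_pr <- (1 - d) / n + d * (as.vector(shares %*% adj) + dangling / n)
    converged <- sum(abs(new_pr - pr)) < 1e-12
    pr <- new_pr
    if (converged) break
}
names(pr) <- nodes
round(pr, 3)
```

In this toy network C and E share the highest in-degree, but E gets the higher PageRank because it is cited by papers that are themselves cited.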

When a Digital Object Identifier (DOI) is available, the full paper can be found using the Resolve DOI website.
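As a rough sketch, a DOI can be pulled out of a reference string and turned into a resolver link. The first reference below is taken from the report table. The second is a hypothetical entry without a DOI, and the `https://doi.org/` prefix is an assumption about which resolver to use.

```r
# Reference strings in the format used in the report tables
full_reference <- c("LIN J, 2017, J RAMAN SPECTROSC, V48, P16, DOI 10.1002/JRS.4982", 
    "SOME AUTHOR, 2015, SOME JOURNAL, V1, P1")

# Keep everything after 'DOI '; references without a DOI become NA
has_doi <- grepl("DOI ", full_reference, fixed = TRUE)
doi <- ifelse(has_doi, sub("^.*DOI\\s+", "", full_reference), NA)

# Build resolver links (assumed resolver: https://doi.org/)
doi_url <- ifelse(is.na(doi), NA, paste0("https://doi.org/", doi))
doi_url
```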

# Extract citation nodes
citation_network <- get_citation_network(literature)
citation_nodes <- citation_network$citation_nodes


# Extract the articles included in the data set and articles not included in
# the dataset
citations_lit <- citation_nodes[citation_nodes$Origin == "literature", ]
citations_ref <- citation_nodes[citation_nodes$Origin == "reference", ]

# Create article strings (document title, reference information and abstract
# separated by '|')
citations_lit$Article <- paste(toupper(citations_lit$DocumentTitle), " | ", 
    citations_lit$FullReference, " | ", citations_lit$Abstract)

Included in the dataset

These papers were included in the 600 records downloaded from the Web of Science.

# Sort citations_lit by TimesCited, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-TimesCited)), ]
# Extract top 25
top_lit <- head(citations_lit, 25)
# Sort by InDegree, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-InDegree)), ]
# Add to list of top 25 most cited papers
top_lit <- rbind(top_lit, head(citations_lit, 25))
# Sort by PageRank, decreasing
citations_lit <- citations_lit[with(citations_lit, order(-PageRank)), ]
# Add to list of most cited and highest InDegree papers
top_lit <- rbind(top_lit, head(citations_lit, 25))
# Remove duplicates
top_lit <- top_lit[!duplicated(top_lit[, "FullReference"]), ]
# Sort top_lit by InDegree, break ties by TimesCited, then PageRank.
top_lit <- top_lit[with(top_lit, order(-InDegree, -TimesCited, -PageRank)), 
    ]
# Print list
knitr::kable(top_lit[, c("Article", "InDegree", "TimesCited", "PageRank")])

	Article	InDegree	TimesCited	PageRank
31810	DIFFERENTIATION OF DIGESTIVE SYSTEM CANCERS BY USING SERUM PROTEIN-BASED SURFACE-ENHANCED RAMAN SPECTROSCOPY \| LIN J, 2017, J RAMAN SPECTROSC, V48, P16, DOI 10.1002/JRS.4982 \| The aim of this study is to develop a more robust surface-enhanced Raman spectroscopy method for the simultaneous differentiation of two or more different types of cancer. For this, the periodical reversal of current directions is imposed on electrophoresis for the fast purification of serum proteins, and the serum proteins were then mixed with silver nanoparticles for surface-enhanced Raman spectroscopy spectral measurement. The Raman spectra of serum proteins from healthy subjects (n=85) and three types of digestive system cancer - colorectal cancer (n=109), gastric cancer (n=133), and liver cancer (n=38) - were measured for analyses. Principal component analysis and linear discriminant analysis yield diagnostic sensitivities of 94.5, 97.0, and 89.1% and specificities of 98.1, 92.7, and 99.2% in the differentiation of colorectal, gastric, and liver cancers, respectively. This work marks a promising step toward the potential applications of Raman spectroscopic blood analysis in multi-type cancer screening. Copyright (c) 2016 John Wiley & Sons, Ltd.	2	3	4.14e-05
4109	PREDICTING MALIGNANT NODULES FROM SCREENING CT SCANS \| HAWKINS S, 2016, J THORAC ONCOL, V11, P2120, DOI 10.1016/J.JTHO.2016.07.002 \| Objectives: The aim of this study was to determine whether quantitative analyses (“radiomics”) of low-dose computed tomography lung cancer screening images at baseline can predict subsequent emergence of cancer. Methods: Public data from the National Lung Screening Trial (ACRIN 6684) were assembled into two cohorts of 104 and 92 patients with screen-detected lung cancer and then matched with cohorts of 208 and 196 screening subjects with benign pulmonary nodules. Image features were extracted from each nodule and used to predict the subsequent emergence of cancer. Results: The best models used 23 stable features in a random forests classifier and could predict nodules that would become cancerous 1 and 2 years hence with accuracies of 80% (area under the curve 0.83) and 79% (area under the curve 0.75), respectively. Radiomics outperformed the Lung Imaging Reporting and Data System and volume-only approaches. The performance of the McWilliams risk assessment model was commensurate. Conclusions: The radiomics of lung cancer screening computed tomography scans at baseline can be used to assess risk for development of cancer. (C) 2016 International Association for the Study of Lung Cancer. Published by Elsevier Inc. All rights reserved.	1	6	4.09e-05
46310	INTRATUMOR PARTITIONING AND TEXTURE ANALYSIS OF DYNAMIC CONTRAST-ENHANCED (DCE)-MRI IDENTIFIES RELEVANT TUMOR SUBREGIONS TO PREDICT PATHOLOGICAL RESPONSE OF BREAST CANCER TO NEOADJUVANT CHEMOTHERAPY \| WU J, 2016, J MAGN RESON IMAGING, V76, P1107, DOI 10.1002/JMRI.25279 \| PurposeTo predict pathological response of breast cancer to neoadjuvant chemotherapy (NAC) based on quantitative, multiregion analysis of dynamic contrast enhancement magnetic resonance imaging (DCE-MRI). Materials and MethodsIn this Institutional Review Board-approved study, 35 patients diagnosed with stage II/III breast cancer were retrospectively investigated using 3T DCE-MR images acquired before and after the first cycle of NAC. First, principal component analysis (PCA) was used to reduce the dimensionality of the DCE-MRI data with high temporal resolution. We then partitioned the whole tumor into multiple subregions using k-means clustering based on the PCA-defined eigenmaps. Within each tumor subregion, we extracted four quantitative Haralick texture features based on the gray-level co-occurrence matrix (GLCM). The change in texture features in each tumor subregion between pre- and during-NAC was used to predict pathological complete response after NAC. ResultsThree tumor subregions were identified through clustering, each with distinct enhancement characteristics. In univariate analysis, all imaging predictors except one extracted from the tumor subregion associated with fast washout were statistically significant (P < 0.05) after correcting for multiple testing, with area under the receiver operating characteristic (ROC) curve (AUC) or AUCs between 0.75 and 0.80. In multivariate analysis, the proposed imaging predictors achieved an AUC of 0.79 (P = 0.002) in leave-one-out cross-validation. This improved upon conventional imaging predictors such as tumor volume (AUC = 0.53) and texture features based on whole-tumor analysis (AUC = 0.65). ConclusionThe heterogeneity of the tumor subregion associated with fast washout on DCE-MRI predicted pathological response to NAC in breast cancer. J. Magn. Reson. Imaging 2016;44:1107-1115.	1	4	4.01e-05
35110	MULTI-CROP CONVOLUTIONAL NEURAL NETWORKS FOR LUNG NODULE MALIGNANCY SUSPICIOUSNESS CLASSIFICATION \| SHEN W, 2017, PATTERN RECOGN, V61, P663, DOI 10.1016/J.PATCOG.2016.05.029 \| We investigate the problem of lung nodule malignancy suspiciousness (the likelihood of nodule malignancy) classification using thoracic Computed Tomography (CT) images. Unlike traditional studies primarily relying on cautious nodule segmentation and time-consuming feature extraction, we tackle a more challenging task on directly modeling raw nodule patches and building an end-to-end machine learning architecture for classifying lung nodule malignancy suspiciousness. We present a Multi-crop Convolutional Neural Network (MC-CNN) to automatically extract nodule salient information by employing a novel multi-crop pooling strategy which crops different regions from convolutional feature maps and then applies max-pooling different times. Extensive experimental results show that the proposed method not only achieves state-of-the-art nodule suspiciousness classification performance, but also effectively characterizes nodule semantic attributes (subtlety and margin) and nodule diameter which are potentially helpful in modeling nodule malignancy. (C) 2016 Elsevier Ltd. All rights reserved.	1	2	4.04e-05
34710	LARGE SCALE DEEP LEARNING FOR COMPUTER AIDED DETECTION OF MAMMOGRAPHIC LESIONS \| KOOI T, 2017, MED IMAGE ANAL, V35, P303, DOI 10.1016/J.MEDIA.2016.07.007 \| Recent advances in machine learning yielded new techniques to train deep neural networks, which resulted in highly successful applications in many pattern recognition tasks such as object detection and speech recognition. In this paper we provide a head-to-head comparison between a state-of-the art in mammography CAD system, relying on a manually designed feature set and a Convolutional Neural Network (CNN), aiming for a system that can ultimately read mammograms independently. Both systems are trained on a large data set of around 45,000 images and results show the CNN outperforms the traditional CAD system at low sensitivity and performs comparable at high sensitivity. We subsequently investigate to what extent features such as location and patient information and commonly used manual features can still complement the network and see improvements at high specificity over the CNN especially with location and context features, which contain information not available to the CNN. Additionally, a reader study was performed, where the network was compared to certified screening radiologists on a patch level and we found no significant difference between the network and the readers. (C) 2016 Elsevier B.V. All rights reserved.	1	2	4.01e-05
16647	CORRELATION OF LIPIDOMIC COMPOSITION OF CELL LINES AND TISSUES OF BREAST CANCER PATIENTS USING HYDROPHILIC INTERACTION LIQUID CHROMATOGRAPHY/ELECTROSPRAY IONIZATION MASS SPECTROMETRY AND MULTIVARIATE DATA ANALYSIS \| CIFKOVA E, 2017, RAPID COMMUN MASS SP, V31, P253, DOI 10.1002/RCM.7791 \| RationaleThe goal of this work is the comparison of differences in the lipidomic compositions of human cell lines derived from normal and cancerous breast tissues, and tumor vs. normal tissues obtained after the surgery of breast cancer patients. MethodsHydrophilic interaction liquid chromatography/electrospray ionization mass spectrometry (HILIC/ESI-MS) using the single internal standard approach and response factors is used for the determination of relative abundances of individual lipid species from five lipid classes in total lipid extracts of cell lines and tissues. The supplementary information on the fatty acyl composition is obtained by gas chromatography/mass spectrometry (GC/MS) of fatty acid methyl esters. Multivariate data analysis (MDA) methods, such as nonsupervised principal component analysis (PCA), hierarchical clustering analysis (HCA) and supervised orthogonal partial least-squares discriminant analysis (OPLS-DA), are used for the visualization of differences between normal and tumor samples and the correlation of similarity between cell lines and tissues either for tumor or normal samples. ResultsMDA methods are used for differentiation of sample groups and also for identification of the most up- and downregulated lipids in tumor samples in comparison to normal samples. Observed changes are subsequently generalized and correlated with data from tumor and normal tissues of breast cancer patients. In total, 123 lipid species are identified based on their retention behavior in HILIC and observed ions in ESI mass spectra, and relative abundances are determined. ConclusionsMDA methods are applied for a clear differentiation between tumor and normal samples both for cell lines and tissues. The most upregulated lipids are phospholipids (PL) with a low degree of unsaturation (e.g., 32:1 and 34:1) and also some highly polyunsaturated PL (e.g., 40:6), while the most downregulated lipids are PL containing polyunsaturated fatty acyls (e.g., 20:4), plasmalogens and ether lipids. Copyright (c) 2016 John Wiley & Sons, Ltd.	1	1	4.04e-05
8125	ESTIMATING PERSONALIZED DIAGNOSTIC RULES DEPENDING ON INDIVIDUALIZED CHARACTERISTICS \| LIU Y, 2017, STAT MED, V36, P1099, DOI 10.1002/SIM.7182 \| There is an increasing demand for personalization of disease screening based on assessment of patient risk and other characteristics. For example, in breast cancer screening, advanced imaging technologies have made it possible to move away from one-size-fits-all’ screening guidelines to targeted risk-based screening for those who are in need. Because diagnostic performance of various imaging modalities may vary across subjects, applying the most accurate modality to the patients who would benefit the most requires personalized strategy. To address these needs, we propose novel machine learning methods to estimate personalized diagnostic rules for medical screening or diagnosis by maximizing a weighted combination of sensitivity and specificity across subgroups of subjects. We first develop methods that can be applied when competing modalities or screening strategies that are observed on the same subject (paired design). Next, we present methods for studies where not all subjects receive both modalities (unpaired design). We study theoretical properties including consistency and risk bound of the personalized diagnostic rules and conduct simulation studies to examine performance of the proposed methods. Lastly, we analyze data collected from a brain imaging study of Parkinson’s disease using positron emission tomography and diffusion tensor imaging with paired and unpaired designs. Our results show that in some cases, a personalized modality assignment is estimated to improve empirical area under the receiver operating curve compared with a one-size-fits-all’ assignment strategy. Copyright (c) 2016 John Wiley & Sons, Ltd.	1	1	4.01e-05
34310	DIFFERENTIATING TUMOR HETEROGENEITY IN FORMALIN-FIXED PARAFFIN-EMBEDDED (FFPE) PROSTATE ADENOCARCINOMA TISSUES USING PRINCIPAL COMPONENT ANALYSIS OF MATRIX-ASSISTED LASER DESORPTION/IONIZATION IMAGING MASS SPECTRAL DATA \| PANDERI I, 2017, RAPID COMMUN MASS SP, V31, P160, DOI 10.1002/RCM.7776 \| RationaleMany patients with adenocarcinoma of the prostate present with advanced and metastatic cancer at the time of diagnosis. There is an urgent need to detect biomarkers that will improve the diagnosis and prognosis of this disease. Matrix-assisted laser desorption/ionization imaging mass spectrometry (MALDI-IMS) is playing a key role in cancer research and it can be useful to unravel the molecular profile of prostate cancer biopsies. MethodsMALDI imaging data sets are highly complex and their interpretation requires the use of multivariate statistical methods. In this study, MALDI-IMS technology, sequential principal component analysis (PCA) and two-dimensional (2-D) peak distribution tests were employed to investigate tumor heterogeneity in formalin-fixed paraffin-embedded (FFPE) prostate cancer biopsies. ResultsMultivariate statistics revealed a number of mass ion peaks obtained from different tumor regions that were distinguishable from the adjacent normal regions within a given specimen. These ion peaks have been used to generate ion images and visualize the difference between tumor and normal regions. Mass peaks at m/z 3370, 3441, 3447 and 3707 exhibited stronger ion signals in the tumor regions. ConclusionsThis study reports statistically significant mass ion peaks unique to tumor regions in adenocarcinoma of the prostate and adds to the clinical utility of MALDI-IMS for analysis of FFPE tissue at a molecular level that supersedes all other standard histopathologic techniques for diagnostic purposes used in the current clinical practice. Copyright (c) 2016 John Wiley & Sons, Ltd.	1	1	4.00e-05
9812	DISSECTING TARGET TOXIC TISSUE AND TISSUE SPECIFIC RESPONSES OF IRINOTECAN IN RATS USING METABOLOMICS APPROACH \| YAO Y, 2017, FRONT PHARMACOL, V8, P, DOI 10.3389/FPHAR.2017.00122 \| As an anticancer agent, irinotecan (CPT-11) has been widely applied in clinical, especially in the treatment of colorectal cancer. However, its clinical use has long been limited by the side effects and potential tissue toxicity. To discriminate the target toxic tissues and dissect the specific response of target tissues after CPT-11 administration in rats, untargeted metabolomic study was conducted. First, differential metabolites between CPT-11 treated group and control group in each tissue were screened out. Then, based on fold changes of these differential metabolites, principal component analysis and hierarchical cluster analysis were performed to visualize the degree and specificity of the influences of CPT-11 on the metabolic profiles of nine tissues. Using this step-wise method, ileum, jejunum, and liver were finally recognized as target toxic tissues. Furthermore, tissue specific responses of liver, ileum, and jejunum to CPT-11 were dissected and specific differential metabolites were screened out. Perturbations in Krebs cycle, amino acid, purine and bile acid metabolism were observed in target toxic tissues. In conclusion, our study put forward a new approach to dissect target toxic tissues and tissue specific responses of CPT-11 using metabolomics.	1	1	3.99e-05
55310	CROWDSOURCING BASED SOCIAL MEDIA DATA ANALYSIS OF URBAN EMERGENCY EVENTS \| XU Z, 2017, MULTIMED TOOLS APPL, V76, P11567, DOI 10.1007/S11042-015-2731-1 \| An urban emergency event requires an immediate reaction or assistance for an emergency situation. With the popularity of the World Wide Web, the internet is becoming a major information provider and disseminator of emergency events and this is due to its real-time, open, and dynamic features. However, faced with the huge, disordered and continuous nature of web resources, it is impossible for people to efficiently recognize, collect and organize these events. In this paper, a crowdsourcing based burst computation algorithm of an urban emergency event is developed in order to convey information about the event clearly and to help particular social groups or governments to process events effectively. A definition of an urban emergency event is firstly introduced. This serves as the foundation for using web resources to compute the burst power of events on the web. Secondly, the different temporal features of web events are developed to provide the basic information for the proposed computation algorithm. Moreover, the burst power is presented to integrate the above temporal features of an event. Empirical experiments on real datasets show that the burst power can be used to analyze events.	0	11	3.93e-05
34610	CHARACTERIZATION OF PET/CT IMAGES USING TEXTURE ANALYSIS: THE PAST, THE PRESENT… ANY FUTURE? \| HATT M, 2017, EUR J NUCL MED MOL I, V44, P151, DOI 10.1007/S00259-016-3427-0 \| After seminal papers over the period 2009 - 2011, the use of texture analysis of PET/CT images for quantification of intratumour uptake heterogeneity has received increasing attention in the last 4 years. Results are difficult to compare due to the heterogeneity of studies and lack of standardization. There are also numerous challenges to address. In this review we provide critical insights into the recent development of texture analysis for quantifying the heterogeneity in PET/CT images, identify issues and challenges, and offer recommendations for the use of texture analysis in clinical research. Numerous potentially confounding issues have been identified, related to the complex workflow for the calculation of textural features, and the dependency of features on various factors such as acquisition, image reconstruction, preprocessing, functional volume segmentation, and methods of establishing and quantifying correspondences with genomic and clinical metrics of interest. A lack of understanding of what the features may represent in terms of the underlying pathophysiological processes and the variability of technical implementation practices makes comparing results in the literature challenging, if not impossible. Since progress as a field requires pooling results, there is an urgent need for standardization and recommendations/guidelines to enable the field to move forward. We provide a list of correct formulae for usual features and recommendations regarding implementation. Studies on larger cohorts with robust statistical analysis and machine learning approaches are promising directions to evaluate the potential of this approach.	0	4	3.93e-05
5659	BEYOND THE TURK: ALTERNATIVE PLATFORMS FOR CROWDSOURCING BEHAVIORAL RESEARCH \| PEER E, 2017, J EXP SOC PSYCHOL, V70, P153, DOI 10.1016/J.JESP.2017.01.006 \| The success of Amazon Mechanical Turk (MTurk) as an online research platform has come at a price: MTurk has suffered from slowing rates of population replenishment, and growing participant non-naivety. Recently, a number of alternative platforms have emerged, offering capabilities similar to MTurk but providing access to new and more naive populations. After surveying several options, we empirically examined two such platforms, CrowdFlower (CF) and Prolific Academic (ProA). In two studies, we found that participants on both platforms were more naive and less dishonest compared to MTurk participants. Across the three platforms, CF provided the best response rate, but CF participants failed more attention-check questions and did not reproduce known effects replicated on ProA and MTurk. Moreover, ProA participants produced data quality that was higher than CF’s and comparable to MTurk’s. ProA and CF participants were also much more diverse than participants from MTurk. (C) 2017 Elsevier Inc. All rights reserved.	0	4	3.93e-05
45010	MODELLING THE CYTOTOXIC ACTIVITY OF PYRAZOLO-TRIAZOLE HYBRIDS USING DESCRIPTORS CALCULATED FROM THE OPEN SOURCE TOOL PADEL-DESCRIPTOR \| AMIN SA, 2016, J TAIBAH UNIV SCI, V10, P896, DOI 10.1016/J.JTUSCI.2016.04.009 \| In this study, we developed QSAR models for the anti-proliferative activity of pyrazolo-triazole hybrids [(1-benzy1-1H-1,2,3triazol-4-y1)(1,3-diphenyl-1H-pyrazol-4-y1) methanone] on human brain cancer (U87MG), lung cancer (A549), prostate cancer (PC-3), and colon cancer (HT-29) cell lines. We employed K-means cluster analysis to split the data sets. Statistically robust models were generated [pIC(50) (U87MG): R= 0.873, Q(2)=0.554, R-pred(2), = 0.866; pIC(50) (A549): R= 0.879, Q(2) = 0.637, R-pred(2) = 0.858; pIC(50) (PC3): R = 0.953; Q(2)=0.850; R-pred(2) = 0.796;pIC(50) (HT-29): R = 0.962, Q(2) = 0.891; R-pred(2) = 0.707]. The reliability of these models was confirmed by acceptable validation parameters, and these models also satisfied the Golbraikh and Tropsha acceptable model criteria. The QSAR study highlighted the atomic feature and molecular descriptors, information content descriptors, and topological and constitutional descriptors that affect anti-cancer activity. (C) 2016 The Authors. Production and hosting by Elsevier B.V.	0	3	3.93e-05
18816	DESIGN OF EFFICIENT COMPUTATIONAL WORKFLOWS FOR IN SILICO DRUG REPURPOSING \| VANHAELEN Q, 2017, DRUG DISCOV TODAY, V22, P210, DOI 10.1016/J.DRUDIS.2016.09.019 \| Here, we provide a comprehensive overview of the current status of in silico repurposing methods by establishing links between current technological trends, data availability and characteristics of the algorithms used in these methods. Using the case of the computational repurposing of fasudil as an alternative autophagy enhancer, we suggest a generic modular organization of a repurposing workflow. We also review 3D structure based, similarity-based, inference-based and machine learning (ML)-based methods. We summarize the advantages and disadvantages of these methods to emphasize three current technical challenges. We finish by discussing current directions of research, including possibilities offered by new methods, such as deep learning.	0	2	3.93e-05
2465	PAN-CANCER IMMUNOGENOMIC ANALYSES REVEAL GENOTYPE-IMMUNOPHENOTYPE RELATIONSHIPS AND PREDICTORS OF RESPONSE TO CHECKPOINT BLOCKADE \| CHAROENTONG P, 2017, CELL REP, V18, P248, DOI 10.1016/J.CELREP.2016.12.019 \| The Cancer Genome Atlas revealed the genomic landscapes of human cancers. In parallel, immunotherapy is transforming the treatment of advanced cancers. Unfortunately, the majority of patients do not respond to immunotherapy, making the identification of predictive markers and the mechanisms of resistance an area of intense research. To increase our understanding of tumor-immune cell interactions, we characterized the intratumoral immune landscapes and the cancer antigenomes from 20 solid cancers and created The Cancer Immunome Atlas (https://tcia.at/). Cellular characterization of the immune infiltrates showed that tumor genotypes determine immunophenotypes and tumor escape mechanisms. Using machine learning, we identified determinants of tumor immunogenicity and developed a scoring scheme for the quantification termed immunophenoscore. The immunophenoscore was a superior predictor of response to anti-cytotoxic T lymphocyte antigen-4 (CTLA-4) and anti-programmed cell death protein 1 (anti-PD-1) antibodies in two independent validation cohorts. Our findings and this resource may help inform cancer immunotherapy and facilitate the development of precision immuno-oncology.	0	2	3.93e-05
3531	FUZZY CLUSTER BASED NEURAL NETWORK CLASSIFIER FOR CLASSIFYING BREAST TUMORS IN ULTRASOUND IMAGES \| SINGH BK, 2016, EXPERT SYST APPL, V66, P114, DOI 10.1016/J.ESWA.2016.09.006 \| The performance of supervised classification algorithms is highly dependent on the quality of training data. Ambiguous training patterns may misguide the classifier leading to poor classification performance. Further, the manual exploration of class labels is an expensive and time consuming process. An automatic method is needed to identify noisy samples in the training data to improve the decision making process. This article presents a new classification technique by combining an unsupervised learning technique (i.e. fuzzy c-means clustering (FCM)) and supervised learning technique (i.e. back-propagation artificial neural network (BPANN)) to categorize benign and malignant tumors in breast ultrasound images. Unsupervised learning is employed to identify ambiguous examples in the training data. Experiments were conducted on 178 B-mode breast ultrasound images containing 88 benign and 90 malignant cases on MATLAB software platform. A total of 457 features were extracted from ultrasound images followed by feature selection to determine the most significant features. Accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC) and Mathew’s correlation coefficient (MCC) were used to access the performance of different classifiers. The result shows that the proposed approach achieves classification accuracy of 95.862% when all the 457 features were used for classification. However, the accuracy is reduced to 94.138% when only 19 most relevant features selected by multi-criterion feature selection approach were used for classification. The results were discussed in light of some recently reported studies. The empirical results suggest that eliminating doubtful training examples can improve the decision making performance of expert systems. 
The proposed approach show promising results and need further evaluation in other applications of expert and intelligent systems. (C) 2016 Elsevier Ltd. All rights reserved.	0	2	3.93e-05
36910	CHEMICAL COMPOSITION AND SOURCE APPORTIONMENT OF PM2.5 DURING CHINESE SPRING FESTIVAL AT XINXIANG, A HEAVILY POLLUTED CITY IN NORTH CHINA: FIREWORKS AND HEALTH RISKS \| FENG J, 2016, ATMOS RES, V182, P176, DOI 10.1016/J.ATMOSRES.2016.07.028 \| Twenty-four PM2.5 samples were collected at a suburban site of Xinxiang during Chinese Spring Festival (SF) in 2015. 10 water-soluble ions, 19 trace elements and 8 fractions of carbonaceous species in PM2.5 were analyzed. Potential sources of PM2.5 were quantitatively apportioned using principal component analysis (PCA)-multivariate linear regressions (MLR). The threat of heavy metals in PM2.5 was assessed using incremental lifetime cancer risk (ILCR). During the whole period, serious regional haze pollution persisted, the average concentration of PM2.5 was 111 +/- 54 mu g m(-3), with 95.8% and 79.2% of the daily samples exhibiting higher PM2.5 concentrations than the national air quality standard I and II. Chemical species declined due to holiday effect with the exception of K, Fe, Mg. Al and K+, Cl-, which increased on Chinese New Year (CNY)’s Eve and Lantern Festival in 2015, indicating the injection of firework burning particles in certain short period. PM2.5 mass closure showed that secondary inorganic species were the dominant fractions of PM2.5 over the entire sampling (37.3%). 72-hour backward trajectory clusters indicated that most serious air pollution occurred when air masses transported from the Inner Mongolia, Shanxi and Zhengzhou. Health risk assessment revealed that noncancerous effects of heavy metals in PM2.5 of Xinxiang were unlikely happened, while lifetime cancer risks of heavy metals obviously exceeded the threshold, which might have a cancer risk for residents in Xinxiang. This study provided detailed composition data and first comprehensive analysis of PM2.5 during the Spring Festival period in Xinxiang. (C) 2016 Published by Elsevier B.V.	0	2	3.93e-05
41110	FEATURE SELECTION METHODS FOR BIG DATA BIOINFORMATICS: A SURVEY FROM THE SEARCH PERSPECTIVE \| WANG L, 2016, METHODS, V111, P21, DOI 10.1016/J.YMETH.2016.08.014 \| This paper surveys main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures. (C) 2016 Elsevier Inc. All rights reserved.	0	2	3.93e-05
4333	NANOSCOPIC TUMOR TISSUE DISTRIBUTION OF PLATINUM AFTER INTRAPERITONEAL ADMINISTRATION IN A XENOGRAFT MODEL OF OVARIAN CANCER \| CARLIER C, 2016, J PHARMACEUT BIOMED, V131, P256, DOI 10.1016/J.JPBA.2016.09.004 \| There is increasing interest in the treatment of advanced stage ovarian cancer (OC) using intraperitoneal (IP) delivery of platinum (Pt)-based chemotherapy. The antitumor efficacy of IP chemotherapy is determined by efficient tumor tissue penetration. Although it is assumed that Pt penetration is limited to a few millimeters after IP delivery, little is known on the distribution of Pt in different tumor compartments at the ultrastructural level following IP administration. Here, using synchrotron radiation X-ray fluorescence spectrometry (SR-XRF) and laser ablation inductively coupled plasma-mass spectrometry (LA-ICP-MS), Pt distribution and penetration in OC peritoneal xenografts were determined at nanometer scale after IP chemoperfusion of dsplatin at 37-38 degrees C or 40-41 degrees C (hyperthermic). Using principal component analysis (PCA) the presence of phosphorus, manganese, calcium, zinc, iron, bromine, and sulfur was correlated with the distribution of Pt, while k-means analysis was used to quantify the amount of Pt in weight% in tumor stroma and in tumor cells. The results showed a heterogeneous distribution of Pt throughout the tumor, with an accumulation in the extracellular matrix. LA-ICP-MS mappings indicated significantly higher concentrations of Pt (P = 0.0062) after hyperthermic chemoperfusion of cisplatin, while SR-XRF demonstrated a deeper tissue Pt penetration after hyperthermic treatment Using PCA, it was showed that Pt co-localizes with bromine and sulfur. No differences were observed in Pt distribution regarding tumor cells and stroma, when comparing normo- vs. 
hyperthermic treatment In conclusion, SR-XRF and LA-ICP-MS are suitable and highly sensitive techniques to analyze the penetration depth and distribution of Pt-based drugs after IP administration. To the best of our knowledge, this is the first experiment in which the distribution of Pt is analyzed at the cellular level after IP administration of cisplatin. (C) 2016 Elsevier B.V. All rights reserved.	0	2	3.93e-05
4729	BIG DATA AND MACHINE LEARNING IN RADIATION ONCOLOGY: STATE OF THE ART AND FUTURE PROSPECTS \| BIBAULT J, 2016, CANCER LETT, V382, P110, DOI 10.1016/J.CANLET.2016.05.033 \| Precision medicine relies on an increasing amount of heterogeneous data. Advances in radiation oncology, through the use of CT Scan, dosimetry and imaging performed before each fraction, have generated a considerable flow of data that needs to be integrated. In the same time, Electronic Health Records now provide phenotypic profiles of large cohorts of patients that could be correlated to this information. In this review, we describe methods that could be used to create integrative predictive models in radiation oncology. Potential uses of machine learning methods such as support vector machine, artificial neural networks, and deep learning are also discussed. (C) 2016 Elsevier Ireland Ltd. All rights reserved.	0	2	3.93e-05
48410	A COMPUTATIONAL APPROACH FOR DETECTING PIGMENTED SKIN LESIONS IN MACROSCOPIC IMAGES \| OLIVEIRA RB, 2016, EXPERT SYST APPL, V61, P53, DOI 10.1016/J.ESWA.2016.05.017 \| Skin cancer is considered one of the most common types of cancer in several countries and its incidence rate has increased in recent years. Computational methods have been developed to assist dermatologists in early diagnosis of skin cancer. Computational analysis of skin lesion images has become a challenging research area due to the difficulty in discerning some types of skin lesions. A novel computational approach is presented for extracting skin lesion features from images based on asymmetry, border, colour and texture analysis, in order to diagnose skin lesion types. The approach is based on an anisotropic diffusion filter, an active contour model without edges and a support vector machine. Experiments were performed regarding the segmentation and classification of pigmented skin lesions in macroscopic images, with the results obtained being very promising. (C) 2016 Elsevier Ltd. All rights reserved.	0	2	3.93e-05
49010	IDENTIFICATION AND COMPARATIVE ORIDONIN METABOLISM IN DIFFERENT SPECIES LIVER MICROSOMES BY USING UPLC-TRIPLE-TOF-MS/MS AND PCA \| MA Y, 2016, ANAL BIOCHEM, V511, P61, DOI 10.1016/J.AB.2016.08.004 \| Oridonin (ORI) is an active natural ent-kaurene diterpenoid ingredient with notable anti-cancer and anti inflammation activities. Currently, a strategy was developed to identify metabolites and to assess the metabolic profiles of ORI in vitro using ultra-high-performance liquid chromatography-Triple/time-of-flight mass spectrometry (UPLC-Triple-TOF-MS/MS). Meanwhile, the metabolism differences of ORI in the liver microsomes of four different species were investigated using a principal component analysis (PCA) based on the metabolite absolute peak area values as the variables. Based on the proposed methods, 27 metabolites were structurally characterized. The results indicate that ORI is universally metabolized in vitro, and the metabolic pathway mainly includes dehydration, hydroxylation, di-hydroxylation, hydrogenation, decarboxylation, and ketone formation. Overall, there are obvious inter species differences in types and amounts of ORI metabolites in the four species. These results will provide basic data for future pharmacological and toxicological studies of ORI and for other ent-kauranes diterpenoids. Meanwhile, studying the ORI metabolic differences helps to select the proper animal model for further pharmacology and toxicological assessment. (C) 2016 Elsevier Inc. All rights reserved.	0	2	3.93e-05
54410	GAMIFYING COLLECTIVE HUMAN BEHAVIOR WITH GAMEFUL DIGITAL RHETORIC \| SAKAMOTO M, 2017, MULTIMED TOOLS APPL, V76, P12539, DOI 10.1007/S11042-016-3665-Y \| This paper presents a design framework called Gameful Digital Rhetoric that offers a set of design frames for designing meaningful digital rhetoric that guides collective human behavior in ubiquitous social digital services, such as crowdsourcing. The framework is extracted from our experiences with building and developing crowdsourcing case studies. From a video game perspective, the paper has categorized our experiences into seven design frames to encourage collective human activity. This approach is different from traditional gamification, as it focuses more on the semiotic aspect of virtuality in the video games, not game mechanics; it helps to enhance the current meaning of the real world for changing human attitude and behavior through various socio-cultural and psychological techniques. Therefore, it is possible to discuss respective design frames for enhancing crowdsourcing by incrementally adding new digital rhetoric. The paper also presents how Gameful Digital Rhetoric allows us to guide collective human behavior in Collectivist Crowdsourcing; the design is explained through a scenario-based and experiment-based analyses. The paper then discusses how to design collective human behavior with Gameful Digital Rhetoric and how to identify the design’s potential pitfalls. Our approach offers useful insights into the design of future social digital services that influence collective human behavior.	0	2	3.93e-05
55110	RULE-GUIDED HUMAN CLASSIFICATION OF VOLUNTEERED GEOGRAPHIC INFORMATION \| ALI AL, 2017, ISPRS J PHOTOGRAMM, V127, P3, DOI 10.1016/J.ISPRSJPRS.2016.06.003 \| During the last decade, web technologies and location sensing devices have evolved generating a form of crowdsourcing known as Volunteered Geographic Information (VGI). VGI acted as a platform of spatial data collection, in particular, when a group of public participants are involved in collaborative mapping activities: they work together to collect, share, and use information about geographic features. VGI exploits participants’ local knowledge to produce rich data sources. However, the resulting data inherits problematic data classification. In VGI projects, the challenges of data classification are due to the following: (i) data is likely prone to subjective classification, (ii) remote contributions and flexible contribution mechanisms in most projects, and (iii) the uncertainty of spatial data and non-strict definitions of geographic features. These factors lead to various forms of problematic classification: inconsistent, incomplete, and imprecise data classification. This research addresses classification appropriateness. Whether the classification of an entity is appropriate or inappropriate is related to quantitative and/or qualitative observations. Small differences between observations may be not recognizable particularly for non-expert participants. Hence, in this paper, the problem is tackled by developing a rule-guided classification approach. This approach exploits data mining techniques of Association Classification (AC) to extract descriptive (qualitative) rules of specific geographic features. The rules are extracted based on the investigation of qualitative topological relations between target features and their context. Afterwards, the extracted rules are used to develop a recommendation system able to guide participants to the most appropriate classification. 
The approach proposes two scenarios to guide participants towards enhancing the quality of data classification. An empirical study is conducted to investigate the classification of grass-related features like forest, garden, park, and meadow. The findings of this study indicate the feasibility of the proposed approach. (C) 2016 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.	0	2	3.93e-05
36010	ICAGES: INTEGRATED CANCER GENOME SCORE FOR COMPREHENSIVELY PRIORITIZING DRIVER GENES IN PERSONAL CANCER GENOMES \| DONG C, 2016, GENOME MED, V8, P, DOI 10.1186/S13073-016-0390-0 \| Cancer results from the acquisition of somatic driver mutations. Several computational tools can predict driver genes from population-scale genomic data, but tools for analyzing personal cancer genomes are underdeveloped. Here we developed iCAGES, a novel statistical framework that infers driver variants by integrating contributions from coding, non-coding, and structural variants, identifies driver genes by combining genomic information and prior biological knowledge, then generates prioritized drug treatment. Analysis on The Cancer Genome Atlas (TCGA) data showed that iCAGES predicts whether patients respond to drug treatment (P = 0.006 by Fisher’s exact test) and long-term survival (P = 0.003 from Cox regression).	0	1	3.93e-05
2912	SUBGROUPS OF CASTRATION-RESISTANT PROSTATE CANCER BONE METASTASES DEFINED THROUGH AN INVERSE RELATIONSHIP BETWEEN ANDROGEN RECEPTOR ACTIVITY AND IMMUNE RESPONSE \| YLITALO EB, 2017, EUR UROL, V71, P776, DOI 10.1016/J.EURURO.2016.07.033 \| Background: Novel therapies for men with castration-resistant prostate cancer (CRPC) are needed, particularly for cancers not driven by androgen receptor (AR) activation. Objectives: To identify molecular subgroups of PC bone metastases of relevance for therapy. Design, setting, and participants: Fresh-frozen bone metastasis samples from men with CRPC (n = 40), treatment-nai " ve PC (n = 8), or other malignancies (n = 12) were characterized using whole-genome expression profiling, multivariate principal component analysis (PCA), and functional enrichment analysis. Expression profiles were verified by reverse transcription-polymerase chain reaction (RT-PCR) in an extended set of bone metastases (n = 77) and compared to levels in malignant and adjacent benign prostate tissue from patients with localized disease (n = 12). Selected proteins were evaluated using immunohistochemistry. A cohort of PC patients (n = 284) diagnosed at transurethral resection with long follow-up was used for prognostic evaluation. Results and limitations: The majority of CRPC bone metastases (80%) was defined as AR driven based on PCA analysis and high expression of the AR, AR co-regulators (FOXA1, HOXB13), and AR-regulated genes (KLK2, KLK3, NKX3.1, STEAP2, TMPRSS2); 20% were non-AR-driven. Functional enrichment analysis indicated high metabolic activity and low immune responses in AR-driven metastases. Accordingly, infiltration of CD3+ and CD68+ cells was lower in AR-driven than in non-AR-driven metastases, and tumor cell HLA class I ABC immunoreactivity was inversely correlated with nuclear AR immunoreactivity. 
RT-PCR analysis showed low MHC class I expression (HLA-A, TAP1, and PSMB9 mRNA) in PC bone metastases compared to benign and malignant prostate tissue and bone metastases of other origins. In primary PC, low HLA class I ABC immunoreactivity was associated with high Gleason score, bone metastasis, and short cancer-specific survival. Limitations include the limited number of patients studied and the single metastasis sample studied per patient. Conclusions: Most CRPC bone metastases show high AR and metabolic activities and low immune responses. A subgroup instead shows low AR and metabolic activities, but high immune responses. Targeted therapy for these groups should be explored. Patient summary: We studied heterogeneities at a molecular level in bone metastasis samples obtained from men with castration-resistant prostate cancer. We found differences of possible importance for therapy selection in individual patients. (C) 2016 European Association of Urology. Published by Elsevier B. V.	0	1	3.93e-05
5851	UNTARGETED LC-HRMS-BASED METABOLOMICS FOR SEARCHING NEW BIOMARKERS OF PANCREATIC DUCTAL ADENOCARCINOMA: A PILOT STUDY \| RIOS PS, 2017, SLAS DISCOV, V22, P348, DOI 10.1177/1087057116671490 \| Pancreatic ductal adenocarcinoma is one of the most lethal tumors since it is usually detected at an advanced stage in which surgery and/or current chemotherapy have limited efficacy. The lack of sensitive and specific markers for diagnosis leads to a dismal prognosis. The purpose of this study is to identify metabolites in serum of pancreatic ductal adenocarcinoma patients that could be used as diagnostic biomarkers of this pathology. We used liquid chromatography-high-resolution mass spectrometry for a nontargeted metabolomics approach with serum samples from 28 individuals, including 16 patients with pancreatic ductal adenocarcinoma and 12 healthy controls. Multivariate statistical analysis, which included principal component analysis and partial least squares, revealed clear separation between the patient and control groups analyzed by liquid chromatography-high-resolution mass spectrometry using a nontargeted metabolomics approach. The metabolic analysis showed significantly lower levels of phospholipids in the serum from patients with pancreatic ductal adenocarcinoma compared with serum from controls. Our results suggest that the liquid chromatography-high-resolution mass spectrometry-based metabolomics approach provides a potent and promising tool for the diagnosis of pancreatic ductal adenocarcinoma patients using the specific metabolites identified as novel biomarkers that could be used for an earlier detection and treatment of these patients.	0	1	3.93e-05
7739	WESTERN DIETARY PATTERN INCREASES, AND PRUDENT DIETARY PATTERN DECREASES, RISK OF INCIDENT DIVERTICULITIS IN A PROSPECTIVE COHORT STUDY \| STRATE LL, 2017, GASTROENTEROLOGY, V152, P1023, DOI 10.1053/J.GASTRO.2016.12.038 \| BACKGROUND & AIMS: Dietary fiber is implicated as a risk factor for diverticulitis. Analyses of dietary patterns may provide information on risk beyond those of individual foods or nutrients. We examined whether major dietary patterns are associated with risk of incident diverticulitis. METHODS: We performed a prospective cohort study of 46,295 men who were free of diverticulitis and known diverticulosis in 1986 (baseline) using data from the Health Professionals Follow-Up Study. Each study participant completed a detailed medical and dietary questionnaire at baseline. We sent supplemental questionnaires to men reporting incident diverticulitis on biennial follow-up questionnaires. We assessed diet every 4 years using a validated food frequency questionnaire. Western (high in red meat, refined grains, and high-fat dairy) and prudent (high in fruits, vegetables, and whole grains) dietary patterns were identified using principal component analysis. Follow-up time accrued from the date of return of the baseline questionnaire in 1986 until a diagnosis of diverticulitis, diverticulosis or diverticular bleeding; death; or December 31, 2012. The primary end point was incident diverticulitis. RESULTS: During 894,468 person years of follow-up, we identified 1063 incident cases of diverticulitis. After adjustment for other risk factors, men in the highest quintile of Western dietary pattern score had a multivariate hazard ratio of 1.55 (95% CI, 1.20-1.99) for diverticulitis compared to men in the lowest quintile. High vs low prudent scores were associated with decreased risk of diverticulitis (multivariate hazard ratio, 0.74; 95% CI, 0.60-0.91). 
The association between dietary patterns and diverticulitis was predominantly attributable to intake of fiber and red meat. CONCLUSIONS: In a prospective cohort study of 46,295 men, a Western dietary pattern was associated with increased risk of diverticulitis, and a prudent pattern was associated with decreased risk. These data can guide dietary interventions for the prevention of diverticulitis.	0	1	3.93e-05
8033	A SURVEY ON SEMI-SUPERVISED FEATURE SELECTION METHODS \| SHEIKHPOUR R, 2017, PATTERN RECOGN, V64, P141, DOI 10.1016/J.PATCOG.2016.11.003 \| Feature selection is a significant task in data mining and machine learning applications which eliminates irrelevant and redundant features and improves learning performance. In many real-world applications, collecting labeled data is difficult, while abundant unlabeled data are easily accessible. This motivates researchers to develop semi-supervised feature selection methods which use both labeled and unlabeled data to evaluate feature relevance. However, till-to-date, there is no comprehensive survey covering the semi supervised feature selection methods. In this paper, semi-supervised feature selection methods are fully investigated and two taxonomies of these methods are presented based on two different perspectives which represent the hierarchical structure of semi-supervised feature selection methods. The first perspective is based on the basic taxonomy of feature selection methods and the second one is based on the taxonomy of semi supervised learning methods. This survey can be helpful for a researcher to obtain a deep background in semi supervised feature selection methods and choose a proper semi-supervised feature selection method based on the hierarchical structure of them.	0	1	3.93e-05
9551	CURE-SMOTE ALGORITHM AND HYBRID ALGORITHM FOR FEATURE SELECTION AND PARAMETER OPTIMIZATION BASED ON RANDOM FORESTS \| MA L, 2017, BMC BIOINFORMATICS, V18, P, DOI 10.1186/S12859-017-1578-Z \| Background: The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. Results: We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. Conclusion: The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm’s F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. 
Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.	0	0	3.93e-05
49110	LIPIDOMIC PROFILING OF LUNG PLEURAL EFFUSION IDENTIFIES UNIQUE METABOTYPE FOR EGFR MUTANTS IN NON-SMALL CELL LUNG CANCER \| HO YS, 2016, SCI REP-UK, V6, P, DOI 10.1038/SREP35110 \| Cytology and histology forms the cornerstone for the diagnosis of non-small cell lung cancer (NSCLC) but obtaining sufficient tumour cells or tissue biopsies for these tests remains a challenge. We investigate the lipidome of lung pleural effusion (PE) for unique metabolic signatures to discriminate benign versus malignant PE and EGFR versus non-EGFR malignant subgroups to identify novel diagnostic markers that is independent of tumour cell availability. Using liquid chromatography mass spectrometry, we profiled the lipidomes of the PE of 30 benign and 41 malignant cases with or without EGFR mutation. Unsupervised principal component analysis revealed distinctive differences between the lipidomes of benign and malignant PE as well as between EGFR mutants and non-EGFR mutants. Docosapentaenoic acid and Docosahexaenoic acid gave superior sensitivity and specificity for detecting NSCLC when used singly. Additionally, several 20-and 22-carbon polyunsaturated fatty acids and phospholipid species were significantly elevated in the EGFR mutants compared to non-EGFR mutants. A 7-lipid panel showed great promise in the stratification of EGFR from non-EGFR malignant PE. Our data revealed novel lipid candidate markers in the non-cellular fraction of PE that holds potential to aid the diagnosis of benign, EGFR mutation positive and negative NSCLC.	0	0	3.93e-05

Not included in the dataset

The references listed below were cited by papers in the dataset but were not among the 600 records downloaded from the Web of Science.

# Sort citations_ref by InDegree, decreasing
citations_ref <- citations_ref[with(citations_ref, order(-InDegree)), ]
# Extract top 25
top_ref <- head(citations_ref, 25)
# Sort by PageRank, decreasing
citations_ref <- citations_ref[with(citations_ref, order(-PageRank)), ]
# Add to list of highest in-degree papers (references)
top_ref <- rbind(top_ref, head(citations_ref, 25))
# Remove duplicates
top_ref <- top_ref[!duplicated(top_ref[, "FullReference"]), ]
# Sort by InDegree, break ties by PageRank
top_ref <- top_ref[with(top_ref, order(-InDegree, -PageRank)), ]
# Print results
knitr::kable(top_ref[, c("FullReference", "InDegree", "PageRank")])

	FullReference	InDegree	PageRank
1461	BREIMAN L, 2001, MACH LEARN, V45, P5, DOI 10.1023/A:1010933404324	46	7.83e-05
214	CHIH-CHUNG CHANG, 2011, ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, V2, DOI 10.1145/1961189.1961199	39	7.34e-05
291	CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411	32	6.44e-05
2756	2	22	5.83e-05
457	HARALICK RM, 1973, IEEE T SYST MAN CYB, VSMC3, P610, DOI 10.1109/TSMC.1973.4309314	17	5.15e-05
3353	GOLUB TR, 1999, SCIENCE, V286, P531, DOI 10.1126/SCIENCE.286.5439.531	17	4.99e-05
1386	GUYON I, 2002, MACH LEARN, V46, P389, DOI 10.1023/A:1012487302797	16	5.06e-05
1950	LECUN Y, 2015, NATURE, V521, P436, DOI 10.1038/NATURE14539	16	4.99e-05
61	PENG HC, 2005, IEEE T PATTERN ANAL, V27, P1226, DOI 10.1109/TPAMI.2005.159	16	4.93e-05
2435	HALL M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI 10.1145/1656274.1656278	15	5.59e-05
4406	BREIMAN L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350	15	5.00e-05
3367	SAEYS Y, 2007, BIOINFORMATICS, V23, P2507, DOI 10.1093/BIOINFORMATICS/BTM344	15	4.83e-05
2798	TIBSHIRANI R, 1996, J ROY STAT SOC B MET, V58, P267	13	5.00e-05
129	HANAHAN D, 2011, CELL, V144, P646, DOI 10.1016/J.CELL.2011.02.013	12	4.95e-05
8006	KOUROU K, 2015, COMPUT STRUCT BIOTEC, V13, P8, DOI 10.1016/J.CSBJ.2014.11.005	11	4.78e-05
364	CHAWLA NV, 2002, J ARTIF INTELL RES, V16, P321	11	4.71e-05
900	PEDREGOSA F, 2011, J MACH LEARN RES, V12, P2825	11	4.67e-05
3598	GUYON I., 2003, JOURNAL OF MACHINE LEARNING RESEARCH, V3, P1157, DOI 10.1162/153244303322753616	11	4.60e-05
2597	SIEGEL RL, 2015, CA-CANCER J CLIN, V65, P5, DOI 10.3322/CAAC.21254	10	4.70e-05
1949	LECUN Y, 1998, P IEEE, V86, P2278, DOI 10.1109/5.726791	10	4.69e-05
2207	SRIVASTAVA N, 2014, J MACH LEARN RES, V15, P1929	10	4.63e-05
882	HINTON GE, 2006, SCIENCE, V313, P504, DOI 10.1126/SCIENCE.1127647	10	4.55e-05
2668	OJALA T, 2002, IEEE T PATTERN ANAL, V24, P971, DOI 10.1109/TPAMI.2002.1017623	10	4.54e-05
3841	JEMAL A, 2011, CA-CANCER J CLIN, V61, P2011, DOI 10.3322/CAAC.20107	9	4.89e-05
2891	TORRE LA, 2015, CA-CANCER J CLIN, V65, P87, DOI 10.3322/CAAC.21262	9	4.73e-05
1498	VAPNIK V.N., 1998, STAT LEARNING THEORY	9	4.48e-05
1659	HUANG ZW, 2003, INT J CANCER, V107, P1047, DOI 10.1002/IJC.11500	8	4.78e-05
1997	GURCAN M. N., 2009, BIOMEDICAL ENG IEEE, V2, P147, DOI 10.1109/RBME.2009.2034865	8	4.70e-05
1212	ARMATO SG, 2011, MED PHYS, V38, P915, DOI 10.1118/1.3528204	7	4.67e-05
9701	FENG SY, 2011, SCI CHINA LIFE SCI, V54, P828, DOI 10.1007/S11427-011-4212-8	4	4.67e-05
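
The selection above takes the top 25 references by in-degree, adds the top 25 by PageRank, drops duplicates, and sorts the result. Because it is a union of two rankings, the resulting table can hold more than 25 rows. As a minimal sketch of the same pattern, the toy data frame below uses made-up values, and the helper `top_by_both()` is a hypothetical name that is not part of nails:

```r
# Toy data: four references with made-up scores (not from the dataset)
refs <- data.frame(
  FullReference = c("A", "B", "C", "D"),
  InDegree      = c(10, 8, 3, 1),
  PageRank      = c(0.1, 0.4, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: union of the top n rows by InDegree and by PageRank
top_by_both <- function(df, n) {
  by_degree   <- head(df[order(-df$InDegree), ], n)
  by_pagerank <- head(df[order(-df$PageRank), ], n)
  top <- rbind(by_degree, by_pagerank)
  # A reference found in both rankings appears twice; keep the first copy
  top <- top[!duplicated(top$FullReference), ]
  # Sort by InDegree, break ties by PageRank
  top[order(-top$InDegree, -top$PageRank), ]
}

top2 <- top_by_both(refs, 2)
print(top2$FullReference)
```

With `n = 2` the result has three rows: "B" is in both rankings and is kept once, while "C" enters only through its PageRank.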

Most referenced publications

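# Split the semicolon-separated cited references into individual strings
references <- unlist(strsplit(literature$CitedReferences, ";"))

# Return the third comma-separated field (the publication name),
# or "Not found" if the reference has fewer than three fields
get_publication <- function(x) {
    publication <- "Not found"
    try(publication <- unlist(strsplit(x, ","))[[3]], silent = TRUE)
    return(publication)
}

refPublications <- sapply(references, get_publication)
# Strip surrounding whitespace
refPublications <- sapply(refPublications, trim)
refPublications <- refPublications[refPublications != "Not found"]
# Count occurrences of each publication and sort by count, decreasing
refPublications <- as.data.frame(table(refPublications))
names(refPublications) <- c("Publication", "Count")
refPublications <- refPublications[with(refPublications, order(-Count)), ]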

refPublications <- transform(refPublications, Publication = reorder(Publication, 
    Count))

ggplot(head(refPublications, 25), aes(Publication, Count)) + geom_bar(stat = "identity", 
    fill = "orange") + coord_flip() + theme(legend.position = "none") + ggtitle("Most referenced publications") + 
    xlab("Publication") + ylab("Count")
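
The code above treats each cited reference as a comma-separated string whose third field is the publication name. The base-R sketch below shows that step on two reference strings taken from the table in the previous section plus one made-up short string. The helper name `third_field()` is hypothetical, and `trimws()` stands in for the `trim` helper used above.

```r
# A semicolon-separated CitedReferences value; the last entry is a made-up
# reference with fewer than three comma-separated fields
cited <- paste(
  "BREIMAN L, 2001, MACH LEARN, V45, P5, DOI 10.1023/A:1010933404324",
  "LECUN Y, 2015, NATURE, V521, P436, DOI 10.1038/NATURE14539",
  "ANONYMOUS, 1999",
  sep = ";"
)

references <- unlist(strsplit(cited, ";"))

# Hypothetical helper: return the third comma-separated field, or NA
third_field <- function(x) {
  fields <- trimws(unlist(strsplit(x, ",")))
  if (length(fields) >= 3) fields[3] else NA_character_
}

publications <- vapply(references, third_field, character(1), USE.NAMES = FALSE)
# Drop references where no publication field was found
publications <- publications[!is.na(publications)]
print(table(publications))
```

Returning `NA` for short references is a design choice of this sketch. The code above uses the string "Not found" for the same purpose.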

Topic Model

Topic modeling is a statistical text-mining method for discovering common “topics” that occur in a collection of documents. A topic modeling algorithm looks through the abstracts in the dataset for clusters of co-occurring words and groups them together by similarity.
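
To make "co-occurring words" concrete, the base-R sketch below counts how often pairs of words appear in the same document. It is not a topic model, only an illustration of the co-occurrence signal such a model builds on, and the three tokenised "abstracts" are made up.

```r
# Three made-up, already tokenised "abstracts" (not from the dataset)
docs <- list(
  c("cancer", "gene", "expression"),
  c("cancer", "image", "segmentation"),
  c("gene", "expression", "cancer")
)

vocab <- sort(unique(unlist(docs)))

# Document-term matrix: 1 if the word occurs in the document, else 0
dtm <- t(sapply(docs, function(d) as.integer(vocab %in% d)))
colnames(dtm) <- vocab

# Word-by-word co-occurrence: number of documents containing both words
cooc <- crossprod(dtm)
print(cooc)
```

Here "gene" and "expression" share two documents while "gene" and "image" share none, so a topic model would tend to place the first pair in the same topic.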

The following columns describe each topic detected using LDA topic modeling by listing the ten most characteristic words in each topic.

You can specify K, the number of topics, when calling build_topicmodel_from_literature(literature, K). If K is left empty, the stm::searchK function is used to estimate the number of topics. For performance reasons, the search range is limited to between 4 and 12 topics. The estimate is based on the semantic coherence diagnostic values of the structural topic model (stm) library. The raw values are written to the output file kqualityvalues.csv and can be interpreted with the help of the stm documentation if necessary (see section 3.4).
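
As a rough illustration of choosing K from diagnostic values, the sketch below picks the K with the highest semantic coherence from a small table. The numbers are made up, the column names `K` and `semcoh` are assumptions rather than the documented layout of kqualityvalues.csv, and the rule "take the maximum" is an assumption about how such values might be used, not a description of what nails does internally.

```r
# Made-up diagnostic values for the search range 4..12 (illustration only)
diagnostics <- data.frame(
  K      = 4:12,
  semcoh = c(-95.2, -92.8, -90.1, -91.7, -93.4, -96.0, -97.5, -99.1, -101.3)
)

# Semantic coherence is negative; values closer to zero are better
best_k <- diagnostics$K[which.max(diagnostics$semcoh)]
print(best_k)
```

A value chosen this way could then be passed as K in build_topicmodel_from_literature(literature, K).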

The analysis below creates the topic model using the convenience function and then prints the ten most descriptive words for each discovered topic. See the topicmodels documentation on the TopicModel class for further information, and the documentation of build_topicmodel_from_literature for how to use the rest of the data the convenience function provides.

topicmodel <- build_topicmodel_from_literature(literature)

topickeywords <- topicmodels::terms(topicmodel$fit, 10)
tw <- data.frame(topickeywords)
colnames(tw) <- gsub("X", "Topic ", colnames(tw))
knitr::kable(tw, col.names = colnames(tw))

Topic.1	Topic.2	Topic.3	Topic.4	Topic.5
analysi	crowdsourc	cancer	imag	model
patient	data	gene	cancer	method
studi	inform	predict	breast	data
compon	research	cell	detect	featur
sampl	studi	identifi	method	learn
princip	collect	express	system	algorithm
cancer	task	analysi	base	classif
risk	design	tumor	featur	machin
valid	system	studi	segment	perform
signific	provid	treatment	result	select

Performing Literature Reviews with nails package

Juho Salminen and Antti Knutas

2017-12-17

Introduction