As a crude way to look at analytical method metrics, text mining of the research literature will be done at three levels. These are based on:
- title only search
- abstract only search
- fulltext search
Metrics Based on Title Search (9/18/14)
To start this part of the project, the simplest approach is to look at one piece of metadata freely available for all research articles: the title. However, this is not as easy as it might seem, because the context of the metrics needs to be narrowed to analytical chemistry to make the analysis even remotely useful. The two available options for evaluating title data were SciFinder Scholar and Crossref's Metadata API. Because SciFinder does not have an API that can be used to search its database from scripting languages, Crossref's API was chosen.
In the last year Crossref has moved into the area of APIs and text mining to leverage the database of DOIs and metadata that it holds. A web-based search interface is available at http://search.crossref.org/ and an API to the same data (which returns JSON) is available at https://api.crossref.org/. Crossref still considers the API to be 'alpha', but it has good documentation and allows access to all the current DOI records (~100,000,000).
To limit the search to analytical chemistry, the journals currently in Elsevier's Scopus product were extracted from the most recent journal title list. From this list of over 30,000 titles, 117 journals and book series were filtered out and their metadata transferred to a MySQL database. The list was then edited to remove misidentified titles, leaving 76 analytical journals.
In a separate table in MySQL, the twelve analytical metrics from the metrics poll were added with definitions and synonyms (details here). The synonyms were picked as search criteria for searching the Crossref database. A PHP script was written that automated the process of searching all 41 synonyms via the Crossref API for each journal (using ISSN). An example API URL for searching is:
https://api.crossref.org/works?query="limit+of+detection"&filter=issn:"0003-2700"
The JSON string returned from these queries was converted to a PHP array in the script and the 'total-results' parameter extracted. The data were saved to a third MySQL table with the journal, metric, and synonym data. Of the 76 analytical journals, 63 were found to have articles in Crossref and as a result over 2700 searches were run. Summary data is shown below and by-journal data is available here.
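As a rough illustration of the query-and-count step described above, here is a minimal Python sketch (the original work used a PHP script, which is not reproduced here). The function names, the sample response body, and the use of `rows=0` to request only the count are assumptions made for this example. The ISSN is passed without the surrounding quotes shown in the example URL. No network request is made; the sketch only builds the URL and parses a response string.

```python
import json
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(term, issn):
    """Build a works query for one synonym restricted to one journal ISSN."""
    params = {
        "query": f'"{term}"',     # phrase in quotes, as in the example URL
        "filter": f"issn:{issn}",
        "rows": 0,                # assumption: only the hit count is needed
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def total_results(response_text):
    """Extract the hit count from a Crossref JSON response body."""
    payload = json.loads(response_text)
    return payload["message"]["total-results"]


# Hypothetical response body, trimmed to the fields used here.
sample = '{"status": "ok", "message": {"total-results": 42, "items": []}}'

print(build_query_url("limit of detection", "0003-2700"))
print(total_results(sample))
```

In a full run, one such URL would be generated for every synonym and ISSN pair and the extracted count stored alongside the journal, metric, and synonym.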
Metric | # Articles | Percent Articles |
Coefficient of Determination | 12 | 0.002% |
Limit of Detection | 505 | 0.088% |
Limit of Linearity | 3 | 0.001% |
Limit of Quantitation | 25 | 0.004% |
Linear Dynamic Range | 106 | 0.018% |
Repeatability | 36 | 0.006% |
Reproducibility | 218 | 0.038% |
Selectivity | 1500 | 0.261% |
Sensitivity | 1609 | 0.280% |
Spike Recovery | 1 | <0.001% |
Sample Size | 39 | 0.007% |
Sample Throughput | 11 | 0.002% |
TOTAL | 575488 | - |
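The percentage column is simply each metric's article count divided by the total number of articles searched. A small Python sketch of that arithmetic, using three rows from the table (the function and variable names are illustrative only):

```python
TOTAL_ARTICLES = 575488  # TOTAL row of the table


def percent_of_articles(hits, total=TOTAL_ARTICLES):
    """Return the share of articles whose title matched a metric, in percent."""
    return 100.0 * hits / total


# Counts copied from the table above.
counts = {"Limit of Detection": 505, "Selectivity": 1500, "Sensitivity": 1609}

for metric, hits in counts.items():
    print(f"{metric}: {percent_of_articles(hits):.3f}%")
```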
Looking at the data, the main metrics that show up in article titles are sensitivity, selectivity, and limit of detection. This is probably not surprising, although it should be pointed out that selectivity was definitely more prevalent as a metric in chromatography journals, for obvious reasons. It will be interesting to see how this compares with the abstract-based data.
Metrics Based on Abstract Search (11/13/14)
To continue the evaluation of analytical metrics, the next step was to search the abstracts of articles. To do this I searched two data sources that contain abstracts of analytical papers: i) RSC's Analytical Abstracts (AA) database (commercial) at http://pubs.rsc.org/lus/analytical-abstracts and ii) the Flow Analysis Database (FAD) (free) at https://fad.stuchalk.domains.unf.edu. RSC was kind enough to give me access to AA because of the ChAMP project, and I am the developer of the FAD website and its backend MySQL database.
The FAD has 17310 papers through 2007 and, although the abstracts are not available on the website, I have collected over 99.9% of the abstracts in the database. AA has almost 500,000 articles; searches for the key terms were done online and the abstracts subsequently downloaded into a MySQL database. Using this process, a subset of 187,224 abstracts was collected for subsequent analysis.
In order to compare the data between both sets of abstracts, the AA dataset was cleaned using a process developed on the FAD dataset. First, HTML tags and character entities were removed, special characters (CR and LF) deleted, and misspellings corrected. The last step of this process is achieved by creating a MySQL full-text index on the abstract field of the database, exporting it to a text file, and importing it into Excel. The Excel spreadsheet is then used to search for misspelled words (a slow process) and corrections are applied to the database using the MySQL command
UPDATE `citations` SET abs=REPLACE(abs,'<term>','<corrected>') WHERE MATCH (absft) AGAINST ('<term>');
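The first two cleanup steps (tags and entities, then CR and LF characters) can be sketched in Python as follows. This is only an illustration, not the process actually used: it assumes that entities are decoded to their characters rather than dropped, and that line breaks become single spaces so that neighbouring words do not run together. The function name `clean_abstract` is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_abstract(text):
    """Strip HTML tags, decode character entities, and flatten line breaks."""
    text = TAG_RE.sub("", text)             # remove tags such as <p> or <i>
    text = html.unescape(text)              # decode entities such as &amp;
    text = re.sub(r"[\r\n]+", " ", text)    # CR/LF become a single space
    return re.sub(r" {2,}", " ", text).strip()


raw = "<p>Detection &amp; recovery</p>\r\n<i>0.5</i> ng"
print(clean_abstract(raw))
```

Tags are removed before entities are decoded so that escaped angle brackets in the text (for example `&lt;`) are not mistaken for markup.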
For this work, effort focused primarily on words in the terms that would subsequently be searched. Some examples of the misspellings found for keywords are given below.
Keyword | Misspellings Found |
Determination | deteermination, detemination, detemrination, deter (abbrev), deterination, determ (abbrev), determation, determiantion, determinafion, determinaiton, determinatiion, determinatin, determinatiom, determinaton, determinatuion, determinination, determintation, determintin, determintion, detmermination |
Limit | limita, limitat, limmit, limt, limts, linit |
Quantitation | quanitation, quanititation, quantitaton, quantition, quantivation, quantiation, quatitation |
Quantitative | quantative, quanitative, quanititative, quantiative (synonym), quantificative, quantitiative, quantitive (synonym), quantititative, quantizative, quatitative |
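For illustration, a correction table like the one above could be applied to abstract text with a whole-word replacement. The Python sketch below is an assumption-laden stand-in for the MySQL-based workflow: it uses only a handful of the listed misspellings, matches whole words only (unlike a plain substring `REPLACE`), and lowercases the corrected word.

```python
import re

# A small subset of the misspellings listed above, mapped to their corrections.
CORRECTIONS = {
    "detemination": "determination",
    "limmit": "limit",
    "quanitation": "quantitation",
    "quatitative": "quantitative",
}

# One pattern matching any misspelling as a whole word, case-insensitively.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b",
    re.IGNORECASE,
)


def correct_spelling(text):
    """Replace each known misspelling with its correction (in lowercase)."""
    return PATTERN.sub(lambda m: CORRECTIONS[m.group(0).lower()], text)


print(correct_spelling("The limmit of quanitation was 5 ng/mL."))
```

Matching on word boundaries avoids rewriting a misspelling that happens to sit inside a longer, different word, at the cost of missing inflected forms that are not in the table.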
During the cleanup phase, alternate search terms were identified and added to the list of those to be searched for the respective metrics (see below). Correlation coefficients were added to coefficient of determination because the two are related through r and authors more often report the correlation coefficient.
A summary of the statistics found is shown below (an Excel spreadsheet with the data will be available shortly). In general, both datasets agree on the most frequently seen metrics and on the most frequently used term for each of those metrics. The percentages of each metric found in the whole database are quite different, which might be explained by the slightly different perspectives: AA covers all of analytical chemistry, both quantitative and qualitative, whereas FAD is specific to flow analysis, which is almost exclusively quantitative.
Looking at the data in the AA set in terms of the number of metrics found in each paper, the majority of papers report only one metric and nearly 76% of the papers report 1-3 metrics. Interestingly, 2% of the papers in the AA dataset did not have a metric even though the papers were downloaded by searching for the metrics. A deeper analysis of this is planned in the next month.
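The per-paper tally described above can be sketched in Python as follows. The term lists and sample abstracts here are hypothetical and much shorter than the real synonym table. Matching is by simple lowercase substring, which is an assumption and would, for example, also count "insensitivity" as a hit for sensitivity.

```python
from collections import Counter

# Hypothetical subset of metrics and their search terms.
METRIC_TERMS = {
    "Limit of Detection": ["limit of detection", "detection limit"],
    "Reproducibility": ["reproducibility"],
    "Sensitivity": ["sensitivity"],
}


def metrics_in(abstract):
    """Return the set of metrics with at least one term found in the abstract."""
    text = abstract.lower()
    return {
        metric
        for metric, terms in METRIC_TERMS.items()
        if any(term in text for term in terms)
    }


def metric_count_distribution(abstracts):
    """Map 'number of metrics in a paper' to 'number of such papers'."""
    return Counter(len(metrics_in(a)) for a in abstracts)


# Made-up abstracts for demonstration only.
abstracts = [
    "The detection limit was 0.2 ng/mL with good reproducibility.",
    "Sensitivity was improved tenfold.",
    "A flow manifold for nitrate is described.",
]
dist = metric_count_distribution(abstracts)
for n_metrics in sorted(dist):
    print(n_metrics, dist[n_metrics])
```

Papers that land in the zero bucket are the ones worth inspecting by hand, since they were retrieved by a metric search but no term was matched in the cleaned abstract.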
This analysis has provided a clear insight into analytical metrics as reported in the literature. It has provided a lot of information about how to deal with the full text analysis planned next.
Metrics Based on Fulltext Searching
This has not been completed yet.