Bug in compare_categories.py, supervised_learning.py, and detrend.py

October 30, 2013
QIIME users,
 
A bug was recently discovered in QIIME 1.7.0 (and previous QIIME versions) that may affect some of QIIME’s scripts that wrap R functionality. These scripts include compare_categories.py, supervised_learning.py, and detrend.py.
 
The bug involves a discrepancy in how QIIME parses metadata mapping files. QIIME has two separate mapping file parsers, one written in Python and one written in R. The Python parser strips leading/trailing whitespace from mapping file data fields, while the R parser does not. When the R parser is used, this can lead to the exclusion of samples and/or the incorrect grouping of samples based on a mapping file category. This bug may affect the results of compare_categories.py (only adonis, db-RDA, Moran’s I, MRPP, and PERMDISP; the other methods are fine), supervised_learning.py, and detrend.py. Mapping files that successfully passed check_id_map.py’s validation tests may still be affected by this issue.
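
To make the sample exclusion concrete, here is a minimal Python sketch. It is not QIIME's parser code: the function name shared_sample_ids and the sample IDs are made up for illustration, and it assumes that an analysis keeps only the samples whose mapping file ID exactly matches an ID in the data being analyzed.

```python
def shared_sample_ids(mapping_ids, table_ids, strip):
    """Return the mapping file sample IDs that also appear in the data table.

    strip=True mimics a parser that strips whitespace (the Python behavior
    described above); strip=False mimics one that does not (the R behavior).
    """
    if strip:
        mapping_ids = [sample_id.strip() for sample_id in mapping_ids]
    table = set(table_ids)
    return [sample_id for sample_id in mapping_ids if sample_id in table]


# Hypothetical IDs: two of the mapping file IDs carry stray whitespace.
mapping_ids = ["S1", "S2 ", " S3"]
table_ids = ["S1", "S2", "S3"]

kept_with_strip = shared_sample_ids(mapping_ids, table_ids, strip=True)
kept_without_strip = shared_sample_ids(mapping_ids, table_ids, strip=False)

print(kept_with_strip)     # all three samples match
print(kept_without_strip)  # only S1 matches; the padded IDs are dropped
```

Under these assumptions, the unstripped IDs silently fail to match and the corresponding samples drop out of the analysis.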
 
IMPORTANT: If your mapping file has leading/trailing whitespace (with or without double quotes) in any of the mapping file fields and you used any of the aforementioned scripts, *your results may be incorrect*. It is very easy to accidentally add leading/trailing whitespace to your mapping file, especially if you are editing one by hand in a spreadsheet program such as Excel, so this bug may affect you even if you’re sure that your mapping file fields don’t have this issue.
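
If you want a quick way to see whether a mapping file is affected, something like the following sketch may help. It is an illustration only, not part of QIIME: find_padded_fields is a hypothetical helper, and it assumes a tab-delimited mapping file and treats any field containing double quotes or leading/trailing whitespace as suspect.

```python
def find_padded_fields(lines):
    """Return (line_number, column_index, raw_value) for each suspect field.

    A field is suspect if removing double quotes and surrounding whitespace
    changes it. Line numbers are 1-based; column indices are 0-based.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        # Drop only the line terminator so trailing spaces in the last
        # field are still detected.
        fields = line.rstrip("\r\n").split("\t")
        for col, value in enumerate(fields):
            if value.replace('"', "").strip() != value:
                hits.append((line_no, col, value))
    return hits


# Hypothetical mapping file contents.
example = [
    "#SampleID\tTreatment\tDescription\n",
    "S1\tControl\tfirst\n",
    "S2\t Fast   \tsecond\n",
    "S3 \tFast\tthird\n",
]

for hit in find_padded_fields(example):
    print(hit)
```

To check a real file, the same function can be given an open file object, for example find_padded_fields(open("map.txt")).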
 
Unfortunately, we cannot predict how your results may have changed due to this bug (e.g., we cannot predict whether statistical tests will be overly conservative, or how far off a test statistic might be from the correct value). Samples whose fields have leading/trailing whitespace may be dropped from analyses, and whitespace may create incorrect groupings of samples within a categorical variable. For example, a Treatment category may contain samples labeled Control and Fast, but if a couple of the samples were labeled ‘ Fast   ’ (without quotes), these samples would be artificially grouped together instead of being grouped with the rest of the Fast samples. Because this exclusion and artificial grouping depend entirely on where stray whitespace happens to occur, we cannot predict how your results might be affected.
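
The Treatment example can be sketched in a few lines of Python. This is an illustration with made-up sample IDs, not QIIME code; group_samples is a hypothetical helper that groups samples by the exact text of their category value, with or without stripping whitespace first.

```python
from collections import defaultdict


def group_samples(rows, strip):
    """Group sample IDs by category value, optionally stripping whitespace."""
    groups = defaultdict(list)
    for sample_id, treatment in rows:
        key = treatment.strip() if strip else treatment
        groups[key].append(sample_id)
    return dict(groups)


# Hypothetical (SampleID, Treatment) pairs; S3 and S4 carry stray whitespace.
rows = [
    ("S1", "Control"),
    ("S2", "Fast"),
    ("S3", " Fast   "),
    ("S4", " Fast   "),
    ("S5", "Fast"),
]

stripped = group_samples(rows, strip=True)
unstripped = group_samples(rows, strip=False)

print(sorted(stripped))    # two groups: Control and Fast
print(sorted(unstripped))  # three groups: the padded label forms its own
```

Without stripping, S3 and S4 end up in a third group of their own, so any test that compares groups sees a different design than the one intended.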
 
This parsing bug has been fixed in the latest development version of QIIME (1.7.0-dev), along with check_id_map.py’s validation tests and documentation (fixed on October 3, 2013, commit 5669b5891c26c9631c465243c046cdc33d0f8ba7). In order to avoid serious issues like this in the future, we are working on migrating the R parsing code into its own CRAN package (with extensive unit tests), as well as formally defining a metadata mapping file format so that other tools can consistently implement and validate this format.
 
If you are using a release version (or an older development version) of QIIME, we have put together a workaround that you can use until the next QIIME release. There is a script called strip_mapping_file_fields.py (hosted as a Gist on GitHub) that takes a mapping file as input, strips any double quotes ("), then strips any leading/trailing whitespace, and writes the resulting data to a new mapping file. This script requires QIIME 1.7.0 to be installed. If you are a MacQIIME user, you will need to first activate your MacQIIME environment by running the ‘macqiime’ command.
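
For readers curious about what this kind of cleanup involves, here is a rough standard-library sketch of the two steps described above (remove double quotes, then strip leading/trailing whitespace). It is not the strip_mapping_file_fields.py script itself, which relies on QIIME; the function names are hypothetical, and the sketch assumes a tab-delimited file and applies the same cleanup to every line, including header and comment lines.

```python
def clean_field(value):
    """Remove double quotes, then strip leading/trailing whitespace."""
    return value.replace('"', "").strip()


def clean_mapping_lines(lines):
    """Return cleaned tab-delimited lines (without line terminators)."""
    cleaned = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        cleaned.append("\t".join(clean_field(field) for field in fields))
    return cleaned


def clean_mapping_file(in_path, out_path):
    """Read a mapping file, clean every field, and write a new file."""
    with open(in_path) as fin:
        cleaned = clean_mapping_lines(fin)
    with open(out_path, "w") as fout:
        fout.write("\n".join(cleaned) + "\n")


# Hypothetical input lines with quotes and stray whitespace.
example = [
    "#SampleID\tTreatment\n",
    'S1\t" Fast  "\n',
    "S2 \tControl\n",
]
print(clean_mapping_lines(example))
```

For actual analyses we recommend using the script from the Gist rather than this sketch, since the sketch has not been tested against QIIME's mapping file handling.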
 
We highly recommend that you first use check_id_map.py to validate your mapping file and *ensure that there are no errors or warnings*. Next, use this script to “cleanse” your mapping file before using any of the aforementioned scripts (compare_categories.py, supervised_learning.py, and detrend.py).
 
Here’s how you can download and use the script. Assuming your mapping file is named map.txt, run:
 
check_id_map.py -m map.txt -o check_id_map_output
# Fix any issues with your mapping file until there are no errors or warnings.
git clone https://gist.github.com/7009545.git strip_mapping_file_fields
python strip_mapping_file_fields/strip_mapping_file_fields.py -m map.txt -o map_fixed.txt
# Continue with your analyses, using map_fixed.txt.
 
We apologize for any inconvenience this may cause you, and we will continue striving to prevent bugs of this nature as QIIME development progresses. As always, please get in touch with us on the QIIME forum if you have any issues or questions.
 
–Jai