Welcome to the Bump Hunting Project by Patient Rule Induction Method. This website briefly describes the goal of the project and its software, PRIMsrc. It explains why and how you can use the software and provides general remarks and links about it.
Overview
The general problem in "Bump Hunting" (BH) is to identify, characterize and predict hidden structures in the data that are informative and significant. In practice, "Bump Hunting" refers to the task of mapping out local regions of the input space (the space of attributes, features or predictors) where a target function of interest, usually unknown, assumes larger (or smaller) values than its average over the entire space. These sought-after regions of extreme values of the target function are also known as the supports of its local/global extrema. The search may be performed in any low- or high-dimensional input space, and the target function may be any function of interest. See the Wiki page for details.
The picture below illustrates the idea. The sunshine over the mountain range shows how light can uncover peaks, highlands and valleys, just like we want to do for data structures in the target function by "Bump Hunting".
(Bill Wight Photography, Copyright 2015, with permission)
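The mapping task described above can be written a little more formally. The notation below is an assumption made for illustration and does not come from the project pages: S is the input space, p(x) the density of the inputs, f(x) the target function and R a candidate region of S.

```latex
\bar{f} = \int_{S} f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x},
\qquad
\bar{f}_{R} = \frac{\int_{R} f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}}
                   {\int_{R} p(\mathbf{x})\, d\mathbf{x}},
\qquad
\beta_{R} = \int_{R} p(\mathbf{x})\, d\mathbf{x}
```

In this notation, "Bump Hunting" looks for a region R whose average \bar{f}_{R} is much larger (or much smaller) than the global average \bar{f}, while its support \beta_{R} stays large enough for the region to be of practical interest.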
"Bump Hunting" applies to mathematical / statistical problems such as:
- Mode(s) Hunting
- Local/Global Extremum(a) Finding
- Subgroup(s) Identification
- Outlier(s) Detection
- …
PRIMsrc implements a unified treatment of the "Bump Hunting" task in high-dimensional space. It uses a generic rule-induction algorithm by recursive peelings derived from the Patient Rule Induction Method (PRIM), initially introduced by Friedman & Fisher in 1999 (see Wiki "References"). It generates simple decision rules delineating a region (or regions) in the multi-dimensional input space where the target function is markedly larger (or smaller) than its average over the entire space.
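To make the idea of rule induction by recursive peeling more concrete, here is a minimal Python sketch of a generic top-down peeling loop. It is not the PRIMsrc implementation: the function names (`box_mean`, `quantile`, `peel`), the parameters `alpha` and `min_support`, and the plain mean-response criterion are all assumptions chosen for illustration. At each step the sketch removes a small fraction of the remaining samples along one face of the current box, keeping the peel that gives the highest mean response inside the box.

```python
import random

def box_mean(X, y, box):
    """Return (mean response, sample count) for the samples inside the box."""
    inside = [yi for xi, yi in zip(X, y)
              if all(lo <= v <= hi for v, (lo, hi) in zip(xi, box))]
    mean = sum(inside) / len(inside) if inside else float("-inf")
    return mean, len(inside)

def quantile(values, q):
    """Simple empirical quantile: the sorted value at position int(q * n)."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, int(q * len(s))))]

def peel(X, y, alpha=0.1, min_support=0.1):
    """Shrink a box one face at a time; return the list of (box, mean, count)."""
    n, p = len(X), len(X[0])
    # Start from the box that spans all the data.
    box = [(min(r[j] for r in X), max(r[j] for r in X)) for j in range(p)]
    trajectory = [(box, *box_mean(X, y, box))]
    while True:
        inside = [xi for xi in X
                  if all(lo <= v <= hi for v, (lo, hi) in zip(xi, box))]
        best = None
        for j in range(p):
            col = [xi[j] for xi in inside]
            lo, hi = box[j]
            # Two candidate peels per variable: raise the lower bound
            # or lower the upper bound by roughly an alpha-fraction.
            for cand in ((quantile(col, alpha), hi),
                         (lo, quantile(col, 1 - alpha))):
                trial = box[:j] + [cand] + box[j + 1:]
                m, k = box_mean(X, y, trial)
                # Keep only peels that remove samples and respect the support.
                if 0 < k < len(inside) and k >= min_support * n:
                    if best is None or m > best[1]:
                        best = (trial, m, k)
        if best is None:  # no admissible peel is left
            break
        box = best[0]
        trajectory.append(best)
    return trajectory

if __name__ == "__main__":
    # Synthetic data (hypothetical): the response is 1 inside a small square.
    random.seed(0)
    X = [[random.random(), random.random()] for _ in range(500)]
    y = [1.0 if 0.6 < a < 0.9 and 0.2 < b < 0.5 else 0.0 for a, b in X]
    final_box, final_mean, final_count = peel(X, y, min_support=0.05)[-1]
    print("rule:", " and ".join(f"{lo:.2f} <= x{j} <= {hi:.2f}"
                                for j, (lo, hi) in enumerate(final_box)))
    print("mean inside box:", round(final_mean, 3), "samples:", final_count)
```

The returned trajectory holds one box per peeling step, so a user can trade off box mean against support when picking a final rule. A small `alpha` is what makes the peeling "patient": each step discards only a few samples, which leaves room for later steps to correct course.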
Why Use PRIMsrc?
The method is attractive to users because it (i) makes minimal assumptions about the data, (ii) gives easily interpretable rules with estimated variance and (iii) can target any desired response, being supervised for Survival, Regression and Classification (SRC) settings.
Unlike classical regression, classification and clustering problems, "Bump Hunting" is interested in:
- Understanding and characterizing newly identified sub-groups of samples and homogeneous sub-populations
- Discovering and describing sub-groups of samples and sub-populations with extreme responses
- Identifying and predicting future sub-groups of samples and sub-populations with extreme responses
- Customizing and/or targeting sub-groups of samples and sub-populations with extreme responses
- …
Applications exist in a growing range of fields, including Medicine, Engineering, Materials Research, Marketing, Business Analytics, Actuarial Science and Behavioral Science:
- subgroup finding
- disparity subtyping
- alternative drug/treatment indication (re-purposing)
- personalized medicine (improved accuracy of diagnosis and/or prognosis)
- economical medicine (hot spotting)
- system reliability analysis in engineering
- duration analysis/modeling in economics
- event history analysis in sociology
- financial securities return
- insurance risk assessment/management
- …
Readme
Visit the software Readme webpage to learn about License, Downloads, Branches, Requirements, Installation and Usage.
Wiki
Visit the project Wiki webpage for Roadmap, Documentation, Examples, Publications, Case Studies, Support and How to Contribute (code and documentation).
Authors/Contributors
Jean-Eudes Dazard, PhD.
Center for Proteomics and Bioinformatics (at the time of study/design)
Case Western Reserve University
Cleveland, Ohio, USA
J. Sunil Rao, PhD.
Division of Biostatistics
Department of Epidemiology and Public Health
The University of Miami
Miami, Florida, USA
Michael LeBlanc, PhD.
Fred Hutchinson Cancer Research Center
Public Health Sciences
Department of Biostatistics, School of Public Health
The University of Washington
Seattle, Washington, USA
Michael Choe, MD.
Case Western Reserve University (at the time of study/design)
Cleveland, Ohio, USA
Tarn Duong, PhD.
Research scientist
Computer Science Laboratory (LIPN)
University of Paris 13
Paris, France
Acknowledgements
This project was funded in part by the National Institutes of Health - National Cancer Institute, Grant R01-CA160593, awarded to J. Sunil Rao and J-E. Dazard (co-PIs). This work was also made possible thanks to the help of Alberto Santana, MBA (Analyst Programmer, CWRU) and the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. Thanks also to professional photographer Bill Wight, CA, for the illustration picture above.