
Project:

"Detecting Time Points in Protein Folding Data"


Example dataset and result

    [Sample]


Abstract:

Gerra L. Bosco, a PhD student at the University of Chicago, visited our consulting program for help analyzing her protein folding data. Omar de la Cruz was the consultant on that project, but he was kind enough to let me step in and work on a specific part of the problem that I found interesting.

The data are a time series of phosphorescent magnitudes that were recorded by a very sensitive machine. The machine begins by agitating (shaking) the sample for a few moments, and then collects data on the amount of phosphorescence as the proteins in the sample fold up. The machine marks the approximate time at which the agitation stopped and the 'real data' began, but this time stamp is not precise to the level required. The earliest data are very important, and Gerra wanted to be sure that her analysis did not exclude any time points that were 'real' and therefore relevant. Although it was fairly easy to detect these starting points by eye, Gerra had many hundreds of files to analyze, so an automated process was desired.

My goal, then, was to develop an automated process to detect the appropriate 'starting time', where the agitation has ended and the real data begin. My final algorithm evaluates every time point within a generous region around the machine-stamped time point. For each candidate, it fits a non-linear regression to all subsequent data and computes the R-squared value of that model. The time point selected is the one with the maximum R-squared value. The exponential nature of the data leads to a preference for earlier time points whenever the early points do not appear to produce a lack of fit.

The sample link on this page leads to a plot of one of the datasets, with the selected time point in red and the machine-stamped time point as a dashed line. The solid black lines represent the boundaries of the region that was investigated.
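The search described above can be illustrated with a minimal Python sketch. It is not the original implementation, and several details are assumptions made for illustration only: the model form y = a + b*exp(-k*t), the fitting method (a grid over trial rates k with a linear least-squares fit of a and b at each rate, standing in for a general non-linear regression routine), the hypothetical names fit_exponential, r_squared, select_start and make_demo_series, and the synthetic demo data.

```python
import math


def fit_exponential(t, y, rates):
    """Least-squares fit of y ~ a + b*exp(-k*t), profiling over trial rates k.

    For a fixed k the model is linear in (a, b), so each trial reduces to
    simple linear regression on x = exp(-k*t). Returns (sse, a, b, k),
    or None if no trial rate gave a usable regressor.
    """
    n = len(t)
    mean_y = sum(y) / n
    best = None
    for k in rates:
        x = [math.exp(-k * ti) for ti in t]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0.0:
            continue  # degenerate regressor, skip this rate
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        b = sxy / sxx
        a = mean_y - b * mean_x
        sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, a, b, k)
    return best


def r_squared(y, sse):
    """R-squared = 1 - SSE/SST for the fitted segment."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst if sst > 0 else float("-inf")


def select_start(t, y, stamp_index, half_window, rates, min_points=10):
    """Try each candidate start near the machine stamp; keep the best R-squared.

    Each candidate is scored by fitting the model to all data from that
    point onward. Ties go to the earliest candidate.
    """
    lo = max(0, stamp_index - half_window)
    hi = min(len(t) - min_points, stamp_index + half_window)
    best_index, best_r2 = None, float("-inf")
    for i in range(lo, hi + 1):
        seg_t = [ti - t[i] for ti in t[i:]]  # time measured from the candidate
        seg_y = y[i:]
        fit = fit_exponential(seg_t, seg_y, rates)
        if fit is None:
            continue
        r2 = r_squared(seg_y, fit[0])
        if r2 > best_r2:
            best_index, best_r2 = i, r2
    return best_index, best_r2


def make_demo_series(n=120, true_start=20):
    """Synthetic data (assumed): erratic 'agitation' values, then a decay."""
    t = [float(i) for i in range(n)]
    y = []
    for i in range(n):
        if i < true_start:
            y.append(6.0 + 4.0 * (-1) ** i)  # alternating agitation readings
        else:
            signal = 2.0 + 10.0 * math.exp(-0.05 * (i - true_start))
            y.append(signal + 0.1 * math.sin(7 * i))  # small deterministic wobble
    return t, y


RATES = [0.005 * j for j in range(1, 61)]  # trial decay rates, 0.005 to 0.3
t, y = make_demo_series()
# The machine stamp is deliberately placed 5 points after the true start.
start, r2 = select_start(t, y, stamp_index=25, half_window=10, rates=RATES)
print(start, round(r2, 4))
```

In this sketch, a candidate that includes agitation readings is penalized by a large residual sum of squares. A candidate that starts too late discards the steepest part of the curve, which shrinks the total sum of squares faster than the residuals, so R-squared drops slightly. That is one way to see the preference for earlier time points noted above. On real data, a dedicated non-linear least-squares routine would likely replace the rate grid.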