i've done a deep dive into several papers adjacent to my plan.
some key takeaways:
regarding eric hamberg (ovarian cancer; observe matured organoids to generate endpoint living vs. dead regions):
- inference with a complex model (a UNet combined with a PatchGAN-style CNN) built on extremely sophisticated data extrapolations still yielded poor results in this paper. it only performed well in the middle ground between success and failure, not on cases of complete success or complete failure.
- simplicity might trump complexity in both data and model architecture?
regarding matsumoto (HP; a single snapshot per time window during maturation, mapped to a binary classification endpoint):
- must adjust focus onto the periphery during imaging to enable good preprocessing (segmentation + feature extraction). this is very important since most of the non-area features depend entirely on the "bark" of the clusters (see the rim sketch after this list)!
- morphology outweighs bulk RNA-seq analysis! overall structure, even captured only from the surface, beats gene-level information! that is seriously strong technical validation!
- have the right amount of data: increase data count up to a point and stop before diminishing returns set in (performance [y] grows roughly logarithmically with computational cost [x])
- have the right image size: push resolution up to a point and stop before diminishing returns set in, just like the data count (see the learning-curve sketch after this list)
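quick sketch of the "bark" idea from above: given a binary cluster mask from segmentation, strip off an outer rim and compute stats only on that rim. the rim width and the exact features here are placeholders of mine, not anything from matsumoto.

```python
import numpy as np
from scipy import ndimage

def bark_features(image: np.ndarray, mask: np.ndarray, rim_px: int = 5) -> dict:
    """periphery ("bark") stats from a grayscale image + binary cluster mask."""
    mask = mask.astype(bool)
    # erode the mask to remove the outer rim, then subtract to keep only the rim
    core = ndimage.binary_erosion(mask, iterations=rim_px)
    rim = mask & ~core
    rim_vals = image[rim]
    return {
        "area_px": int(mask.sum()),
        "rim_area_px": int(rim.sum()),
        "rim_mean_intensity": float(rim_vals.mean()) if rim_vals.size else 0.0,
        "rim_intensity_std": float(rim_vals.std()) if rim_vals.size else 0.0,
    }
```

if focus on the periphery is off, these rim stats are the first thing to degrade, which is exactly why the autofocus point in the plan below matters.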
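and a tiny sketch of the diminishing-returns logic for data count / resolution: fit performance ≈ a + b·ln(cost) to a few measured points and stop scaling once the predicted gain per doubling falls below whatever gain is still worth paying for. the numbers are made up for illustration, not from the paper.

```python
import numpy as np

# hypothetical measurements: (image count, validation score)
costs = np.array([100, 200, 400, 800, 1600])
scores = np.array([0.61, 0.68, 0.72, 0.74, 0.75])

# fit scores ≈ a + b * ln(cost); np.polyfit returns [slope, intercept]
b, a = np.polyfit(np.log(costs), scores, deg=1)
gain_per_doubling = b * np.log(2)
print(f"predicted gain per doubling of data: {gain_per_doubling:.3f}")
# keep scaling only while this stays above the smallest gain worth the cost
```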
regarding chen xian (retinal; true longitudinal - the training data spans a larger time window containing both the inference input and the inference output, while inference itself looks at a smaller window):
- diffusion could be the long game (?); the strong performance comes with huge computational cost, so there is a real trade-off with diffusion.
- not relevant enough because the ground truth is the same type of data as the input (like IHC serving as the industry-standard cortical ground truth)
regarding asano (HP; day-30 brightfield to predict its fluorescence-based gene expression on the same day):
- ensembling should be looked into since it produced a marginal performance increase.
- potentially need to highlight (visually) that ai is generally better than humans at these predictions, for pitching purposes; this would require collecting some human predictions for comparison [might be time-consuming, so optional]
regarding my plan:
- stick w/ simple architectures for now, and maybe for the future as well; cost is the main bottleneck, so efficiency should be weighted heavily (aim for max performance at high efficiency; essentially be like the M-series chips, not the Intel x64 ones)
- look into ensembles (especially for GBDTs); see the ensemble sketch after this list
- don't stretch for data count + large resolution if unnecessary, but for now definitely scale the PoC data as large as possible, and run conservative data collection for the real deal (at a substantially larger capacity ofc, but not exceeding the 1500s)
- true longitudinal tracking might enable better correlation discovery; it's the difference between a single X-ray scan by a hospital doctor and a personal trainer who knows how your workouts, habits, diet, sleep, and HRV change over time.
- morphology might be underrated (countering the previous statement); however, asano was doing binary classification, whereas i aim for regression (although a pivot is an option if necessary)
- rule-based algorithms might alleviate some modeling pain through gatekeeping (a purist approach for the final data scheme); see the gate sketch after this list
- be consistent with data collection to eliminate as many uncontrolled variables as possible (the human-preventable ones). this includes reagents, the time window within the day, imaging lighting + contrast + (MOST IMPORTANTLY) the right focus ← try autofocus.
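sketch of the GBDT-ensemble idea from the plan: average a few boosted regressors and only keep the ensemble if it beats the best single model by enough to justify the extra cost. libraries and hyperparameters here are my own placeholders (assuming lightgbm/xgboost are installed), not anything asano actually ran.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
import xgboost as xgb

# stand-in tabular data: rows = clusters, columns = morphology features
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

ensemble = VotingRegressor([
    ("lgbm", lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)),
    ("xgb", xgb.XGBRegressor(n_estimators=300, learning_rate=0.05)),
    ("sk_gbdt", GradientBoostingRegressor(n_estimators=300)),
])

scores = cross_val_score(ensemble, X, y, cv=5, scoring="r2")
print(f"ensemble R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```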
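and a sketch of the rule-based gatekeeping + focus-consistency point: cheap, deterministic checks that reject a sample before it ever reaches the model. the thresholds are hypothetical and would have to be set from the PoC data.

```python
import numpy as np
from scipy import ndimage

MIN_AREA_PX = 2_000    # reject clusters whose segmented area is suspiciously small
MIN_FOCUS_VAR = 50.0   # reject out-of-focus frames (variance of the Laplacian)

def passes_gate(image: np.ndarray, mask: np.ndarray) -> bool:
    """return True only if the sample clears basic area and focus rules."""
    if int(mask.sum()) < MIN_AREA_PX:
        return False
    focus_score = float(ndimage.laplace(image.astype(float)).var())
    return focus_score >= MIN_FOCUS_VAR
```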
wagmi