Todd Warczak PhD
My Info
White River Junction, VT (Relocating to Seattle, NY, or other major US city)
206-999-6478
twarczak@gmail.com
github.com/TWarczak
toddwarczak.netlify.app
Education
PhD Molecular & Cellular Biology — Dartmouth College
Sep 2012 - August 2020
BS Biology — University of Utah
Sep 2008 - May 2012
Work Experience
Molecular Biologist
2012 - 2020
- Engineered genome-wide association study (GWAS) to identify genes controlling arsenic tolerance in plants. Determined Arabidopsis gene NIP1;1 on 4th chromosome as the major genetic factor for tolerating arsenic. All data cleaned/wrangled/analyzed in R.
- Built lab RNA-seq pipeline for gene expression of 25000+ plant genes with R scripts for unsupervised learning (PCA, hierarchical clustering), regression (GLMs/ANOVA), exploratory data analysis, and statistical tests.
- Developed first cell-type specific expression maps for plant genes involved in root arsenic acquisition, efflux, and sequestration (using R).
- Presented research to local community members, government officials, and other stakeholders as a representative of the Dartmouth Toxic Metals Superfund Research Program.
- Mentored 2 undergratuate scientists.
Recent Projects
SageMaker + RStudio to Predict Home Prices w/ Multi-class XGBoost; Explaining Model Behavior with Geospacial Plots and SHAP
- Explored Austin dataset to predict home price in Kaggle competition. (Blog, Github)
- Built static and interactive geospacial plots overlayed with feature data.
Feature engineered high/low important words that associate w/ price using NLP.
Trained/tuned/evaluated/deployed SageMaker Multi-class XGBoost on holdout data & submitted predicted binned price to Kaggle competition. Submission scored 0.8876 (mLogLoss), which would have placed 6th (out of 90 entries) in live competition.
- Modified {SHAPforxgboost} package to generate multi-class SHapley Additive exPlanations (SHAP) values/plots that explain how XGBoost model made predictions.
Predicting Churn using AWS SageMaker & Local RStudio
- Utilized SageMaker and built-in XGBoost algorithm to train, tune, evaluate, and deploy model for predicting bank customer churn in the SLICED season 1 episode 7 Kaggle competition. (Blog, Github)
- Configured local RStudio to make API calls to SageMaker using SageMaker Python SDK and {reticulate} R package.
- Best performing model deployed as SageMaker endpoint for real-time predictions on holdout data (w/ minimal feature engineering and pre-processing). Predictions submitted to SLICED competition received a score of 0.07622 (LogLoss), which would have placed 8th out of 130 entries.
Forecasting Daily Sales with Modeltime
- Forecasted 3 months of daily sales utilizing the {modeltime} package in R to help ‘Superstore’ company selling furniture, technology, and office supplies manage supply-chain and inventory decisions in Q1 (data source: Kaggle). (Blog, Github)
- Exploratory data analysis and 11 models tested. 6 {tidymodels} workflows tuned for individual forecasts and weighted ensemble (Support Vector Machine/Neural Network-AR/Random Forest/Prophet-XGboost) forecast.
TidyTuesday
- Weekly twitter project focusing on cleaning, wrangling, summarizing, and arranging a new dataset in R to produce a single chart. Typically using {ggplot2} and {tidyverse} tools. Shared via #TidyTuesday. github-TidyTuesday
- Mario Kart 64 World Records - Visualization of cumulative days individual records are held by players.
- Survivor - Visualization showing Myers-Briggs Type Indicator personalities of all show contestants vs. the winners.
- Great Lakes Commercial Fishing - Rolling averages of fish family yearly catch.
Data Science Skills
- R ★★★★
- Python ★★
- AWS SageMaker/S3/EC2…
- SQL
- GitHub
- Markdown/Jupyter Notebooks
- Data Cleaning/Wrangling
- Data Visualization
- Probability/Statistics
- Machine Learning
- Regression/Classification
- Time-Series Forecasting
- TensorFlow/Tidymodels/H2O…