Two Sessions: Morning 9am-12pm & Afternoon 2-5pm
12:30pm-1:30pm: Meet the Sponsors
Lunch on your own: 12pm-2pm
Choice of 6 different workshops for each session
Morning Workshop Descriptions
1. Tidyverse: How R Should Be Used (Beginner, R)
R has rapidly grown into one of the most widely used programming languages for data science in recent years. However, in its most basic form, R is complicated, a bit choppy, and generally doesn’t provide much benefit over its main competitor, Python. Enter the Tidyverse, a collection of packages maintained by RStudio for reading, cleaning, processing, modeling, and visualizing data. In this workshop, we will learn best practices for coding in the Tidyverse and come to understand why it is giving pandas a run for its money. We will cover everything from installing and configuring the latest versions of R and the Tidyverse to building a model on sample data and visualizing the results in a markdown document.
Presenter: Zane Murphy
Pre-requisites: Basic knowledge of R is helpful, but not required
Downloads: Install R and Tidyverse, Sample Dataset
2. Preprocessing Data for Machine Learning (Intermediate, Python)
Getting your data ready for modeling is the essential first step in the machine learning process. In this workshop, you’ll learn the basics of how and when to perform data preprocessing. You’ll learn:
- How to perform basic techniques such as dealing with missing data and incorrect data types
- How to standardize your data so that it’s in the right form for modeling
- The benefits of creating new features to leverage the information in your dataset
- The process for selecting the best features to improve your model fit
Presenter: Sarah Guido
Pre-requisites: Basic familiarity with scikit-learn and pandas is necessary for this workshop.
Downloads: Docker. Sarah will create a Docker image with the necessary notebooks and libraries.
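To give a flavor of what the workshop covers, the steps above can be sketched with pandas and scikit-learn. This is only a minimal illustration on a made-up table (the column names are hypothetical), not Sarah's workshop materials:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy dataset with a missing value and a numeric column stored as strings
df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "income": ["50000", "62000", "58000", "71000"],  # wrong dtype
})

# Fix the incorrect data type
df["income"] = pd.to_numeric(df["income"])

# Fill the missing value with the column median
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(df)

# Standardize so every feature has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~ [0, 0]
```

Feature creation and selection build on the same array once the basics are in place.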
3. Regression 3 Ways (or 4, or 5) (Beginner, R, Python, Julia, Jupyter Notebooks, Scala, Haskell)
Just as a struggling contestant on a cooking show will cook their dish three ways, we will run the same regressions in R, Python, and Julia. We will do linear and logistic regressions by loading the same data, generating summary statistics, visualizing variable relationships, and building a model in each of the three languages, all using the same Jupyter Notebook front end. If time permits, we’ll do the regressions again in Scala and Haskell. There are many tools for data science, and by looking at the same problem in a few of them, we’ll begin to understand why one would choose one tool over another. Fair warning: the example problems are deliberately simple, so they won’t by themselves settle which language you should choose.
Presenter: Oliver Will
Pre-requisites: None. However, Oliver intends to get through the examples in R, Python, and Julia; he will move quickly and will not slow down if doing so would prevent finishing those examples.
Downloads: R, Python, Julia, Jupyter Notebook. Have R, Python, and Julia working on your computer before you arrive. Detailed instructions at this GitHub repository https://github.com/owill/DataPhilly2020.
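For reference, the Python leg of the workflow might look like the sketch below: the same fit-and-inspect pattern would be repeated in R and Julia. This uses scikit-learn on synthetic data; the workshop's actual examples live in the GitHub repository above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Linear regression: y is a noisy linear function of X
y_lin = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
lin = LinearRegression().fit(X, y_lin)
print(lin.coef_)  # close to [3, -2]

# Logistic regression: binary label from a linear decision boundary
y_log = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_log)
print(log.score(X, y_log))  # training accuracy
```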
4. From Stored Data to Data Stories (Beginner, R, Python, (& notebooks: RMarkdown, Jupyter))
This workshop will take one through the steps associated with an end-to-end machine learning campaign: data retrieval; data curation; model construction, evaluation, selection and interpretation; and reporting. Particular attention will be paid to reporting, i.e., building a narrative. Examples will be presented demonstrating how one might generate multiple output formats (e.g., HTML pages, presentation slides, PDF documents) starting with the same code base.
As a specific example, a data narrative will be built showing how one might build predictive models for the toxicity of organic molecules. Reports will be presented as (1) an HTML file, (2) a PDF or Word document (in a format acceptable for journal submission), and (3) a slide presentation.
While the workshop’s example comes from the field of cheminformatics, the computational tools used and the exercises presented are applicable to any field where an investigator is interested in building predictive models, and describing these models to colleagues and associates.
The R and Python ecosystems will be used throughout. All data, code, and text will be made available.
At the workshop’s conclusion attendees will have worked through exercises that may serve as templates to be used with their data as they build their data narratives.
Presenter: Paul J Kowalczyk
Pre-requisites: Basic familiarity with R and Python
Downloads: R (RStudio) and Python (Jupyter)
5. Satellite Imagery Analysis with Python (Intermediate, Python)
Participants will learn the basics of working with geospatial data in Python. They will learn how to generate basic analytics using both vector (e.g. points, lines, and polygons) and raster (e.g. satellite imagery) datasets. The workshop will also discuss how to prepare imagery and labels for training machine learning models. Throughout the workshop, attendees will be introduced to indispensable open-source geospatial libraries like GDAL, Rasterio, GeoPandas, and Shapely.
Presenters: Simon Kassel, Ross Bernet
Pre-requisites: Basic python knowledge
Downloads: Docker (recommended) or the following Python libraries: Rasterio, Shapely, GeoPandas, Numpy and Jupyter; and command-line tools: GDAL, AWS CLI
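As a taste of the raster side, band math such as NDVI reduces to plain NumPy once the bands are in memory. The arrays below are synthetic stand-ins; with real imagery they would be read with Rasterio:

```python
import numpy as np

# Synthetic red and near-infrared bands (real ones would come from rasterio)
red = np.array([[0.1, 0.2], [0.3, 0.4]])
nir = np.array([[0.5, 0.6], [0.2, 0.8]])

# NDVI = (NIR - Red) / (NIR + Red), a standard vegetation index in [-1, 1];
# higher values indicate denser vegetation
ndvi = (nir - red) / (nir + red)
print(ndvi)
```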
6. Thinking Like a Data Science Expert (Intermediate, Python)
Many programming books on data science present analyses as a static, final code document. However, screencasts of experts doing scrappy, hour-long analyses reveal many dynamic processes at work. Through analyzing and reproducing the recorded work of an expert, attendees will learn how to systematically address knowledge gaps and fluently move from thought to action on the path to expertise.
Presenter: Michael Chow
Pre-requisites: Some familiarity with pandas, including calculating grouped means, grouped transforms, and possibly inner joins.
Downloads: Install a list of packages to run in JupyterLab, or run a Docker container.
Afternoon Workshop Descriptions
7. A Tour of the Tidyverse – data wrangling and visualization in R (Intermediate, R)
This two-part workshop covers two intermediate R topics using tidyverse packages. In part 1, participants will learn how to ingest, clean, and reshape difficult data. In part 2, we will cover more advanced topics in data visualization in R, specifically focusing on the ggplot2 package, including tips and tricks, common ggplot2 pitfalls, library extensions, and best practices.
Presenters: Jake Riley (lead), Alice Walsh
Pre-requisites: The workshop assumes introductory knowledge of R (have completed something like “R for Data Science” and have used R at work or for fun). A laptop is required. The workshop can be completed either locally in RStudio or online with RStudio Cloud.
Downloads: Participants are invited to bring their own data and data cleaning problems for discussion.
8. Hyperparameter Optimization (Intermediate, Python)
Attendees will be able to:
- Define a hyperparameter and explain why they are important
- Describe specific hyperparameters
- Understand different methods for hyperparameter tuning
- Perform hyperparameter optimization
- Build a model from scratch while incorporating hyperparameter optimization
Presenters: Ben Attix (lead), Sayandeep Acharya, Nick Ceneviva, Ankush Israney
Pre-requisites: Attendees should understand what tree-based models (e.g. Random Forest) are from a high level. An in-depth understanding is NOT required.
Downloads: Anaconda, Git Bash (for Windows users only)
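The bullet points above can be condensed into a few lines of scikit-learn: grid search is the simplest tuning method the workshop covers, shown here on a Random Forest with a deliberately tiny grid (a sketch, not the workshop's actual notebook):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Two Random Forest hyperparameters and a small grid of candidate values
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

# Try every combination with 3-fold cross-validation, keep the best
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Random and Bayesian search follow the same fit/score pattern with smarter sampling of the grid.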
9. Why is t-test called t-test? (Intermediate, R or Python)
Statistical tests are one of the key tools in a Data Scientist’s tool chest, but many people find it difficult to understand when and how to apply them. This workshop will help intermediate (and beginning) Data Scientists deepen their knowledge of statistical tests using computational techniques. It will be especially helpful for people who took an introductory college-level statistics class and want to learn more. The workshop will cover topics like:
- Random number generation and resampling techniques;
- Monte-Carlo integration;
- Sampling distribution;
- Central limit theorem; and
- Lesser-known distributions like the chi-square.
This will be similar to Jake VanderPlas’s PyCon 2016 talk, Statistics for Hackers, with some hands-on elements; it will also include analytical aspects of statistical tests.
Presenter: Junghoon Lee
Pre-requisites: Some experience in programming in R or Python is required.
Downloads: ggplot for R and scipy and matplotlib for Python
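In the Statistics-for-Hackers spirit, a two-sample comparison can be simulated with resampling instead of looking up a t table. A minimal one-sided permutation test on made-up data (a sketch of the idea, not the workshop's materials):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, size=30)  # group A
b = rng.normal(loc=2.0, size=30)  # group B, with a shifted mean

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

# Permutation test: shuffle the group labels many times and count how
# often a difference at least as large as the observed one arises by chance
count = 0
n_iter = 5000
for _ in range(n_iter):
    rng.shuffle(pooled)
    diff = pooled[30:].mean() - pooled[:30].mean()
    if diff >= observed:
        count += 1

p_value = count / n_iter
print(p_value)  # small p-value: the shift is unlikely under the null
```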
10. Intro to SQL and Snowflake (Beginner, SQL)
SQL is the back-end language for direct data manipulation for many major websites, databases, and computer systems around the world. This workshop is intended to give you a clear understanding of basic SQL and database concepts. This workshop will utilize Snowflake, a data storage and management tool for the cloud.
Presenter: Kayleigh Smoot
Pre-requisites: None
Downloads: N/A
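The basic SQL statements the workshop introduces can be tried at home with Python's built-in sqlite3 module (Snowflake itself requires an account, so this sketch uses an in-memory SQLite database instead; the table and rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE, INSERT, and SELECT: the core statements of basic SQL
cur.execute("CREATE TABLE workshops (title TEXT, level TEXT)")
cur.executemany(
    "INSERT INTO workshops VALUES (?, ?)",
    [("Intro to SQL and Snowflake", "Beginner"),
     ("Hyperparameter Optimization", "Intermediate")],
)

# Filter rows with a WHERE clause
cur.execute("SELECT title FROM workshops WHERE level = 'Beginner'")
rows = cur.fetchall()
print(rows)  # [('Intro to SQL and Snowflake',)]
conn.close()
```

The same SELECT syntax carries over to Snowflake; only the connection setup differs.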
11. Generative Adversarial Models (Advanced, Python)
We’re going to build a GAN from scratch. Bring your own dataset of small, standardized images and you can build one too.
Presenter: Andrew Crane-Droesch
Pre-requisites: Know how neural nets work
Downloads: TensorFlow 2.0 running on Python. Try copy-pasting the code from one of Google’s MNIST tutorials to make sure it works.
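To preview the adversarial training loop, here is a deliberately tiny sketch in plain NumPy: an affine generator tries to imitate a 1-D Gaussian while a logistic discriminator tries to tell real from fake, with hand-derived gradients. The workshop itself builds a real GAN in TensorFlow on images; this toy setup is our own illustration of the loop's structure only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
lr, batch = 0.05, 64

# Real data: samples from N(3, 1); the generator must learn to imitate it.
# Generator g(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0
w, c = 0.0, 0.0

for step in range(2000):
    x_real = rng.normal(3.0, 1.0, batch)
    z = rng.normal(size=batch)
    x_fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w -= lr * np.mean((d_real - 1) * x_real + d_fake * x_fake)
    c -= lr * np.mean((d_real - 1) + d_fake)

    # Generator step: push D(fake) toward 1 (non-saturating loss)
    d_fake = sigmoid(w * x_fake + c)
    grad_x = -(1 - d_fake) * w  # dL_G / dx_fake
    a -= lr * np.mean(grad_x * z)
    b -= lr * np.mean(grad_x)

samples = a * rng.normal(size=1000) + b
print(np.mean(samples))  # should drift toward the real mean
```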
12. Social media and crowd sourced digital data in health care (Intermediate, Python)
Using Facebook language to predict depression? Twitter as a potential data source for cardiovascular disease research? The relationship between Google search volume and cancer incidence? Learn how researchers at the Penn Medicine Center for Digital Health are mapping the human digital existence to find solutions to problems that lie at the intersection of health care, technology, and society. Join us for an introduction to NLP and ML fundamentals, practice extracting, organizing, preprocessing, and analyzing real data, and engage in a lively discussion about digital data and society.
Presenters: Lauren Southwick, Arthur Pelullo, Andy Anietie, Sharath Guntuku (lead), Elissa Klinger, Haley McCalpin
Pre-requisites: Basic python knowledge and familiarity with scikit-learn and pandas is necessary for this workshop.
Downloads: Python, Anaconda 3.0
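The NLP + ML pipeline the workshop practices boils down to turning text into features and fitting a classifier. A minimal scikit-learn sketch on a hypothetical mini-corpus (the posts and labels are invented, not Penn's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-corpus of posts labeled health-related (1) or not (0)
posts = [
    "feeling tired and sad all week",
    "great game last night",
    "my doctor changed my medication again",
    "new movie trailer looks fun",
]
labels = [1, 0, 1, 0]

# Turn raw text into word-count features, then fit a simple classifier
vec = CountVectorizer()
X = vec.fit_transform(posts)
clf = LogisticRegression().fit(X, labels)

# Score a new, unseen post
print(clf.predict(vec.transform(["so tired of this medication"])))
```

Real studies swap in far larger corpora, richer features, and careful validation, but the extract-vectorize-model skeleton is the same.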
Many thanks to our sponsors for helping to make these workshops free to our community!
Wharton Research Data Services – https://wrds-www.wharton.upenn.edu/
Analytics at Wharton – https://www.wharton.upenn.edu/analytics/
Slalom Consulting – https://www.slalom.com/locations/philadelphia
HVH Precision Analytics – https://www.hvhprecision.com/careers/
Pinnacle 21 – https://www.pinnacle21.com/
Promptworks – https://www.promptworks.com/jobs
Elastic – https://www.elastic.co/
Jornaya – https://www.jornaya.com/
High Availability, Inc. – https://hainc.com/
Azavea – https://careers.azavea.com/
Linode – https://www.linode.com/