
Data scientist, social scientist, and educator who cannot think of anything witty to put here. https://www.linkedin.com/in/scott-adams-phd/
Photo by JOSHUA COLEMAN on Unsplash

Knowing Structured Query Language, or SQL, is foundational for many data professionals, including data analysts, data engineers, and data scientists. Why? In many organizations, data is stored in relational databases, and SQL is the standard language for querying and pulling data from them. However, I have encountered a number of people who are unsure how to start learning SQL because they have never worked with a relational database. Fortunately, there is a relatively straightforward way to create and manage your own relational database, which I will illustrate in this article. …


Hands-on Tutorials

Have you ever wondered how to extract information from a scikit-learn pipeline?

Photo by Mika Baumeister on Unsplash

Written By Scott A. Adams and Aamodini Gupta

Introduction

Scikit-learn pipelines are useful tools that bring extra efficiency and simplicity to data science projects (if you are unfamiliar with scikit-learn pipelines, see Vickery, 2019 for a great overview). Pipelines can combine and structure multiple steps, from data transformation to modeling, all within a single object. Despite their overall usefulness, there can be a learning curve to using them. In particular, peeking into the individual pipeline steps and extracting important pieces of information from those steps is not always an intuitive process. …
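As a minimal sketch of what "peeking into" a pipeline can look like, the example below fits a two-step pipeline and then reads fitted attributes from each step through `named_steps`. The step names (`"scaler"`, `"model"`), the choice of estimators, and the iris data are assumptions made for illustration, not the pipeline used in the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Hypothetical two-step pipeline: scale the features, then fit a classifier.
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X, y)

# Each fitted step can be looked up by the name it was given.
scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["model"]

step_names = [name for name, _ in pipe.steps]
feature_means = scaler.mean_   # per-feature means learned by the scaler
coefficients = model.coef_     # one row of weights per class

print(step_names)
print(feature_means.shape, coefficients.shape)
```

Steps can also be reached by position (for example `pipe[0]` or `pipe[-1]`), which may be convenient when the step names are not known in advance.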


Data-driven does not mean data drives itself

Photo by Julian Hochgesang on Unsplash

Introduction

For all the hype around data and data-hyphenated terms (like “data-driven”), it is important to remember that data is a raw resource with no actualized value until it is integrated into a product that uses it to generate a meaningful output. Though the specific roles and responsibilities of data scientists vary from organization to organization (and even within organizations), data scientists are generally the ones responsible for transforming data from a raw resource into something of value for product users. Indeed, data science is not so much the scientific study of data itself as it…


Spoiler alert: No

Photo by Element5 Digital on Unsplash

Election polling has again come under scrutiny after several discrepancies between polling predictions and election outcomes in the 2020 election. First, a number of presidential races in battleground states such as Michigan and Wisconsin turned out to be tighter than polling indicated. Second, in Florida and North Carolina, Biden was projected to win according to pre-election polling but ended up losing. Third, Trump had convincing vote share margins in states such as Ohio and Texas, which were thought to be closer races based on polling (see Silver, 2020).

Nate Silver argues that while there have been divergences between projected presidential…


Photo by Antonio Grosz on Unsplash

Introduction

Many problems that data scientists, statisticians, and other data practitioners encounter require determining whether the observations of interest are likely to belong to one category or another on some outcome. Examples include assessments of creditworthiness (e.g., will a potential borrower default on their debt?), the flagging of credit card purchases that may be fraudulent, and object classification (e.g., is this plant an Iris setosa or not?). While it is technically possible to calculate a probability of belonging to one category versus another using a linear regression model, a more appropriate regression technique is logistic regression. …
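One way to see why logistic regression is usually the better fit is to compare what the two models return for the same linear predictor. The sketch below is illustrative only: the intercept and slope are hypothetical values rather than fitted estimates, and the function names are invented for this example.

```python
import math


def linear_prediction(x, intercept, slope):
    """Linear model output, read directly as a 'probability'."""
    return intercept + slope * x


def logistic_probability(x, intercept, slope):
    """Logistic model: the same linear predictor passed through the sigmoid."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Hypothetical coefficients, chosen only to illustrate the difference.
intercept, slope = -0.5, 0.3
xs = [-5, 0, 2, 10]

linear = [linear_prediction(x, intercept, slope) for x in xs]
logistic = [logistic_probability(x, intercept, slope) for x in xs]

for x, lin, log in zip(xs, linear, logistic):
    print(f"x={x:>3}  linear={lin:6.2f}  logistic={log:6.3f}")
```

With these values the linear output falls below 0 and above 1 at the extremes of `xs`, whereas the logistic output always stays strictly between 0 and 1 and can therefore be interpreted as a probability.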


Photo by Twitter: @jankolario on Unsplash

Transforming spreadsheets into queryable database tables

Introduction

A relational database is a collection of data tables (sets of rows and columns that store individual pieces of data) that can be connected to each other. In this way, a relational database is not entirely unlike an Excel workbook with related datasets stored across multiple worksheets. With that in mind, this post works through an example that uses Python to transform an Excel spreadsheet into a database that can be queried using Structured Query Language (SQL).
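To make the workbook analogy concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module. The two small tables stand in for two related worksheets; the table names, column names, and rows are invented for illustration and are not taken from the dataset used in the post, and the post itself may load its data differently.

```python
import sqlite3

# Hypothetical rows standing in for two related worksheets.
customers = [(101, "Ada"), (102, "Grace")]
orders = [(1, "Chair", 2, 101), (2, "Desk", 1, 102), (3, "Lamp", 4, 101)]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        product TEXT,
        quantity INTEGER,
        customer_id INTEGER REFERENCES customers (customer_id)
    )
    """
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)

# The shared customer_id column is what connects the two tables.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.quantity) AS total_quantity
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

print(rows)
```

The `JOIN` on `customer_id` plays the role that a lookup between worksheets would play in Excel, except that the relationship is expressed in the query itself.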

Data

The example in this post uses data from the Superstore-Sales dataset, which can be found here. This dataset is stored…


Photo by Ryan Searle on Unsplash

Introduction

The first time I explored regression in Python I dove headfirst into scikit-learn, a package that provides a number of useful tools for developing predictive models. I ran a simple linear regression model and output my intercept, coefficients, and model fit metrics. Being a newcomer to Python, coming from a background heavily focused on statistical inference, and not yet fully grasping the differences between statistics and data science, I then spent a good amount of time looking for ways to output the standard errors, confidence intervals, and p-values of the regression weights.

In my search for information across multiple sources…

Scott A. Adams
