This is the first part of our automated classification in Python series, in which we show how to classify a given record using machine learning classification methods in Python.
In this article, we cover the parsing and preprocessing of the "Adult" data set for classification. Our script, along with the data sets and documentation, is published on GitHub.
The data set comes from the University of California, Irvine Machine Learning Repository, which currently contains 473 data sets (last accessed May 10, 2019) for machine learning applications. The Adult data set is based on US census data. The goal is to use the provided attributes to determine whether an individual earns more or less than $50,000 per year.
Data profile
The first step we take before starting the actual classification is to look at the structure of the data set. It consists of around 45,000 personal records and is already divided into training and test data.
Part of the data (approx. 7.5%) is incomplete because individual feature values were not provided. Due to the relatively small number of affected records, we initially simply ignore them in the analysis.
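As a minimal sketch of this step, the following snippet reads two records in the format of the UCI "adult.data" file (the rows here are a hypothetical excerpt, inlined so the example is self-contained) and drops the incomplete one. In the original file, missing values are encoded as "?":

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the format of the UCI "adult.data" file;
# the second row has missing values, encoded as "?" in the original data.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, ?, 209642, HS-grad, 9, Married-civ-spouse, ?, Husband, White, "
    "Male, 0, 0, 45, United-States, >50K\n"
)
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]
df = pd.read_csv(sample, names=columns, skipinitialspace=True)

# Treat "?" as a missing value and drop incomplete records
df = df.replace("?", pd.NA).dropna()
print(len(df))  # the second record contains "?" and is removed -> 1
```

In practice one would point `read_csv` at the downloaded `adult.data` file instead of the inlined string.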
The personal records consist of continuous and categorical features. Continuous features are: age, final weight (fnlwgt), years of schooling (education-num), capital gains, capital losses, and hours per week. Categorical features are: employment (workclass), degree (education), marital status, occupation, relationship, race, sex, and native country.
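The two feature groups above can be written down as plain Python lists; the column names below follow the UCI repository's documentation of the Adult data set:

```python
# Continuous and categorical feature names as documented
# in the UCI Adult data set description
continuous_features = [
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
]
categorical_features = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country",
]
```

Keeping these lists explicit makes it easy to apply different preprocessing to each group later.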
Our target variable is an individual’s income, specifically whether they earn more or less than $50,000 per year. Since the target can take only two values, this is a binary classification problem. Within the data set, the ratio of those earning less than $50,000 to those earning more is about 3:1.
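A binary target like this is typically encoded as 0/1. The sketch below uses a small hypothetical income column (chosen to mirror the roughly 3:1 class ratio described above) rather than the real data set:

```python
import pandas as pd

# Hypothetical income column with the same 3:1 class ratio
# the article describes for the real data set
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])

# Binarize the target: 1 if the person earns more than $50,000, else 0
y = (income == ">50K").astype(int)
print(y.tolist())  # [0, 0, 0, 1]

# Class balance as proportions
print(income.value_counts(normalize=True))
```

Checking the class balance this way is a useful sanity check before training, since a 3:1 imbalance already means a trivial classifier reaches about 75% accuracy.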
Analysis of feature properties
When analyzing the individual features, the “final weight” (fnlwgt) feature stood out: it groups similar people according to socioeconomic factors and depends on the US state in which the person lives. Given the relatively small data set and the imprecise documentation of the underlying calculation, we decided not to include this feature in the first run. A direct comparison later showed that omitting this feature improved the classification results in individual cases and never worsened them.
To predict a person’s income from the above features, we use a supervised machine learning approach, because we have a large amount of labeled data. The algorithm can thus estimate how the individual features relate to the target. In the second part of this article, we will present some methods that we have already covered in our blog. However, all of these methods first require careful preprocessing of the data so that our model can evaluate it and interpret values such as “Monday” or “Tuesday”. This is referred to as “cleaning” the data.
Data preprocessing
First, we need to preprocess our data in order to apply the different machine learning models to it. The models compare individual features of the records with each other in order to determine their connection to the target. For this, the data must be available in a uniform form to enable comparability. That is why we speak of data cleansing.
The following function cleans our data. We explain how it works in the sections below:
def setup_data(self):
    """Prepare the data for classification."""
    traindata = self.remove_incomplete_data(self.traindata)
    testdata = self.remove_incomplete_data(self.testdata)
    self.y_train = self.set_target(traindata)
    self.y_test = self.set_target(testdata)
    # Create dummy variables from the combined training and test data,
    # with the target variable and "fnlwgt" removed
    fulldata = self.get_dummies(
        traindata.append(testdata, ignore_index=True)
                 .drop(self.target, axis=1)
                 .drop("fnlwgt", axis=1),
        self.categorical_features,
    )
    self.x_train = fulldata[0:len(traindata)]
    self.x_test = fulldata[len(traindata):len(fulldata)]
Although our data set is already split into a training and a test set at a ratio of about 2:1, we temporarily merge the two to create the dummy variables, and later split them back at the same ratio. This procedure has the decisive advantage that the resulting data sets always have the same shape and dimensionality. Otherwise, if a feature value is missing from the training or test set, the encoded data set may have fewer columns, or columns with the same index may represent different feature values. As a result, the comparability of the two data sets is lost.
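The effect described above can be demonstrated with pandas' `get_dummies`. In this hypothetical example, the test data happen to lack the workclass value "State-gov", so encoding the two sets separately yields different columns, while encoding the concatenated data keeps both halves aligned:

```python
import pandas as pd

# Hypothetical example: the test data lack the value "State-gov"
train = pd.DataFrame({"workclass": ["Private", "State-gov"]})
test = pd.DataFrame({"workclass": ["Private", "Private"]})

# Encoding each set separately yields different columns
print(pd.get_dummies(train).columns.tolist())
# ['workclass_Private', 'workclass_State-gov']
print(pd.get_dummies(test).columns.tolist())
# ['workclass_Private']

# Encoding the concatenated data and splitting afterwards keeps
# both sets in the same shape with identical columns
full = pd.get_dummies(pd.concat([train, test], ignore_index=True))
x_train = full[:len(train)]
x_test = full[len(train):]
assert list(x_train.columns) == list(x_test.columns)
```

Note that this sketch uses `pd.concat`, the current pandas idiom; the function shown earlier uses the older `DataFrame.append`, which recent pandas versions have removed.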
There are also some unknown values in the data set that we need to address specifically. However, the proportion of records with unknown values is relatively small (