This is the first part of our series on automated classification in Python, in which we show by example how to classify a given record using machine learning classification methods.
In the following article, we show the parsing and processing of the "Adult" dataset for classification. Our script, along with the datasets and documentation, is published on GitHub.
The dataset comes from the University of California Irvine Machine Learning Repository, which currently contains 473 datasets (last accessed May 10, 2019) for machine learning applications. The Adult dataset is based on US census data. The goal is to use the data provided to determine whether an individual earns more or less than $50,000 per year.
Data profile
The first step before starting the actual classification is to look at the structure of the dataset. It consists of roughly 49,000 personal records and is already divided into training and test data.
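As a reference, here is a minimal sketch of how the raw data might be loaded with Pandas, assuming the files from the UCI repository and the column names from its documentation (our actual script on GitHub may differ in details):

import pandas as pd

# Column names as documented in the UCI repository for the Adult dataset
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income"]

BASE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"

# "adult.data" holds the training data, "adult.test" the test data;
# skipinitialspace removes the blank after each comma in the raw files,
# and the first line of "adult.test" is a comment that must be skipped.
traindata = pd.read_csv(BASE_URL + "adult.data", names=COLUMNS,
                        skipinitialspace=True)
testdata = pd.read_csv(BASE_URL + "adult.test", names=COLUMNS,
                       skipinitialspace=True, skiprows=1)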
Part of the data (approx. 7.5%) is incomplete because individual data points (features) were not provided. Due to the relatively small number of affected records, we will simply ignore them in the first analysis.
The personal data consist of continuous and categorical features. The continuous features are: age, final weight, years of schooling, capital gains, capital losses, and hours per week. The categorical features are: workclass, education, marital status, occupation, relationship, race, sex, and native country.
Our target variable is an individual's income, specifically whether they earn more or less than $50,000 per year. Since the target variable can only take two different values, this is a binary classification problem. Within the dataset, the ratio of those earning less than $50,000 to those earning more is about 3:1.
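This ratio is easy to verify once the data is loaded; a quick check, assuming the "income" column name from the loading sketch above:

# Relative frequency of the two income classes in the training data
print(traindata["income"].value_counts(normalize=True))
# <=50K    ~0.76
# >50K     ~0.24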
Analysis of the individual features
When analyzing the individual features, the "final weight" feature stood out: it groups similar people according to socioeconomic factors and depends on the US state in which the person lives. Due to the relatively small dataset and the imprecise documentation of the underlying calculation, we decided not to consider this feature in the first pass. A later direct comparison showed that omitting this feature improved the classification results in individual cases, but never worsened them.
To solve the problem of predicting a person's income from the above features, we use a supervised machine learning approach, because we have a large amount of labeled data. The algorithm can thus estimate how the individual features relate to the target. In the second part of this series, we will present some methods that we have already covered in our blog. However, all of these methods first require very careful preprocessing of the data so that our model can evaluate it and interpret values such as "Monday" or "Tuesday". This is referred to as "cleaning" the data.
Data preprocessing
First, we need to preprocess our data in order to apply the different machine learning models to it. The models compare individual features of the data with each other in order to determine their relationship to the target. For this, the data must be available in a uniform form to enable comparability. This is why we speak of data cleansing.
The following function cleans our data. We explain how it works in the following sections:
def setup_data(self):
    """Set up the data for classification."""
    traindata = self.remove_incomplete_data(self.traindata)
    testdata = self.remove_incomplete_data(self.testdata)
    self.y_train = self.set_target(traindata)
    self.y_test = self.set_target(testdata)
    # Create dummies of the combined training and test data,
    # with the target variable and "fnlwgt" removed
    fulldata = self.get_dummies(
        traindata.append(testdata, ignore_index=True)
                 .drop(self.target, axis=1)
                 .drop("fnlwgt", axis=1),
        self.categorical_features)
    self.x_train = fulldata[0:len(traindata)]
    self.x_test = fulldata[len(traindata):len(fulldata)]
Although our dataset is already split 2:1 into a training and a test set, we temporarily need to merge the two in order to create the dummy variables, and afterwards split them again in the same ratio. This procedure has the decisive advantage that the resulting datasets are guaranteed to have the same shape and dimensionality. Otherwise, if a feature value is missing from the training or test dataset, the new dataset may have fewer columns, or columns with the same index may represent different feature values. As a result, the comparability of the two datasets would be lost.
There are also some unknown values in the dataset that we need to address specifically. However, the proportion of records with unknown values is relatively small (<10%), so we can simply exclude them from the dataset. We achieve this in the "setup_data" function by calling our "remove_incomplete_data" function:
def remove_incomplete_data(self, data):
    """Remove all rows that contain at least one "?"."""
    return data.replace("?", np.nan).dropna(axis=0, how="any")
This removes all rows that contain at least one "?" from the dataset. We do this to ensure that the algorithm always receives relevant data and does not infer relationships between unknown values. When the dummy variables are created later, these unknowns would otherwise be treated as equal values rather than interpreted as unknown. After running the function, our dataset consists of 45,222 entries, as opposed to the previous 48,842.
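As a quick sanity check of this step, the same replace/dropna logic can be applied to the combined data and the row counts compared (a sketch, using the counts from the text above):

import numpy as np

raw = traindata.append(testdata, ignore_index=True)
clean = raw.replace("?", np.nan).dropna(axis=0, how="any")
print(len(raw), len(clean))  # 48842 45222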
Assign the target variable
In the second part of the "setup_data" function, we use the "set_target" function to assign the target variable the value 0 or 1, depending on whether someone earns more or less than $50,000 per year.
def set_target(self, data):
    """Set the values of the target variable (0 or 1 for the two cases)."""
    for i in range(len(data[self.target].unique())):
        data[self.target] = np.where(
            data[self.target] == data[self.target].unique()[i],
            i, data[self.target])
    return data[self.target].astype("int")
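Since this loop depends on the order in which "unique" returns the labels, an alternative (a sketch, under the assumption that the raw labels are "<=50K" and ">50K", with a trailing period in the test file) is to map them explicitly:

def set_target_explicit(self, data):
    """Hypothetical alternative: map the income labels explicitly to 0/1."""
    # The labels in "adult.test" carry a trailing period (e.g. ">50K."),
    # which rstrip removes before the comparison.
    labels = data[self.target].str.rstrip(".")
    return (labels == ">50K").astype("int")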
Replace categorical values with dummy variables
Before we can start classifying the data, we need to make sure that our model can handle categorical values. To do this, we generate so-called dummy variables from all categorical variables using one-hot encoding. Each possible value of a categorical variable is assigned its own variable: instead of a single variable that can take several values, there are several variables that can only take the value 0 or 1, each representing one categorical value of the substituted variable.
Motivation
An example: we have an object of type "date" with the feature weekday = {'Monday', 'Tuesday', 'Wednesday', ...}. After creating the dummy variables, the "weekday" feature no longer exists. Instead, each possible value is represented by its own feature; in our example these are weekday_tuesday, ..., weekday_sunday. Depending on which weekday an object had before the conversion, the corresponding variable is set to 1 and the rest to 0.
At this point, the attentive reader is probably wondering why there is no weekday_monday feature. The simple reason for the omission is that an object's value of weekday_monday can be implicitly concluded from the negative assignment of all other features. Another benefit is that strong interdependence between the variables (multicollinearity) is avoided. Multicollinearity can negatively affect the outcome, since a strong dependency makes it difficult to determine the exact effect of a particular variable in a model. Creating the dummy variables is necessary because, as already mentioned, a model has no notion of a weekday or how to interpret it. Once the dummy variables have been created, this no longer matters, since the algorithm only distinguishes whether a feature of an object has the value 0 or 1. This makes it possible to compare individual objects via their respective features.
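The weekday example, sketched with hypothetical data, shows this behavior of Pandas' "get_dummies":

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Sunday"]})

# drop_first=True omits the first category (here: Monday); it is implied
# whenever all remaining dummy columns are 0.
print(pd.get_dummies(df["weekday"], prefix="weekday", drop_first=True))
#    weekday_Sunday  weekday_Tuesday
# 0               0                0   <- Monday, implied
# 1               0                1
# 2               1                0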
Implementation
In the final part of our setup_data function, we create the dummies by calling the get_dummies function as follows:
def get_dummies(self, data, categorical_features):
    """Create the dummies of the categorical features for the given data."""
    for feature in categorical_features:
        # Create dummy variables with pd.get_dummies and remove the
        # original categorical variable with DataFrame.drop
        data = data.join(
            pd.get_dummies(data[feature], prefix=feature, drop_first=True)
        ).drop(feature, axis=1)
    return data
We loop over all categorical variables in the dataset. In each pass, we append all dummy variables of the respective categorical variable to the dataset using the Pandas function "get_dummies" and then remove that categorical variable. After the loop completes, our dataset no longer contains any categorical variables; instead, it contains the corresponding dummy variables. So, from the original features:
            age  workclass
Person1      39  Local-gov
Person2      50  Federal-gov
we obtain the following:
            age  workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked
Person1      39                      0                    1                       0
Person2      50                      1                    0                       0
Here, the reason for temporarily merging the two datasets becomes clear once more: if, for example, the value "Local-gov" only occurs in one of the datasets, the datasets created when the dummy variables are generated will have different dimensionality, since the other dataset will be missing the entire column.
For example, if the model learns a strong association between "Local-gov" and an income greater than $50,000, that association would shift to whichever feature takes the place of "Local-gov" in the other dataset. This would most likely lead to incorrect results, and would certainly lead to incorrect associations.
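A small illustration with hypothetical data: encoding the two datasets separately yields incompatible columns, while encoding the merged data and splitting afterwards does not.

import pandas as pd

train = pd.DataFrame({"workclass": ["Local-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Federal-gov", "Private"]})

# Encoded separately, the column sets differ:
print(pd.get_dummies(train["workclass"]).columns.tolist())  # ['Local-gov', 'Private']
print(pd.get_dummies(test["workclass"]).columns.tolist())   # ['Federal-gov', 'Private']

# Encoded on the merged data and split afterwards, they are identical:
full = pd.get_dummies(train.append(test, ignore_index=True)["workclass"])
x_train, x_test = full[:len(train)], full[len(train):]
print(x_train.columns.equals(x_test.columns))  # True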
In the last part of the "setup_data" function, we split the data back into a training and a test dataset.
self.x_train = fulldata[0:len(traindata)]
self.x_test = fulldata[len(traindata):len(fulldata)]
In the second part, we discuss how to apply the prepared data to different classifiers and then compare and evaluate the results.