diff --git a/docs/index.html b/docs/index.html
index 1dcf51d..9502c77 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -215,7 +215,7 @@

-
+
@@ -289,7 +289,7 @@

-Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to… +Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make…
diff --git a/docs/posts/Deduplipy.html b/docs/posts/Deduplipy.html
index c4018f6..115a208 100644
--- a/docs/posts/Deduplipy.html
+++ b/docs/posts/Deduplipy.html
@@ -183,7 +183,7 @@

Deduplication of records using DedupliPy

-

+

Deduplication or entity resolution is the task of combining different representations of the same real-world entity. The Python package DedupliPy implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset. In this post I demonstrate how the package works and show some more advanced settings. In case you want to apply entity resolution to large data in Spark, please have a look at Spark-Matcher, a package I developed together with two colleagues.
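The basic pattern looks like this; a minimal sketch, assuming the Deduplicator interface from the package documentation and a made-up toy DataFrame:

import pandas as pd
from deduplipy.deduplicator import Deduplicator

# toy data: two spellings of the same person (made up for illustration)
df = pd.DataFrame({
    'name': ['john smith', 'jon smith', 'jane doe'],
    'address': ['main street 1', 'main st 1', 'side road 2'],
})

myDedupliPy = Deduplicator(['name', 'address'])
myDedupliPy.fit(df)               # starts an interactive active-learning session
result = myDedupliPy.predict(df)  # returns the records with a cluster id per entity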

Installation

diff --git a/docs/posts/Deduplipy_files/figure-html/9e137887-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png b/docs/posts/Deduplipy_files/figure-html/957bd8db-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png
similarity index 100%
rename from docs/posts/Deduplipy_files/figure-html/9e137887-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png
rename to docs/posts/Deduplipy_files/figure-html/957bd8db-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png
diff --git a/docs/posts/taxonomy_encoder_blog.html b/docs/posts/taxonomy_encoder_blog.html
index 329f2db..6df50df 100644
--- a/docs/posts/taxonomy_encoder_blog.html
+++ b/docs/posts/taxonomy_encoder_blog.html
@@ -162,6 +162,7 @@

On this page

  • Taxonomy Encoder
  • Hyperparameter tuning
  • Taxonomy Encoder for binary classification
+  • Final remark

@@ -187,12 +188,12 @@

    Taxonomy feature encoding

    Introduction

    -

    Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, that doesn’t mean that there is valuable ranking in these values. If we encode taxonomy bearing features use One-Hot-Encoding, the number of features blows up tremendously. Moreover, we loose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commerical clothing’ (5137) and ‘Commerical footwear’ (5139) are clearly more similar than for example ‘Child Care services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but dependent on the modelling task, we might want to have higher granularity in specific parts of the possible feature values.

    +

    Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, but that doesn’t mean there is a meaningful ranking in these values. If we encode taxonomy bearing features using One-Hot-Encoding, the number of features blows up tremendously. Moreover, we lose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commercial clothing’ (5137) and ‘Commercial footwear’ (5139) are clearly more similar to each other than to, for example, ‘Child Care Services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but depending on the modelling task, we might want higher granularity in specific parts of the possible feature value range.
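    To make the coarser alternative concrete, a small sketch (with a made-up handful of MCC codes) of One-Hot-Encoding at full granularity versus at a higher taxonomy level:

    import pandas as pd

    # made-up sample of MCC codes
    df = pd.DataFrame({'mcc': [5137, 5139, 8351]})

    # full granularity: one indicator column per distinct code
    full = pd.get_dummies(df['mcc'].astype(str))

    # coarser level: keep only the leading digit (5xxx, 8xxx, ...)
    coarse = pd.get_dummies((df['mcc'] // 1000).astype(str))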

    To overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application: house price prediction in the Netherlands.

    House price prediction

    -

    When predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. Figure 1 shows average house prices in the Netherlands in 2023 per zipcode (source: CBS). We know from experience that in cities house prices in different zones can differ a lot, even if the zones are only a few kilometers away. In more rural areas, these differences are often much less prevalent. To illustrate this, Figure 2 zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices in the lower end and the most expensive house in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature; we need different granularity for cities than for the country side. The question is how to choose this granularity.

    +

    When predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. Figure 1 shows average house prices in the Netherlands in 2023 per zipcode (source: CBS). We know that in cities house prices in different zones differ a lot, even if the zones are only a few kilometers apart. In more rural areas, these differences are often much less prevalent. To illustrate this, Figure 2 zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices in the lower end as well as the most expensive houses in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature: we need different granularity for cities than for the countryside. The question is how to choose this granularity.

    @@ -226,7 +227,7 @@

    House price predict

    Figure 2: House price average per zip code; note the difference in price homogeneity in cities vs rural areas

    -

    Let’s use a decision tree regressor to create segments of zipcodes that are homogenous with respect to mean house prices. As I’m lacking a dataset with house prices of individual houses, I’m going to create such dataset by concatenating the CBS dataset 10 times and multiply house prices with a random factor between 0.9 and 1.1 to introduce some variation. The decision tree regressor is fitted with max_leaf_nodes set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (Table 1) and the least expensive areas (Table 2). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all values between 1000 and 9999 are used as zipcodes!) and the cities where these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam has even three distinct zipcode areas in the top 10. If we would use these in-sample generated zipcode encodings in our model, we would make the unforgivable mistake of information leakage. The house price of each house is used to generate a feature that is used to predict that same house price. This is where the Taxonomy Encoder comes into play.

    +

    Let’s use a decision tree regressor to create segments of zipcodes that are homogeneous with respect to mean house prices. As I’m lacking a dataset with house prices of individual houses, I’m going to create such a dataset by concatenating the CBS dataset 10 times and multiplying house prices by a random factor between 0.9 and 1.1 to introduce some variation. The decision tree regressor is fitted with max_leaf_nodes set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (Table 1) and the least expensive areas (Table 2). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all possible values are used as zipcodes!) and the cities these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam even has three distinct zipcode areas in the country’s top 10. If we used these in-sample generated zipcode encodings in our model, we would make the mistake of information leakage: the house price of each house would be used to generate a feature that is used to predict that same house price. This is where the Taxonomy Encoder comes into play.
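    A sketch of this segmentation step; the tiny cbs stand-in and its prices are made up, the real exercise uses the full CBS dataset:

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    # made-up stand-in for the CBS data: one average price per zipcode
    cbs = pd.DataFrame({'zipcode': [1071, 1358, 2243, 6369, 9541],
                        'price': [1_015_000, 1_138_000, 1_340_000, 206_000, 247_000]})

    # concatenate 10 times and perturb prices by a factor between 0.9 and 1.1
    rng = np.random.default_rng(42)
    df = pd.concat([cbs] * 10, ignore_index=True)
    df['price'] = df['price'] * rng.uniform(0.9, 1.1, size=len(df))

    # segment zipcodes into at most 50 leaves, homogeneous in mean price
    tree = DecisionTreeRegressor(max_leaf_nodes=50)
    tree.fit(df[['zipcode']], df['price'])

    # in-sample encodings: fine for the tables below, leakage if used as a feature
    df['encoded_mean'] = tree.predict(df[['zipcode']])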

    @@ -476,7 +477,7 @@

    House price predict

    Taxonomy Encoder

    -

    The Taxonomy Encoder is a type of target encoder, as for example implemented in scikit-learn. A plain vanilla target encoder encodes a feature value by the target mean value for all samples within that category. When training a model, it’s important that out-of-sample target encodings are used to prevent information leakage. This is achieved by internal using cross-fitting and prediction. We’re going to apply the same logic in the Taxonomy Encoder. The difference with a normal target encoder is that we don’t take encode the feature by taking the mean of the target for all samples within the same category, but we take the decision tree prediction instead. By setting the maximum numer of leafs max_leaf_nodes we choose how many segments the decision tree will create. More nodes mean higher granularity but could also result in overfitting. When there are only a small amount of samples in a particular segment, we might want that segment to be merged with another segment; we set this by min_samples_leaf - again to avoid overfitting. The implementation of the Taxonomy Encoder is rather straightforward:

    +

    The Taxonomy Encoder is a type of target encoder, as for example implemented in scikit-learn. A plain vanilla target encoder encodes a feature value by the target mean value for all samples within that category. When training a model, it’s important that out-of-sample target encodings are used to prevent information leakage. This is achieved internally by using cross-fitting and prediction. We’re going to apply the same logic in the Taxonomy Encoder. The difference with a normal target encoder is that we don’t encode the feature by taking the mean of the target for all samples within the same category, but take the decision tree prediction instead. By setting the maximum number of leaves max_leaf_nodes we choose how many segments the decision tree will create. More nodes mean higher granularity but could also result in overfitting. When there are only a small number of samples in a particular segment, we might want that segment to be merged with another segment; we set this with min_samples_leaf, again to avoid overfitting. The implementation of the Taxonomy Encoder is rather straightforward:

    from sklearn.base import TransformerMixin, BaseEstimator
     from sklearn.model_selection import cross_val_predict
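    In outline, such a cross-fitted tree encoder can look like this. This is a minimal sketch of the idea described above, my own reconstruction rather than necessarily the exact code from the post:

    from sklearn.base import TransformerMixin, BaseEstimator
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeRegressor

    class TaxonomyEncoder(TransformerMixin, BaseEstimator):
        def __init__(self, max_leaf_nodes=50, min_samples_leaf=1, cv=5):
            self.max_leaf_nodes = max_leaf_nodes
            self.min_samples_leaf = min_samples_leaf
            self.cv = cv

        def _tree(self):
            # the segmentation model: a shallow tree over the taxonomy feature
            return DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes,
                                         min_samples_leaf=self.min_samples_leaf)

        def fit(self, X, y):
            # tree fitted on all training data; used to encode unseen data
            self.tree_ = self._tree().fit(X, y)
            return self

        def fit_transform(self, X, y):
            # cross-fitting: every training sample is encoded by a tree that
            # never saw it, which prevents information leakage
            self.fit(X, y)
            preds = cross_val_predict(self._tree(), X, y, cv=self.cv)
            return preds.reshape(-1, 1)

        def transform(self, X):
            return self.tree_.predict(X).reshape(-1, 1)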
    @@ -592,7 +593,7 @@ 

    Taxonomy Encoder

    Hyperparameter tuning

    -

    The TaxonomyEncoder introduces an additional hyperparameter to tune: the number of segments in which the zipcodes are combined. Below are plots of different values for max_leaf_nodes for encoding zipcodes in our house price prediction example:

    +

    The TaxonomyEncoder introduces an additional hyperparameter to tune: the number of segments in which the zipcodes are combined. Below are plots of different values for max_leaf_nodes for encoding zipcodes in our house price prediction example. Notice how only a large number of leaves results in segmentation of the east side of the country.
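    Because the TaxonomyEncoder behaves like any other scikit-learn transformer, max_leaf_nodes can be tuned inside a pipeline. A sketch, reusing the encoder sketched earlier and the toy df from the segmentation sketch:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([('txe', TaxonomyEncoder()),
                     ('model', RandomForestRegressor(random_state=42))])

    # tune the encoder granularity together with the downstream model
    grid = GridSearchCV(pipe,
                        param_grid={'txe__max_leaf_nodes': [10, 50, 250, 1000]},
                        scoring='neg_mean_absolute_error', cv=5)
    grid.fit(df[['zipcode']], df['price'])
    print(grid.best_params_)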

    @@ -689,6 +690,10 @@

    Taxonomy Encoder for binary classification

    def get_feature_names_out(self, input_features):
        return [f"txe_{x}" for x in input_features]
    +
    +
    +
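    For a binary target the same recipe applies with a classification tree: encode each sample with the out-of-fold predicted probability of the positive class. A sketch with made-up toy data:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    # made-up toy data: a zipcode-like feature and a binary outcome
    rng = np.random.default_rng(0)
    X = rng.integers(1000, 10000, size=(500, 1))
    y = ((X[:, 0] < 5000) ^ (rng.random(500) < 0.1)).astype(int)

    clf = DecisionTreeClassifier(max_leaf_nodes=50, min_samples_leaf=20)
    # the out-of-fold event rate per taxonomy segment is the encoding
    encoding = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]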

    Final remark

    +

    It’s important to shuffle your data before using the TaxonomyEncoder. The reason is that cross_val_predict uses (Stratified)KFold for cv splitting and the function doesn’t allow the use of a cv splitter that involves shuffling. If your data is sorted by the taxonomy bearing feature, the cross_val_predict values will be useless: when we set max_leaf_nodes to 3 in our example, the house prices of zipcodes 7000-9999 would be encoded based on the zipcodes 1000-7000, etc. The alternative solution is to extend the TaxonomyEncoder with a custom cv splitter that implements shuffling.
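    With pandas, shuffling is a one-liner:

    # shuffle the rows (and reset the index) before fitting the encoder
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)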

diff --git a/docs/search.json b/docs/search.json
index 91a73a1..09283ae 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -32,21 +32,21 @@ "href": "posts/taxonomy_encoder_blog.html", "title": "Taxonomy feature encoding", "section": "", - "text": "Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, that doesn’t mean that there is valuable ranking in these values. If we encode taxonomy bearing features use One-Hot-Encoding, the number of features blows up tremendously. Moreover, we loose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commerical clothing’ (5137) and ‘Commerical footwear’ (5139) are clearly more similar than for example ‘Child Care services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but dependent on the modelling task, we might want to have higher granularity in specific parts of the possible feature values.\nTo overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application; house price prediction in the Netherlands." + "text": "Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, but that doesn’t mean there is a meaningful ranking in these values. If we encode taxonomy bearing features using One-Hot-Encoding, the number of features blows up tremendously. Moreover, we lose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commercial clothing’ (5137) and ‘Commercial footwear’ (5139) are clearly more similar to each other than to, for example, ‘Child Care Services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but depending on the modelling task, we might want higher granularity in specific parts of the possible feature value range.\nTo overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application: house price prediction in the Netherlands." }, { "objectID": "posts/taxonomy_encoder_blog.html#introduction", "href": "posts/taxonomy_encoder_blog.html#introduction", "title": "Taxonomy feature encoding", "section": "", - "text": "Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, that doesn’t mean that there is valuable ranking in these values. If we encode taxonomy bearing features use One-Hot-Encoding, the number of features blows up tremendously. Moreover, we loose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commerical clothing’ (5137) and ‘Commerical footwear’ (5139) are clearly more similar than for example ‘Child Care services’ (8351).
We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but dependent on the modelling task, we might want to have higher granularity in specific parts of the possible feature values.\nTo overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application; house price prediction in the Netherlands." + "text": "Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, but that doesn’t mean there is a meaningful ranking in these values. If we encode taxonomy bearing features using One-Hot-Encoding, the number of features blows up tremendously. Moreover, we lose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commercial clothing’ (5137) and ‘Commercial footwear’ (5139) are clearly more similar to each other than to, for example, ‘Child Care Services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g. 5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but depending on the modelling task, we might want higher granularity in specific parts of the possible feature value range.\nTo overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application: house price prediction in the Netherlands." }, { "objectID": "posts/taxonomy_encoder_blog.html#house-price-prediction", "href": "posts/taxonomy_encoder_blog.html#house-price-prediction", "title": "Taxonomy feature encoding", "section": "House price prediction", - "text": "House price prediction\nWhen predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. Figure 1 shows average house prices in the Netherlands in 2023 per zipcode (source: CBS). We know from experience that in cities house prices in different zones can differ a lot, even if the zones are only a few kilometers away. In more rural areas, these differences are often much less prevalent. To illustrate this, Figure 2 zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices in the lower end and the most expensive house in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature; we need different granularity for cities than for the country side. The question is how to choose this granularity.\n\n\n\n\n\nFigure 1: House price per zip code in the Netherlands\n\n\n\n\n\n\n\n\n\n\n\n(a) City of Amsterdam\n\n\n\n\n\n\n\n(b) Province of Limburg\n\n\n\n\nFigure 2: House price average per zip code, note the difference in price homogeneity in cities vs rural areas\n\n\nLet’s use a decision tree regressor to create segments of zipcodes that are homogenous with respect to mean house prices. As I’m lacking a dataset with house prices of individual houses, I’m going to create such dataset by concatenating the CBS dataset 10 times and multiply house prices with a random factor between 0.9 and 1.1 to introduce some variation.
The decision tree regressor is fitted with max_leaf_nodes set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (Table 1) and the least expensive areas (Table 2). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all values between 1000 and 9999 are used as zipcodes!) and the cities where these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam has even three distinct zipcode areas in the top 10. If we would use these in-sample generated zipcode encodings in our model, we would make the unforgivable mistake of information leakage. The house price of each house is used to generate a feature that is used to predict that same house price. This is where the Taxonomy Encoder comes into play.\n\n\n\n\n\n\n\nTable 1: Encoded zip codes for areas with most expensive houses\n\n\n\nzipcode\ncity\n\n\n\nmin\nmax\nnunique\nunique\n\n\nencoded_mean\n\n\n\n\n\n\n\n\n1,340,222\n2243\n2244\n2\n[Wassenaar]\n\n\n1,276,450\n2111\n2111\n1\n[Aerdenhout]\n\n\n1,221,419\n1077\n1077\n1\n[Amsterdam]\n\n\n1,138,075\n1358\n1358\n1\n[Almere]\n\n\n1,015,257\n1071\n1071\n1\n[Amsterdam]\n\n\n995,357\n3546\n3546\n1\n[Utrecht]\n\n\n881,411\n2051\n2061\n2\n[Overveen, Bloemendaal]\n\n\n802,266\n1026\n1028\n3\n[Amsterdam]\n\n\n747,447\n2106\n2106\n1\n[Heemstede]\n\n\n691,614\n1251\n1272\n6\n[Laren, Blaricum, Huizen]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTable 2: Encoded zip codes for areas with least expensive houses\n\n\n\nzipcode\ncity\n\n\n\nmin\nmax\nnunique\nunique\n\n\nencoded_mean\n\n\n\n\n\n\n\n\n205,692\n6369\n6511\n43\n[Heerlen, Simpelveld, Landgraaf, nan, Hoensbroek, Amstenrade, Oirsbeek, Doenrade, Brunssum, Merkelbeek, Schinveld, Jabeek, Bingelrade, Kerkrade, Eygelshoven, Nijmegen]\n\n\n234,407\n3066\n3119\n30\n[Rotterdam, Schiedam]\n\n\n244,079\n2511\n2547\n24\n[Den Haag]\n\n\n247,141\n9541\n9999\n254\n[Groningen, Lauwersoog, Hornhuizen, Vlagtwedde, Bourtange, Sellingen, Ter Apel, Ter Apelkanaal, Zandberg, Veelerveen, 2e Exloërmond, 1e Exloërmond, Exloërveen, Musselkanaal, Mussel, Vledderveen, O...\n\n\n269,679\n4331\n4707\n164\n[Terneuzen, Nieuw Namen, Kloosterzande, Steenbergen, Oud-Vossemeer, Middelburg, Nieuw- en Sint Joosland, Arnemuiden, Veere, Gapinge, Serooskerke, Vrouwenpolder, Oostkapelle, Domburg, Westkapelle, ...\n\n\n281,335\n8574\n9301\n336\n[Workum, Oosterend, Grou, Drachten, Bakhuizen, Elahuizen, Oudega, Kolderwolde, Hemelum, Sneek, Gaastmeer, Idzega, Sandfirden, Blauwhuis, Westhem, Abbega, Oosthem, Heeg, Hommerts, Jutrijp, Uitwelli...\n\n\n283,546\n5851\n6367\n218\n[Susteren, Geleen, Eys, Afferden L, Siebengewald, Bergen L, Well L, Wellerlooi, Wanssum, Geijsteren, Blitterswijck, Meerlo, Tienray, Swolgen, Broekhuizenvorst, Broekhuizen, Venlo, Tegelen, Steyl, ...\n\n\n301,971\n3121\n3443\n152\n[Hellevoetsluis, Numansdorp, Dordrecht, Schiedam, Vlaardingen, Maassluis, Hoek van Holland, Maasland, Rhoon, Poortugaal, Rozenburg, Hoogvliet Rotterdam, Pernis Rotterdam, Spijkenisse, Hekelingen, ...\n\n\n307,961\n3551\n3565\n10\n[Utrecht]\n\n\n317,922\n1273\n1357\n39\n[Almere, Huizen]" + "text": "House price prediction\nWhen predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. 
Figure 1 shows average house prices in the Netherlands in 2023 per zipcode (source: CBS). We know that in cities house prices in different zones differ a lot, even if the zones are only a few kilometers apart. In more rural areas, these differences are often much less prevalent. To illustrate this, Figure 2 zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices in the lower end as well as the most expensive houses in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature: we need different granularity for cities than for the countryside. The question is how to choose this granularity.\n\n\n\n\n\nFigure 1: House price per zip code in the Netherlands\n\n\n\n\n\n\n\n\n\n\n\n(a) City of Amsterdam\n\n\n\n\n\n\n\n(b) Province of Limburg\n\n\n\n\nFigure 2: House price average per zip code; note the difference in price homogeneity in cities vs rural areas\n\n\nLet’s use a decision tree regressor to create segments of zipcodes that are homogeneous with respect to mean house prices. As I’m lacking a dataset with house prices of individual houses, I’m going to create such a dataset by concatenating the CBS dataset 10 times and multiplying house prices by a random factor between 0.9 and 1.1 to introduce some variation. The decision tree regressor is fitted with max_leaf_nodes set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (Table 1) and the least expensive areas (Table 2). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all possible values are used as zipcodes!) and the cities these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam even has three distinct zipcode areas in the country’s top 10. If we used these in-sample generated zipcode encodings in our model, we would make the mistake of information leakage: the house price of each house would be used to generate a feature that is used to predict that same house price.
This is where the Taxonomy Encoder comes into play.\n\n\n\n\n\n\n\nTable 1: Encoded zip codes for areas with most expensive houses\n\n\n\nzipcode\ncity\n\n\n\nmin\nmax\nnunique\nunique\n\n\nencoded_mean\n\n\n\n\n\n\n\n\n1,340,222\n2243\n2244\n2\n[Wassenaar]\n\n\n1,276,450\n2111\n2111\n1\n[Aerdenhout]\n\n\n1,221,419\n1077\n1077\n1\n[Amsterdam]\n\n\n1,138,075\n1358\n1358\n1\n[Almere]\n\n\n1,015,257\n1071\n1071\n1\n[Amsterdam]\n\n\n995,357\n3546\n3546\n1\n[Utrecht]\n\n\n881,411\n2051\n2061\n2\n[Overveen, Bloemendaal]\n\n\n802,266\n1026\n1028\n3\n[Amsterdam]\n\n\n747,447\n2106\n2106\n1\n[Heemstede]\n\n\n691,614\n1251\n1272\n6\n[Laren, Blaricum, Huizen]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTable 2: Encoded zip codes for areas with least expensive houses\n\n\n\nzipcode\ncity\n\n\n\nmin\nmax\nnunique\nunique\n\n\nencoded_mean\n\n\n\n\n\n\n\n\n205,692\n6369\n6511\n43\n[Heerlen, Simpelveld, Landgraaf, nan, Hoensbroek, Amstenrade, Oirsbeek, Doenrade, Brunssum, Merkelbeek, Schinveld, Jabeek, Bingelrade, Kerkrade, Eygelshoven, Nijmegen]\n\n\n234,407\n3066\n3119\n30\n[Rotterdam, Schiedam]\n\n\n244,079\n2511\n2547\n24\n[Den Haag]\n\n\n247,141\n9541\n9999\n254\n[Groningen, Lauwersoog, Hornhuizen, Vlagtwedde, Bourtange, Sellingen, Ter Apel, Ter Apelkanaal, Zandberg, Veelerveen, 2e Exloërmond, 1e Exloërmond, Exloërveen, Musselkanaal, Mussel, Vledderveen, O...\n\n\n269,679\n4331\n4707\n164\n[Terneuzen, Nieuw Namen, Kloosterzande, Steenbergen, Oud-Vossemeer, Middelburg, Nieuw- en Sint Joosland, Arnemuiden, Veere, Gapinge, Serooskerke, Vrouwenpolder, Oostkapelle, Domburg, Westkapelle, ...\n\n\n281,335\n8574\n9301\n336\n[Workum, Oosterend, Grou, Drachten, Bakhuizen, Elahuizen, Oudega, Kolderwolde, Hemelum, Sneek, Gaastmeer, Idzega, Sandfirden, Blauwhuis, Westhem, Abbega, Oosthem, Heeg, Hommerts, Jutrijp, Uitwelli...\n\n\n283,546\n5851\n6367\n218\n[Susteren, Geleen, Eys, Afferden L, Siebengewald, Bergen L, Well L, Wellerlooi, Wanssum, Geijsteren, Blitterswijck, Meerlo, Tienray, Swolgen, Broekhuizenvorst, Broekhuizen, Venlo, Tegelen, Steyl, ...\n\n\n301,971\n3121\n3443\n152\n[Hellevoetsluis, Numansdorp, Dordrecht, Schiedam, Vlaardingen, Maassluis, Hoek van Holland, Maasland, Rhoon, Poortugaal, Rozenburg, Hoogvliet Rotterdam, Pernis Rotterdam, Spijkenisse, Hekelingen, ...\n\n\n307,961\n3551\n3565\n10\n[Utrecht]\n\n\n317,922\n1273\n1357\n39\n[Almere, Huizen]" }, { "objectID": "posts/hyperparameter_tuning_spark.html#introduction", @@ -138,5 +138,12 @@ "title": "About", "section": "", "text": "I work as a data scientist for a financial institution. My main topics of interest are entity resolution, fuzzy matching, classification for imbalanced data problems and aggregation learning.\nSome of the libraries I created or co-created:\n\nDeduplipy - Entity resolution package (deduplipy.com, GitHub, PyData Global presentation)\n\nSpark-Matcher - Entity resolution and fuzzy matching at scale in Spark (GitHub)\nPyMinHash - Minhashing in Python (GitHub)\nOther:\n\nLockdownRadar.nl (newspaper article)" + }, + { + "objectID": "posts/taxonomy_encoder_blog.html#final-remark", + "href": "posts/taxonomy_encoder_blog.html#final-remark", + "title": "Taxonomy feature encoding", + "section": "Final remark", + "text": "Final remark\nIt’s important to shuffle your data before using the TaxonomyEncoder. The reason is that the cross_val_predict uses (Stratified)KFold for cv splitting and the function doesn’t allow for usage of any cv-splitter that involves shuffling. 
If your data is sorted by the taxonomy bearing feature, the out-of-sample predictions will be useless. The alternative solution is to extend the TaxonomyEncoder with a custom cv splitter that implements shuffling." } ] \ No newline at end of file