improve taxonomy encoder blogpost
fritshermans committed Feb 26, 2024
1 parent 3e43cb1 commit 8273c73
Showing 5 changed files with 24 additions and 12 deletions.
6 changes: 3 additions & 3 deletions docs/index.html
@@ -215,7 +215,7 @@ <h3 class="no-anchor listing-title">
</div>
<div class="quarto-post image-right" data-index="1" data-listing-file-modified-sort="1681729088162" data-listing-reading-time-sort="3">
<div class="thumbnail">
<p><a href="./posts/Deduplipy.html"> <p class="card-img-top"><img src="posts/Deduplipy_files/figure-html/9e137887-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png" class="thumbnail-image card-img"/></p> </a></p>
<p><a href="./posts/Deduplipy.html"> <p class="card-img-top"><img src="posts/Deduplipy_files/figure-html/957bd8db-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png" class="thumbnail-image card-img"/></p> </a></p>
</div>
<div class="body">
<a href="./posts/Deduplipy.html">
@@ -276,7 +276,7 @@ <h3 class="no-anchor listing-title">
<a href="./posts/PyMinhash.html"> </a>
</div>
</div>
<div class="quarto-post image-right" data-index="4" data-listing-file-modified-sort="1708871784262" data-listing-reading-time-sort="5">
<div class="quarto-post image-right" data-index="4" data-listing-file-modified-sort="1708929081866" data-listing-reading-time-sort="6">
<div class="thumbnail">
<p><a href="./posts/taxonomy_encoder_blog.html"> <p class="card-img-top"><img src="posts/taxonomy_encoder_blog_files/figure-html/fig-ih-output-1.png" class="thumbnail-image card-img"/></p> </a></p>
</div>
@@ -289,7 +289,7 @@ <h3 class="no-anchor listing-title">

</div>
<div class="listing-description">
- Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to
+ Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make…
</div>
</a>
</div>
2 changes: 1 addition & 1 deletion docs/posts/Deduplipy.html
@@ -183,7 +183,7 @@ <h1 class="title">Deduplication of records using DedupliPy</h1>

</header>

<p><img src="Deduplipy_files/figure-html/9e137887-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png" class="img-fluid"></p>
<p><img src="Deduplipy_files/figure-html/957bd8db-1-56aaf717-0970-440c-b3f9-0aec94500f2d.png" class="img-fluid"></p>
<p>Deduplication or entity resolution is the task of combining different representations of the same real world entity. The Python package DedupliPy implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset. In this post I demonstrate how the package works and show more advanced settings. In case you want to apply entity resolution on large data in Spark, please have a look at <a href="https://github.com/ing-bank/spark-matcher">Spark-Matcher</a>, a package I developed together with two colleagues.</p>
<section id="installation" class="level2">
<h2 class="anchored" data-anchor-id="installation">Installation</h2>
15 changes: 10 additions & 5 deletions docs/posts/taxonomy_encoder_blog.html
@@ -162,6 +162,7 @@ <h2 id="toc-title">On this page</h2>
<li><a href="#taxonomy-encoder" id="toc-taxonomy-encoder" class="nav-link" data-scroll-target="#taxonomy-encoder">Taxonomy Encoder</a></li>
<li><a href="#hyperparameter-tuning" id="toc-hyperparameter-tuning" class="nav-link" data-scroll-target="#hyperparameter-tuning">Hyperparameter tuning</a></li>
<li><a href="#taxonomy-encoder-for-binary-classification" id="toc-taxonomy-encoder-for-binary-classification" class="nav-link" data-scroll-target="#taxonomy-encoder-for-binary-classification">Taxonomy Encoder for binary classification</a></li>
<li><a href="#final-remark" id="toc-final-remark" class="nav-link" data-scroll-target="#final-remark">Final remark</a></li>
</ul>
</nav>
</div>
@@ -187,12 +188,12 @@ <h1 class="title">Taxonomy feature encoding</h1>

<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
- <p>Features like zipcodes or industry codes contain information that is part of a taxomy. Although these feature values might be numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, that doesn’t mean that there is valuable ranking in these values. If we encode taxonomy bearing features use One-Hot-Encoding, the number of features blows up tremendously. Moreover, we loose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commerical clothing’ (5137) and ‘Commerical footwear’ (5139) are clearly more similar than for example ‘Child Care services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g.&nbsp;5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but dependent on the modelling task, we might want to have higher granularity in specific parts of the possible feature values.</p>
+ <p>Features like zipcodes or industry codes (NAICS, MCC) contain information that is part of a taxonomy. Although these feature values are numerical, it doesn’t necessarily make sense to use them as ordinal features; a region’s zipcode might be a higher value than another region’s zipcode, but that doesn’t mean that there is a valuable ranking in these values. If we encode taxonomy bearing features using One-Hot-Encoding, the number of features blows up tremendously. Moreover, we lose helpful information on the similarity of adjacent values. E.g. the MCC industry codes for ‘Commercial clothing’ (5137) and ‘Commercial footwear’ (5139) are clearly more similar to each other than to, for example, ‘Child Care services’ (8351). We could overcome this issue by One-Hot-Encoding at a higher level (e.g.&nbsp;5xxx for ‘Stores’ and 8xxx for ‘Professional Services and Membership Organizations’) but depending on the modelling task, we might want higher granularity in specific parts of the possible feature value range.</p>
<p>To overcome these issues, I created the Taxonomy Encoder for scikit-learn. Before going to the implementation, let’s first have a look at a practical application: house price prediction in the Netherlands.</p>
</section>
<section id="house-price-prediction" class="level2">
<h2 class="anchored" data-anchor-id="house-price-prediction">House price prediction</h2>
- <p>When predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. <a href="#fig-ih">Figure&nbsp;1</a> shows average house prices in the Netherlands in 2023 per zipcode (source: <a href="https://www.cbs.nl">CBS</a>). We know from experience that in cities house prices in different zones can differ a lot, even if the zones are only a few kilometers away. In more rural areas, these differences are often much less prevalent. To illustrate this, <a href="#fig-ih2">Figure&nbsp;2</a> zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices in the lower end and the most expensive house in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature; we need different granularity for cities than for the country side. The question is how to choose this granularity.</p>
+ <p>When predicting house prices, zipcode is an example of a taxonomy bearing feature where we need different granularity for encoding the feature in different regions. <a href="#fig-ih">Figure&nbsp;1</a> shows average house prices in the Netherlands in 2023 per zipcode (source: <a href="https://www.cbs.nl">CBS</a>). We know that in cities house prices in different zones differ a lot, even if the zones are only a few kilometers apart. In more rural areas, these differences are often much less prevalent. To illustrate this, <a href="#fig-ih2">Figure&nbsp;2</a> zooms in on Amsterdam (a) and the province of Limburg (b). Amsterdam is a relatively small city that has zones with house prices at the lower end as well as the most expensive houses in the country. The province of Limburg is a much larger area but has significantly less variation in house prices. Going back to our aim of encoding the zipcode feature: we need different granularity for cities than for the countryside. The question is how to choose this granularity.</p>
<div class="cell" data-execution_count="9">
<div class="cell-output cell-output-display">
<div id="fig-ih" class="quarto-figure quarto-figure-center anchored">
@@ -226,7 +227,7 @@ <h2 class="anchored" data-anchor-id="house-price-prediction">House price predict
<p></p><figcaption class="figure-caption">Figure&nbsp;2: House price average per zip code, note the difference in price homogeneity in cities vs rural areas</figcaption><p></p>
</figure>
</div>
- <p>Let’s use a decision tree regressor to create segments of zipcodes that are homogenous with respect to mean house prices. As I’m lacking a dataset with house prices of <em>individual</em> houses, I’m going to create such dataset by concatenating the CBS dataset 10 times and multiply house prices with a random factor between 0.9 and 1.1 to introduce some variation. The decision tree regressor is fitted with <code>max_leaf_nodes</code> set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (<a href="#tbl-expensive">Table&nbsp;1</a>) and the least expensive areas (<a href="#tbl-cheap">Table&nbsp;2</a>). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all values between 1000 and 9999 are used as zipcodes!) and the cities where these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam has even three distinct zipcode areas in the top 10. If we would use these in-sample generated zipcode encodings in our model, we would make the unforgivable mistake of information leakage. The house price of each house is used to generate a feature that is used to predict that same house price. This is where the Taxonomy Encoder comes into play.</p>
+ <p>Let’s use a decision tree regressor to create segments of zipcodes that are homogeneous with respect to mean house prices. As I’m lacking a dataset with house prices of <em>individual</em> houses, I’m going to create such a dataset by concatenating the CBS dataset 10 times and multiplying house prices by a random factor between 0.9 and 1.1 to introduce some variation. The decision tree regressor is fitted with <code>max_leaf_nodes</code> set to 50. This means that the zipcodes will be placed in 50 segments. To illustrate the effectiveness of this method, I show in-sample predictions for the most expensive (<a href="#tbl-expensive">Table&nbsp;1</a>) and the least expensive areas (<a href="#tbl-cheap">Table&nbsp;2</a>). The two tables show encoded mean house prices, the range of zipcodes, the number of zipcodes in that range (apparently not all possible values are used as zipcodes!) and the cities these zipcodes are in. Clearly, the most expensive areas are much smaller and require a higher granularity of zipcode encoding than areas with lower house prices. Note how Amsterdam even has three distinct zipcode areas in the country’s top 10. If we used these in-sample generated zipcode encodings in our model, we would make the mistake of information leakage: the house price of each house would be used to generate a feature that is used to predict that same house price. This is where the Taxonomy Encoder comes into play.</p>
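<p>A minimal sketch of that data construction and fit (the file name and the column names <code>zipcode</code> and <code>price</code> are assumptions, not the exact code behind the tables):</p>
<pre class="sourceCode python"><code>import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# hypothetical input: one row per zipcode with its mean house price
cbs = pd.read_csv("cbs_house_prices.csv")  # assumed columns: zipcode, price

# concatenate 10 times and perturb each price by a factor in [0.9, 1.1]
rng = np.random.default_rng(0)
df = pd.concat([cbs] * 10, ignore_index=True)
df["price"] *= rng.uniform(0.9, 1.1, size=len(df))

# segment zipcodes into at most 50 leaves, homogeneous in mean price
tree = DecisionTreeRegressor(max_leaf_nodes=50)
tree.fit(df[["zipcode"]], df["price"])
df["zipcode_encoded"] = tree.predict(df[["zipcode"]])  # in-sample: leaks the target!
</code></pre>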
<div class="cell" data-scrolled="true" data-tags="[]" data-execution_count="18">
<div class="cell-output cell-output-display" data-execution_count="18">
<div>
@@ -476,7 +477,7 @@ <h2 class="anchored" data-anchor-id="house-price-prediction">House price predict
</section>
<section id="taxonomy-encoder" class="level1">
<h1>Taxonomy Encoder</h1>
- <p>The Taxonomy Encoder is a type of target encoder, as for example implemented in scikit-learn. A plain vanilla target encoder encodes a feature value by the target mean value for all samples within that category. When training a model, it’s important that out-of-sample target encodings are used to prevent information leakage. This is achieved by internal using cross-fitting and prediction. We’re going to apply the same logic in the Taxonomy Encoder. The difference with a normal target encoder is that we don’t take encode the feature by taking the mean of the target for all samples within the same category, but we take the decision tree prediction instead. By setting the maximum numer of leafs <code>max_leaf_nodes</code> we choose how many segments the decision tree will create. More nodes mean higher granularity but could also result in overfitting. When there are only a small amount of samples in a particular segment, we might want that segment to be merged with another segment; we set this by <code>min_samples_leaf</code> - again to avoid overfitting. The implementation of the Taxonomy Encoder is rather straightforward:</p>
+ <p>The Taxonomy Encoder is a type of target encoder, as for example <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder">implemented</a> in scikit-learn. A plain vanilla target encoder encodes a feature value by the mean target value over all samples within that category. When training a model, it’s important that out-of-sample target encodings are used to prevent information leakage. This is achieved internally by cross-fitting and prediction. We’re going to apply the same logic in the Taxonomy Encoder. The difference with a normal target encoder is that we don’t encode the feature by taking the mean of the target for all samples within the same category, but take the decision tree prediction instead. By setting the maximum number of leaf nodes <code>max_leaf_nodes</code> we choose how many segments the decision tree will create. More nodes mean higher granularity but could also result in overfitting. When there are only a small number of samples in a particular segment, we might want that segment to be merged with another segment; we set this with <code>min_samples_leaf</code>, again to avoid overfitting. The implementation of the Taxonomy Encoder is rather straightforward:</p>
<div class="cell" data-execution_count="46">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn.base <span class="im">import</span> TransformerMixin, BaseEstimator</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn.model_selection <span class="im">import</span> cross_val_predict</span>
@@ -592,7 +593,7 @@ </section>
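<p>Only the first lines of the implementation are visible above; a minimal sketch consistent with the description (class and parameter names follow the text, the exact code may differ) could look like this:</p>
<pre class="sourceCode python"><code>import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

class TaxonomyEncoder(TransformerMixin, BaseEstimator):
    """Encode a taxonomy bearing feature by decision tree predictions."""

    def __init__(self, max_leaf_nodes=50, min_samples_leaf=1, cv=5):
        self.max_leaf_nodes = max_leaf_nodes
        self.min_samples_leaf = min_samples_leaf
        self.cv = cv

    def _tree(self):
        return DecisionTreeRegressor(
            max_leaf_nodes=self.max_leaf_nodes,
            min_samples_leaf=self.min_samples_leaf,
        )

    def fit(self, X, y):
        # tree fitted on all training data; used for transforming new data
        self.tree_ = self._tree().fit(np.asarray(X).reshape(-1, 1), y)
        return self

    def transform(self, X):
        return self.tree_.predict(np.asarray(X).reshape(-1, 1)).reshape(-1, 1)

    def fit_transform(self, X, y):
        # cross-fitting: each training sample is encoded by a tree that
        # never saw it, preventing target leakage
        self.fit(X, y)
        oof = cross_val_predict(
            self._tree(), np.asarray(X).reshape(-1, 1), y, cv=self.cv
        )
        return oof.reshape(-1, 1)
</code></pre>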
</section>
<section id="hyperparameter-tuning" class="level1">
<h1>Hyperparameter tuning</h1>
- <p>The TaxonomyEncoder introduces an additional hyperparameter to tune: the number of segments in which the zipcodes are combined. Below are plots of different values for <code>max_leaf_nodes</code> for encoding zipcodes in our house price prediction example:</p>
+ <p>The TaxonomyEncoder introduces an additional hyperparameter to tune: the number of segments in which the zipcodes are combined. Below are plots of different values for <code>max_leaf_nodes</code> for encoding zipcodes in our house price prediction example. Notice how only a large number of leaf nodes results in segmentation of the east side of the country.</p>
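<p>Before looking at the plots: this hyperparameter can be tuned like any other, for example with a grid search over a pipeline. A sketch reusing the TaxonomyEncoder and the dataframe from the sketches above (the downstream model and grid values are illustrative):</p>
<pre class="sourceCode python"><code>from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# hypothetical pipeline: encode zipcode, then fit a simple regressor
pipe = Pipeline([("txe", TaxonomyEncoder()), ("model", Ridge())])

grid = GridSearchCV(
    pipe,
    param_grid={"txe__max_leaf_nodes": [10, 25, 50, 100, 250]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(df[["zipcode"]], df["price"])
print(grid.best_params_)
</code></pre>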
<div id="fig-ih3" class="cell quarto-layout-panel" data-execution_count="34">
<figure class="figure">
<div class="quarto-layout-row quarto-layout-valign-top">
@@ -689,6 +690,10 @@ <h1>Taxonomy Encoder for binary classification</h1>
<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> get_feature_names_out(<span class="va">self</span>, input_features):</span>
<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> [<span class="ss">f"txe_</span><span class="sc">{</span>x<span class="sc">}</span><span class="ss">"</span> <span class="cf">for</span> x <span class="kw">in</span> input_features]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
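<p>Only the tail of the classification variant is visible above. The core change, sketched here under the assumption that it mirrors the regression version, is swapping in a <code>DecisionTreeClassifier</code> and encoding with out-of-fold predicted probabilities:</p>
<pre class="sourceCode python"><code>import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for a taxonomy feature and a binary target
rng = np.random.default_rng(0)
X = rng.integers(1000, 10000, size=(500, 1))  # e.g. zipcodes
y = (X[:, 0] > 5000).astype(int)

# out-of-fold probability of the positive class as the encoding
proba = cross_val_predict(
    DecisionTreeClassifier(max_leaf_nodes=50), X, y,
    cv=5, method="predict_proba",
)
encoded = proba[:, 1].reshape(-1, 1)  # leakage-free P(y=1) per sample
</code></pre>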
+ </section>
+ <section id="final-remark" class="level1">
+ <h1>Final remark</h1>
+ <p>It’s important to shuffle your data before using the TaxonomyEncoder. The reason is that with an integer <code>cv</code> argument, <code>cross_val_predict</code> defaults to (Stratified)KFold without shuffling. If your data is sorted by the taxonomy bearing feature, the <code>cross_val_predict</code> values will be useless: if we set <code>max_leaf_nodes</code> to 3 in our example, the house prices of zipcodes 7000-9999 would be encoded based on zipcodes 1000-6999, etc. An alternative solution is to extend the TaxonomyEncoder with a cv splitter that shuffles, such as <code>KFold(shuffle=True)</code>.</p>
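<p>A minimal illustration of both fixes, reusing the names from the sketches above:</p>
<pre class="sourceCode python"><code>from sklearn.model_selection import KFold
from sklearn.utils import shuffle

# fix 1: shuffle rows (keeping feature and target aligned) before encoding
X_shuf, y_shuf = shuffle(df[["zipcode"]], df["price"], random_state=0)

# fix 2: pass a shuffling partition splitter to cross_val_predict;
# KFold with shuffle=True still yields a partition, so it is accepted
cv = KFold(n_splits=5, shuffle=True, random_state=0)
encoded = TaxonomyEncoder(cv=cv).fit_transform(df[["zipcode"]], df["price"])
</code></pre>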


</section>