Update industry rerun information in readmes etc
lizgzil committed Nov 18, 2024
1 parent d7e17a8 commit 3a79dea
Showing 4 changed files with 42 additions and 1 deletion.
18 changes: 18 additions & 0 deletions dap_prinz_green_jobs/pipeline/ojo_application/flows/README.md
@@ -37,3 +37,21 @@ Installing faiss on the EC2 machine was hard, here is a log of what was done, al
conda install -c pytorch faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl
pip install faiss-cpu
```

# Updating the data

## Industry measures

To extract the industry measures for just the newest set of job adverts and merge them with the pre-calculated industry measures for older job adverts, update the file directories at the top of this script as needed (`green_ind_existing_data_dir` and `new_ojo_descriptions_dir`) and run:

```
python dap_prinz_green_jobs/pipeline/ojo_application/flows/ojo_industry_measures_update.py --production
```
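
For illustration only (the actual values depend on where the latest data lives), the directory variables at the top of the script might be set along these lines before running:

```
# Hypothetical example values - replace with the actual S3 locations.
# green_ind_existing_data_dir: folder containing the previously calculated
# industry measures for the older job adverts.
# new_ojo_descriptions_dir: folder containing the newest batch of deduplicated
# job advert descriptions to process.
green_ind_existing_data_dir = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/[PREVIOUS_DATE_RUN]/"
)
new_ojo_descriptions_dir = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/"
)
```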

On EC2 this took 45 hours.

This produces two files of interest in the `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/[DATE_RUN]/` folder:

1. `ojo_newest_industry_green_measures_production_True.csv`: the extracted green industry measures for the new job adverts; for the `20241115` run this covered 1,313,447 job adverts.
2. `ojo_all_industry_green_measures_production_True.csv`: a merged file of the industry measures for all the new and old job adverts; for the `20241115` run this covered 5,967,229 job adverts.
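
As a rough sketch (assuming `s3fs` and `polars` are available in the environment, and using the `20241115` run as an example path), the merged output could be inspected like this:

```
import polars as pl
import s3fs  # assumed to be installed alongside polars

fs = s3fs.S3FileSystem()
path = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/"
    "extracted_green_measures/20241115/"
    "ojo_all_industry_green_measures_production_True.csv"
)
with fs.open(path, "rb") as f:
    all_measures = pl.read_csv(f)

# For the 20241115 run this should contain around 5,967,229 rows
print(all_measures.shape)
```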
@@ -180,7 +180,11 @@ def write_polars_s3(df, destination):

# Join with the existing green industry measures

inds_measures_df["INDUSTRY GHG PER UNIT EMISSIONS"] = inds_measures_df[
"INDUSTRY GHG PER UNIT EMISSIONS"
].astype(str)
inds_measures_pl = pl.from_pandas(inds_measures_df)

all_inds_measures_df = pl.concat(
[green_ind_existing_data, inds_measures_pl], how="vertical_relaxed"
)
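
As a minimal illustration (made-up values, not from the repository) of why the string cast and the `vertical_relaxed` concatenation are used: the existing CSV-derived data and the newly computed measures can hold this column with different dtypes, and `vertical_relaxed` lets polars reconcile them when stacking:

```
import polars as pl

# Existing measures loaded from CSV: the column has been read as strings
existing = pl.DataFrame({"INDUSTRY GHG PER UNIT EMISSIONS": ["0.12", "0.34"]})

# Newly calculated measures: the same column is numeric
new = pl.DataFrame({"INDUSTRY GHG PER UNIT EMISSIONS": [0.56]})

# Cast the new column to string to match, then stack the two frames;
# "vertical_relaxed" coerces any remaining compatible dtype differences
new = new.with_columns(pl.col("INDUSTRY GHG PER UNIT EMISSIONS").cast(pl.Utf8))
combined = pl.concat([existing, new], how="vertical_relaxed")
print(combined)
```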
@@ -18,7 +18,7 @@ To generate deduplicated datasets for:

- A random small sample (100,000);
- An engineered 'green' sample based on keywords (100,000); and
- The final random sample of 1,000,000 job ads

run:

@@ -38,3 +38,21 @@ There are multiple main files of interest:
All files contain the job advert descriptions, the location, date, and the job title. The `large_ojo_sample.csv` file will be our final sample to run our models on for downstream analysis.

Since green jobs are rare in our dataset, the green sample was generated as a way to test our approaches on jobs that are likely to be green. We recognise that the dataset created by the keyword search is not a conclusive list, will pick up false positives, and will also miss many green jobs. Neither this list nor the way it was generated will be used to make any comment on greenness - it is just a useful dataset for this project's development.

## Refreshing the data

When new OJO data is added, you can run the deduplication and filtering steps on the new batch of job adverts by first editing and then running:

```
python dap_prinz_green_jobs/pipeline/ojo_application/ojo_sample/data_refresh.py
```

The file paths at the top of this script will need to be edited to point to the latest data locations for the full OJO data.

The datasets will then be saved out to a datestamped folder, e.g. `s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/`. This will just contain the data for the newest job adverts.
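
As a rough sketch (assuming `pandas` with `s3fs` installed so `s3://` paths can be read directly), the deduplicated job IDs from a datestamped run could then be loaded for downstream filtering:

```
import pandas as pd  # assumes s3fs is installed so pandas can read s3:// paths

dedup_ids_path = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/"
    "deduplicated_sample/20241114/deduplicated_job_ids.csv"
)
dedup_ids = pd.read_csv(dedup_ids_path)

# The 20241114 refresh should contain 1,313,447 unique job adverts
print(len(dedup_ids))
```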

| Date ran | Min created date | Max created date | Number of unique job adverts in deduplicated data | Job ids data location |
| -------- | ---------------- | ---------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| 20240213 | 11/12/2020 | 06/11/2023 | 4,653,782 | s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/deduplicated_job_ids.csv |
| 20241114 | 07/11/2023 | 05/11/2024 | 1,313,447 | s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/deduplicated_job_ids.csv |
1 change: 1 addition & 0 deletions requirements.txt
@@ -16,3 +16,4 @@ typing_extensions<4.6.0
umap-learn
geopandas
nx_altair
polars-lts-cpu==1.5.0
