Update industry rerun information in readmes etc
lizgzil committed Nov 18, 2024
1 parent d7e17a8 commit 3a79dea
Showing 4 changed files with 42 additions and 1 deletion.
18 changes: 18 additions & 0 deletions dap_prinz_green_jobs/pipeline/ojo_application/flows/README.md
@@ -37,3 +37,21 @@ Installing faiss on the EC2 machine was hard, here is a log of what was done, al
conda install -c pytorch faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl
pip install faiss-cpu
```

# Updating the data

## Industry measures

To extract the industry measures for just the newest set of job adverts and merge them with the pre-calculated industry measures for older job adverts, update the file directories at the top of this script as needed (`green_ind_existing_data_dir` and `new_ojo_descriptions_dir`) and run:

```
python dap_prinz_green_jobs/pipeline/ojo_application/flows/ojo_industry_measures_update.py --production
```
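
For illustration only (the actual values depend on where the latest data lives), the directory variables at the top of the script might be set along these lines before running:

```
# Hypothetical example values - replace with the actual S3 locations.
# green_ind_existing_data_dir: folder containing the previously calculated
# industry measures for the older job adverts.
# new_ojo_descriptions_dir: folder containing the newest batch of deduplicated
# job advert descriptions to process.
green_ind_existing_data_dir = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/[PREVIOUS_DATE_RUN]/"
)
new_ojo_descriptions_dir = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/"
)
```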

On EC2 this took 45 hours.

This produces two files of interest in the `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/[DATE_RUN]/` folder:

1. `ojo_newest_industry_green_measures_production_True.csv`: the extracted green industry measures for the new job adverts; for the `20241115` run this covered 1,313,447 job adverts.
2. `ojo_all_industry_green_measures_production_True.csv`: a merged file of the industry measures for all the new and old job adverts; for the `20241115` run this covered 5,967,229 job adverts.
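
As a rough sketch (assuming `s3fs` and `polars` are available in the environment, and using the `20241115` run as an example path), the merged output could be inspected like this:

```
import polars as pl
import s3fs  # assumed to be installed alongside polars

fs = s3fs.S3FileSystem()
path = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/"
    "extracted_green_measures/20241115/"
    "ojo_all_industry_green_measures_production_True.csv"
)
with fs.open(path, "rb") as f:
    all_measures = pl.read_csv(f)

# For the 20241115 run this should contain around 5,967,229 rows
print(all_measures.shape)
```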
@@ -180,7 +180,11 @@ def write_polars_s3(df, destination):

# Join with the existing green industry measures

inds_measures_df["INDUSTRY GHG PER UNIT EMISSIONS"] = inds_measures_df[
"INDUSTRY GHG PER UNIT EMISSIONS"
].astype(str)
inds_measures_pl = pl.from_pandas(inds_measures_df)

all_inds_measures_df = pl.concat(
[green_ind_existing_data, inds_measures_pl], how="vertical_relaxed"
)
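
As a minimal illustration (made-up values, not from the repository) of why the string cast and the `vertical_relaxed` concatenation are used: the existing CSV-derived data and the newly computed measures can hold this column with different dtypes, and `vertical_relaxed` lets polars reconcile them when stacking:

```
import polars as pl

# Existing measures loaded from CSV: the column has been read as strings
existing = pl.DataFrame({"INDUSTRY GHG PER UNIT EMISSIONS": ["0.12", "0.34"]})

# Newly calculated measures: the same column is numeric
new = pl.DataFrame({"INDUSTRY GHG PER UNIT EMISSIONS": [0.56]})

# Cast the new column to string to match, then stack the two frames;
# "vertical_relaxed" coerces any remaining compatible dtype differences
new = new.with_columns(pl.col("INDUSTRY GHG PER UNIT EMISSIONS").cast(pl.Utf8))
combined = pl.concat([existing, new], how="vertical_relaxed")
print(combined)
```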
@@ -18,7 +18,7 @@ To generate deduplicated datasets for:

- A random small sample (100,000);
- An engineered 'green' sample based on keywords (100,000); and
- The final random sample of 1,000,000 job ads

run:

@@ -38,3 +38,21 @@ There are multiple main files of interest:
All files contain the job advert descriptions, the location, date, and the job title. The `large_ojo_sample.csv` file will be our final sample to run our models on for downstream analysis.

Since green jobs are rare in our dataset, the green sample was generated as a way to test our approaches on jobs that are likely to be green. We recognise that the dataset created by the keyword search is not a conclusive list, will pick up false positives, and will also miss many green jobs. Neither this list nor the way it was generated will be used to make any comment on greenness - it is just a useful dataset for this project's development.

## Refreshing the data

When new OJO data is added, you can run the deduplication and filtering steps on the new batch of job adverts by first editing and then running:

```
python dap_prinz_green_jobs/pipeline/ojo_application/ojo_sample/data_refresh.py
```

The file paths at the top of this script will need to be edited to point to the latest data locations for the full OJO data.

The datasets will then be saved out to a datestamped folder, e.g. `s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/`. This will just contain the data for the newest job adverts.
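
As a rough sketch (assuming `pandas` with `s3fs` installed so `s3://` paths can be read directly), the deduplicated job IDs from a datestamped run could then be loaded for downstream filtering:

```
import pandas as pd  # assumes s3fs is installed so pandas can read s3:// paths

dedup_ids_path = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/"
    "deduplicated_sample/20241114/deduplicated_job_ids.csv"
)
dedup_ids = pd.read_csv(dedup_ids_path)

# The 20241114 refresh should contain 1,313,447 unique job adverts
print(len(dedup_ids))
```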

| Date ran | Min created date | Max created date | Number of unique job adverts in deduplicated data | Job ids data location |
| -------- | ---------------- | ---------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| 20240213 | 11/12/2020 | 06/11/2023 | 4,653,782 | s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/deduplicated_job_ids.csv |
| 20241114 | 07/11/2023 | 05/11/2024 | 1,313,447 | s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/deduplicated_job_ids.csv |
1 change: 1 addition & 0 deletions requirements.txt
@@ -16,3 +16,4 @@ typing_extensions<4.6.0
umap-learn
geopandas
nx_altair
polars-lts-cpu==1.5.0
