To install the library in a dedicated virtual environment:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install git+https://github.com/dataforgoodfr/12_taxobservatory.git
To run the report downloader from the command line, invoke the pdf_downloader module:
python3 -m collecte.pdf_downloader company_names.csv
Several optional parameters can also be tuned. To see how to use them, check the help manual:
python3 -m collecte.pdf_downloader --help
A more complete example could be:
python3 -m collecte.pdf_downloader company_names.csv --search_keywords "tax country by country reporting GRI 207-4" --dest_dirpath try_pdf_downloads --url_cache_filepath pdf_url_cache.pkl --fetch_timeout_s 60 --debug
Running this module requires a Google JSON API key as well as a search engine ID (or CX code). These must be specified in the .env file (a sample file is provided as .env.example):
# Required for fetching URLs with the Google JSON API
GOOGLE_API_KEY=CHANGE_ME
GOOGLE_CX=CHANGE_ME
If the pipeline runs successfully, the results folder should contain the following elements:
- a collection of company-named folders, each containing one or multiple PDFs
- a log file run_pdf_downloader_DD_MM_YYYY_hh_mm_ss.log storing all the runtime logging entries
- a CSV file download_data.csv listing all the downloaded company reports and their URLs
- a CSV file missing_data.csv listing all the missing company reports and their URLs (if some were found), plus the type of error that prevented their download
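
As a quick sanity check on the layout described above, the company folders can be inspected with a few lines of Python. This is only a sketch: the function name summarize_results is hypothetical and not part of the library, and it assumes the downloaded reports carry a lowercase .pdf extension.

```python
from pathlib import Path


def summarize_results(results_dir):
    """Count the PDFs found in each company-named subfolder of results_dir."""
    counts = {}
    for entry in sorted(Path(results_dir).iterdir()):
        # Only company folders are inspected; the log and CSV files are skipped.
        if entry.is_dir():
            # Assumption: reports are saved with a lowercase ".pdf" extension.
            counts[entry.name] = len(list(entry.glob("*.pdf")))
    return counts
```

Calling summarize_results("try_pdf_downloads") on the destination folder used in the example command returns a dictionary mapping each company folder to its number of PDFs; a folder with zero PDFs is a hint to look at missing_data.csv.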
For convenience, a collection of company names is available at test/data/company_names.csv.
To use the streamlined version of the extractor, start the Streamlit app locally by running
streamlit run app/index.py
The app comes with a default configuration for page detection and parsing, but you can change it by providing a YAML file that follows the config.yaml format described below.
Below is an example of the pipeline running on one of the reports, parsing the tables with LlamaParse and Unstructured.
PipelineDemonstration.webm
To run the pipeline from the command line, once installed, you can invoke the country_by_country module on a PDF file as:
python3 -m country_by_country config.yaml report.pdf
The YAML file describes the pipeline you want to execute. For now, you can specify the page filter and the table extraction algorithms. An example config.yaml file is given below:
pagefilter:
  type: RFClassifier
  params:
    modelfile: random_forest_model_low_false_positive.joblib
table_extraction:
  img:
    - type: Camelot
      params:
        flavor: stream
    - type: Unstructured
      params:
        pdf_image_dpi: 300
        hi_res_model_name: "yolox"
table_cleaning:
  - type: LLM
    params:
      openai_model: "gpt-4-turbo-preview"
This config file uses:
- a pretrained random forest for selecting the pages of the report that possibly contain a CbCR table
- Camelot with its stream flavor and Unstructured with yolox as the table detector, for locating and parsing the tables on the previously selected pages
- LangChain with GPT-4-turbo-preview for querying the parsed tables to extract and reorder the necessary information
A page filter takes as input a PDF filepath and fills in the assets under the key pagefilter:
- src_pdf: the path to the original PDF
- selected_pages: the list of indices of the selected pages. The indices are 0-based.
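
To make the shape of these assets concrete, here is a minimal sketch of a filter that keeps every page. The function name keep_all_pages and its n_pages argument are assumptions made for the illustration; an actual filter would read the page count from the PDF itself.

```python
def keep_all_pages(pdf_filepath, n_pages, assets):
    """Hypothetical filter that selects every page of the document."""
    assets["pagefilter"] = {
        "src_pdf": pdf_filepath,
        # Page indices are 0-based, so an n-page report yields 0 .. n-1.
        "selected_pages": list(range(n_pages)),
    }
    return assets
```

For a three-page report, assets["pagefilter"]["selected_pages"] would then hold the indices 0, 1 and 2.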
The available filters are:
This filter does not perform any selection on the input document and simply copies the whole content as is.
This filter expects the pages to extract to be encoded in the input filename, either as a single page number or as a page range. Valid names are given below:
- arbitrarily_long_and_cumBerSOME_prefix_PAGENUMBER.pdf: gets the page numbered PAGENUMBER
- arbitrarily_long_and_cumBerSOME_prefix_PAGENUMBER1-PAGENUMBER2.pdf: gets the range [PAGENUMBER1, PAGENUMBER2]
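
The naming convention above can be parsed with a short regular expression. The sketch below is an illustration, not the library's implementation: the function name pages_from_filename is hypothetical, the range is assumed to be inclusive on both ends, and the page numbers are returned exactly as written in the filename, without any conversion to 0-based indices.

```python
import re
from pathlib import Path

# Matches a trailing "_PAGE" or "_PAGE1-PAGE2" at the end of the file stem.
_PAGES_PATTERN = re.compile(r"_(\d+)(?:-(\d+))?$")


def pages_from_filename(pdf_filepath):
    """Return the page numbers encoded at the end of the file name."""
    match = _PAGES_PATTERN.search(Path(pdf_filepath).stem)
    if match is None:
        raise ValueError(f"No page number found in {pdf_filepath!r}")
    first = int(match.group(1))
    # A single page number is treated as a range of length one.
    last = int(match.group(2)) if match.group(2) else first
    if last < first:
        raise ValueError(f"Invalid page range in {pdf_filepath!r}")
    # Assumption: the range [first, last] is inclusive on both ends.
    return list(range(first, last + 1))
```

With this sketch, a file named report_3-5.pdf gives the pages 3, 4 and 5, while a name without a trailing number raises a ValueError.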
The RFClassifier filter uses a random forest trained to identify the relevant pages from their text content. Several features are used to identify relevant pages, such as:
- the number of country names listed in the page
- the presence of keywords such as "tax", "countr", "report", "cbc", etc.
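
The two kinds of features listed above could be computed along the following lines. This is a sketch under stated assumptions: the country list is a tiny illustrative sample, the feature names are made up, and the exact features fed to the trained model are not documented here.

```python
# Assumption: a short illustrative sample, not the list used by the model.
COUNTRY_NAMES = ["france", "germany", "ireland", "japan", "brazil"]
# Keyword stems quoted in the text above.
KEYWORDS = ["tax", "countr", "report", "cbc"]


def page_features(page_text):
    """Build a small feature dictionary from the text of one page."""
    text = page_text.lower()
    features = {
        # Number of distinct country names that appear on the page.
        "n_countries": sum(1 for name in COUNTRY_NAMES if name in text),
    }
    for keyword in KEYWORDS:
        # Substring match, so the stem "countr" matches "country" and "countries".
        features[f"has_{keyword}"] = keyword in text
    return features
```

Matching on stems rather than whole words keeps the keyword list short, at the cost of occasional false hits ("tax" also matches "syntax", for instance).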
We allow multiple table extraction algorithms to be used simultaneously. This is the reason why the table_extraction key of the config.yaml is a list. A table extraction algorithm fills in the assets under the key table_extractors, which is a list containing the assets for every algorithm you considered. Every algorithm provides the following assets:
- id: a unique identifier for this algorithm
- type: the algorithm type, which can be any of the algorithms listed below: camelot, unstructured, unstructured_api, llama_parse
- params: the named parameters and their values given to the constructor of the algorithm
- tables: the list of extracted tables as pandas dataframes
The following table extractors can be considered :
ExtractTable is provided for legacy and benchmarking purposes. The ExtractTable Python module is no longer maintained, but it was the package originally used to extract data from PDF tables.
You can use it by specifying the following in the config.yaml:
table_extraction:
  - type: ExtractTableAPI
It requires an API key to be defined in your .env file:
# Required for table extraction with ExtractTable API
EXTRACTABLE_API_KEY=CHANGE_ME
Camelot is a python library for extracting tables. The documentation is available at https://camelot-py.readthedocs.io/en/master/.
Two flavors are available: stream or lattice. The flavor can be specified in the config as:
table_extraction:
  - type: Camelot
    params:
      flavor: stream
The Unstructured API is documented at
https://unstructured-io.github.io/unstructured/apis/api_sdks.html. In the config.yaml, you can specify any of the parameters considered by shared.PartitionParameters, although we already set strategy="hi_res", pdf_infer_table_structure="True".
For example, you can use their beta model chipper by setting the following in your config.yaml:
table_extraction:
  - type: UnstructuredAPI
    params:
      hi_res_model_name: chipper
This API requires an API key. You can create one at
https://unstructured.io/api-key-free.
Once you have your key, you must copy the sample .env.sample to .env:
cp .env.sample .env
and then set your key in the .env file, replacing CHANGE_ME in:
UNSTRUCTURED_API_KEY=CHANGE_ME
In addition to using the Unstructured API, you can also run Unstructured locally. The parameters specified in your config.yaml are passed to the partition_pdf function, although we already set strategy="hi_res", infer_table_structure=True.
For example, you can set the pdf_image_dpi as well as the table detection algorithm with:
table_extraction:
  - type: Unstructured
    params:
      pdf_image_dpi: 300
      hi_res_model_name: "yolox"
LlamaParse requires an API key.
To create a key, go to http://cloud.llamaindex.ai.
This key must be specified in the .env file (a sample file is provided as .env.example):
# Required for table extraction with LLAMA PARSE API
LLAMA_CLOUD_API_KEY=CHANGE_ME
You can then use LlamaParse in your configuration as below. The parameters are forwarded to the constructor of LlamaParse. For example, you can customize the verbosity:
table_extraction:
  - type: LlamaParse
    params:
      verbosity: False
Table cleaning is the last step of the pipeline, taking as input the parsed tables and extracting the relevant information. You can specify multiple table cleaners, which is why table_cleaning is a list in the config.yaml. Every list of tables extracted by every table extractor will be processed by every table cleaner.
The table cleaners append their assets to the list under the table_cleaners key. As with the table extractors, each table cleaner fills in the following assets:
- id: a unique identifier for the table cleaner execution
- type: the type of table cleaner
- params: the parameters given for the construction of the cleaner
- table: the output dataframe with the expected data per country
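
The "every extractor with every cleaner" behaviour described above amounts to a cross product. The sketch below illustrates it with plain Python objects. The function run_cleaners, the shape of the cleaners argument and the way the id is built are all assumptions made for the example, not the library's code.

```python
from itertools import product


def run_cleaners(assets, cleaners):
    """Apply every cleaner to the list of tables of every extractor.

    `cleaners` is a hypothetical mapping from a cleaner type to a callable
    that takes a list of tables and returns one cleaned table.
    """
    assets["table_cleaners"] = []
    pairs = product(assets["table_extractors"], cleaners.items())
    for extractor, (cleaner_type, clean) in pairs:
        assets["table_cleaners"].append(
            {
                # Assumption: the id combines the extractor id and cleaner type.
                "id": f"{extractor['id']}-{cleaner_type}",
                "type": cleaner_type,
                "params": {},
                "table": clean(extractor["tables"]),
            }
        )
    return assets
```

With two extractors and two cleaners, the table_cleaners list ends up with four entries, one per (extractor, cleaner) pair, which is worth keeping in mind when the cleaner calls a paid API.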
The list of available cleaners is given below:
The LangChain module can be used by specifying the following in the config.yaml:
table_cleaning:
  - type: LLM
    params:
      openai_model: "gpt-4-turbo-preview"
For now, we only support OpenAI models, but we may later also consider local models. For OpenAI models, you need an API key (see the OpenAI website) that must be provided in your .env file:
OPENAI_API_KEY=CHANGE_ME
With LangChain, you can also trace the LLM requests using LangSmith. Although optional, this can be useful for keeping an eye on the expenses for paid language models and for debugging the context/questions/answers. LangSmith requires an API key, created by logging in at https://smith.langchain.com, and a project name, both provided in your .env file as:
LANGCHAIN_API_KEY=CHANGE_ME
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT="country-by-country"
python3 -m venv name-of-your-venv
source name-of-your-venv/bin/activate
python3 -m pip install "poetry==1.4.0"
Install the dependencies:
poetry install
Add a dependency:
poetry add pandas
Update the dependencies:
poetry update
jupyter notebook
and check your browser!
pre-commit run --all-files
tox -vv
The country_by_country/pagefilter/RFClassifier filter uses a decision tree or random forests trained by the notebook below.
Two models look promising but do not produce the same results.