Releases: HazyResearch/fonduer
v0.9.0
0.9.0 - 2021-06-22
This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.
Added
- @HiromuHota: Support spaCy v2.3. (#506)
- @HiromuHota: Add `HOCRDocPreprocessor` and `HocrVisualLinker` to support hOCR as input file. (#476) (#519)
- @YasushiMiyata: Add multiline Japanese strings support to `fonduer.parser.visual_parser.hocr_visual_parser`. (#534) (#542)
- @YasushiMiyata: Add commit process immediately after add to `fonduer.parser.Parser`. (#494) (#544)
Changed
- @HiromuHota: Renamed `VisualLinker` to `PdfVisualParser`, which assumes the following: (#518)
  - `pdf_path` should be a directory path, where PDF files exist, and cannot be a file path.
  - The PDF file should have the same basename (`os.path.basename`) as the document. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".
- @HiromuHota: Changed `Parser`'s signature as follows: (#518)
  - Renamed `vizlink` to `visual_parser`.
  - Removed `pdf_path`. Now this is required only by `PdfVisualParser`.
  - Removed `visual`. Provide `visual_parser` if visual information is to be parsed.
- @YasushiMiyata: Changed `UDFRunner`'s and `UDF`'s data commit process as follows: (#545)
  - Removed `add` process on single-thread in `_apply` in `UDFRunner`.
  - Added `UDFRunner._add` of `y` on multi-threads to `Parser`, `Labeler` and `Featurizer`.
  - Removed `y` of document parsed result from `out_queue` in `UDF`.
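The `PdfVisualParser` assumptions listed under Changed can be illustrated with a small standard-library sketch. This is not Fonduer's implementation: the helper name `find_pdf` is hypothetical, and it only mimics the stated rule that `pdf_path` is a directory and that the PDF shares the document's basename with a ".pdf" or ".PDF" extension.

```python
import os
import tempfile
from typing import Optional


def find_pdf(pdf_dir: str, document_path: str) -> Optional[str]:
    """Return the PDF in pdf_dir that matches document_path, or None.

    Hypothetical helper: it mimics the rule stated in the release notes
    and is not part of Fonduer.
    """
    if not os.path.isdir(pdf_dir):
        raise ValueError("pdf_path must be a directory, not a file path")
    # "html/123.html" -> "123"
    stem = os.path.splitext(os.path.basename(document_path))[0]
    for ext in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_dir, stem + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


# Usage: a temporary directory stands in for the PDF directory.
with tempfile.TemporaryDirectory() as pdf_dir:
    open(os.path.join(pdf_dir, "123.pdf"), "w").close()
    found = find_pdf(pdf_dir, "html/123.html")
    missing = find_pdf(pdf_dir, "html/456.html")
    found_name = os.path.basename(found)
```

Here "123.html" resolves to "123.pdf", while "456.html" has no matching PDF and yields `None`.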
Fixed
- @YasushiMiyata: Fix test code test_postgres.py::test_cand_gen_cascading_delete. (#538) (#539)
- @HiromuHota: Process the tail text only after child elements. (#333) (#520)
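The tail-text fix (#333) concerns ordering: an element's tail is the text that follows its closing tag, so it has to be emitted after the element's children. The sketch below shows that ordering with the standard library's `xml.etree.ElementTree`, which exposes the same `text`/`tail` attributes as lxml. The function name `iter_text_in_order` is hypothetical and this is not Fonduer's parser code.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def iter_text_in_order(element: ET.Element) -> Iterator[str]:
    """Yield text fragments of an element tree in document order."""
    # Text that appears before the first child.
    if element.text:
        yield element.text
    # Children come next, each handled recursively.
    for child in element:
        yield from iter_text_in_order(child)
    # The tail follows the element's closing tag, so it is emitted last.
    if element.tail:
        yield element.tail


root = ET.fromstring("<p>A<b>B<i>C</i>D</b>E</p>")
fragments = list(iter_text_in_order(root))
```

Emitting the tail before recursing into the children would place "E" ahead of "C" and "D", which is the kind of misordering the fix avoids.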
v0.8.3
0.8.3 - 2020-09-11
This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.
Added
- @YasushiMiyata: Add `get_max_row_num` to `fonduer.utils.data_model_utils.tabular`. (#469) (#480)
- @HiromuHota: Add `get_bbox()` to `Sentence` and `SpanMention`. (#429)
- @HiromuHota: Add a custom MLflow model that allows you to package a Fonduer model. See here for how to use it. (#259) (#407)
- @HiromuHota: Support spaCy v2.2. (#384) (#432)
- @wajdikhattel: Add multinary candidates. (#455) (#456)
- @HiromuHota: Add `nullables` to `candidate_subclass()` to allow NULL mention in a candidate. (#496) (#497)
- @HiromuHota: Copy textual functions in `data_model_utils.tabular` to `data_model_utils.textual`. (#503) (#505)
Changed
- @YasushiMiyata: Enable RegexMatchSpan to match against words concatenated with a separator, set via the sep="(separator)" option. (#270) (#492)
- @HiromuHota: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#421)
- @HiromuHota: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi. Since the Japanese tokenizer remains the same, there should be no impact on users. (#384) (#432)
- @HiromuHota: Log a stack trace on parsing error for better debug experience. (#478) (#479)
- @HiromuHota: `get_cell_ngrams` and `get_neighbor_cell_ngrams` yield nothing when the mention is not tabular. (#471) (#504)
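The effect of a separator option such as the one added to RegexMatchSpan can be sketched with the standard `re` module. The function `span_matches` below is hypothetical: it only shows the general idea of joining a span's words with `sep` before matching. Details of the real matcher, such as full versus partial matching, may differ.

```python
import re
from typing import List


def span_matches(words: List[str], pattern: str, sep: str = " ") -> bool:
    """Join the words of a span with sep and match the whole string.

    Hypothetical illustration; not the RegexMatchSpan implementation.
    """
    text = sep.join(words)
    return re.fullmatch(pattern, text) is not None


words = ["BC", "546", "A"]
joined_match = span_matches(words, r"BC546A", sep="")  # matches "BC546A"
spaced_match = span_matches(words, r"BC546A")  # tries "BC 546 A"
```

With an empty separator the tokenized part number is matched as one string, whereas the space-joined form does not match the same pattern.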
Deprecated
- @HiromuHota: Deprecated `bbox_from_span` and `bbox_from_sentence`. (#429)
- @HiromuHota: Deprecated `visualizer.get_box` in favor of `span.get_bbox()`. (#445) (#446)
- @HiromuHota: Deprecate textual functions in `data_model_utils.tabular`. (#503) (#505)
Fixed
- @senwu: Fix an issue where pdf_path could not be given without a trailing slash. (#442) (#459)
- @kaikun213: Fix bug in table range difference calculations. (#420)
- @HiromuHota: mention_extractor.apply with clear=True now works even if it's not the first run. (#424)
- @HiromuHota: Fix `get_horz_ngrams` and `get_vert_ngrams` so that they work even when the input mention is not tabular. (#425) (#426)
- @HiromuHota: Fix the order of args to Bbox. (#443) (#444)
- @HiromuHota: Fix the non-deterministic behavior in VisualLinker. (#412) (#458)
- @HiromuHota: Fix an issue that the progress bar shows no progress on preprocessing by executing preprocessing and parsing in parallel. (#439)
- @HiromuHota: Adapt to mlflow>=1.9.0. (#461) (#463)
- @HiromuHota: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL". (#473) (#477)
- @HiromuHota: Fix `_get_axis_ngrams` not to return `None` when the input is not tabular. (#481)
- @HiromuHota: Fix `Visualizer.display_candidates` not to draw rectangles on wrong pages. (#488)
- @HiromuHota: Persist doc only when no error happens during parsing. (#489) (#490)
v0.8.2
0.8.2 - 2020-04-28
A summary of the changes in this release is below. Check the Changelog for more details.
Deprecated
- @HiromuHota: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0. Please decorate them with `snorkel.labeling.labeling_function`.
Fixed
- @HiromuHota: Labeling functions can now be decorated with `snorkel.labeling.labeling_function`. (#400 <https://github.com/HazyResearch/fonduer/issues/400>) (#401 <https://github.com/HazyResearch/fonduer/pull/401>)
v0.8.1
0.8.1 - 2020-04-13
A summary of the changes in this release is below. Check the Changelog for more details.
Fonduer has a new `mode` argument to support switching between different learning modes (e.g., STL or MTL).
Example usage:
# Create task for each relation.
tasks = create_task(
    task_names=TASK_NAMES,
    n_arities=N_ARITIES,
    n_features=N_FEATURES,
    n_classes=N_CLASSES,
    emb_layer=EMB_LAYER,
    model="LogisticRegression",
    mode=MODE,
)
Added
- @senwu: Add `mode` argument in create_task to support `STL` and `MTL`.
v0.8.0
0.8.0 - 2020-04-07
A summary of the changes in this release is below. Check the Changelog for more details.
Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications and multi-task learning.
Example usage:
# With Emmental, you need to do the following steps to perform learning:
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
# 2. Wrap candidates into EmmentalDataLoader for training.
# 3. Training and inference (prediction).
import emmental
import fonduer
import numpy as np
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

# Collect the word counter from candidates, which is used in the LSTM model.
word_counter = collect_word_counter(train_cands)

# Initialize Emmental. To customize Emmental, see:
# https://emmental.readthedocs.io/en/latest/user/config.html
emmental.init(fonduer.Meta.log_path)

#######################################################################
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
#######################################################################
# Generate special tokens which are used by the LSTM model to locate mentions.
# In the LSTM model, we pad the sentence with special tokens to help the LSTM
# learn those mentions. Example:
# Original sentence: Then Barack married Michelle.
# -> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
arity = 2
special_tokens = []
for i in range(arity):
    special_tokens += [f"~~[[{i}", f"{i}]]~~"]

# Generate the word embedding module for the LSTM.
emb_layer = EmbeddingModule(
    word_counter=word_counter, word_dim=300, specials=special_tokens
)

# Create task for each relation.
tasks = create_task(
    ATTRIBUTE,
    2,
    F_train[0].shape[1],
    2,
    emb_layer,
    mode="mtl",
    model="LogisticRegression",
)

# Create the Emmental model to learn the tasks.
model = EmmentalModel(name=f"{ATTRIBUTE}_task")

# Add the tasks to the model.
for task in tasks:
    model.add_task(task)

#######################################################################
# 2. Wrap candidates into EmmentalDataLoader for training.
#######################################################################
# Here we only use the samples that have labels, i.e., we filter out the
# samples that don't have significant marginals.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]

# Create a dataloader with weakly supervised samples to learn the model.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE,
        train_cands[0],
        F_train[0],
        emb_layer.word2id,
        train_marginals,
        train_idxs,
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)

# Create the test dataloader to do prediction.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)

#######################################################################
# 3. Training and inference (prediction).
#######################################################################
# Learn those tasks.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])

# Predict based on the learned model.
test_preds = model.predict(test_dataloader, return_preds=True)
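The marginal filter in step 2 keeps only candidates whose label distribution is not flat. A plain-Python sketch of the same rule on toy data may make this concrete. The function name `informative_indices` and the sample values are hypothetical; the example itself uses NumPy (`np.where`) for this step.

```python
from typing import List, Sequence


def informative_indices(
    marginals: Sequence[Sequence[float]], tol: float = 1e-6
) -> List[int]:
    """Return indices of rows whose max and min marginals differ by more than tol."""
    return [i for i, row in enumerate(marginals) if max(row) - min(row) > tol]


# Toy marginals for three candidates and two classes.
toy_marginals = [
    [0.5, 0.5],  # flat: the labeling functions gave no usable signal
    [0.9, 0.1],
    [0.2, 0.8],
]
kept = informative_indices(toy_marginals)
```

The first row is dropped because its two class probabilities are equal, so it carries no training signal.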
Changed
- @senwu: Switch to Emmental as the default learning engine.
- @HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel 0.9.X. Accordingly, user-defined labels should now be 0-indexed (they used to be 1-indexed). (#310) (#320)
- @HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
- @HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
- @HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
- @HiromuHota: Comply with PEP 561 so that code using Fonduer can be type-checked.
- @HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)
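For the ABSTAIN change listed under Changed, existing labeling-function outputs need to be shifted to the new convention. The sketch below assumes the old abstain value was 0, which the notes do not state explicitly, and uses hypothetical names (`OLD_ABSTAIN`, `migrate_label`). It is a migration illustration, not Fonduer code.

```python
from typing import List

OLD_ABSTAIN = 0  # assumption: abstain value before v0.8.0
ABSTAIN = -1  # abstain value from v0.8.0 on, as stated in the notes


def migrate_label(old_label: int) -> int:
    """Map a 1-indexed label (old convention) to a 0-indexed one."""
    if old_label == OLD_ABSTAIN:
        return ABSTAIN
    # Class labels shift down by one: 1 -> 0, 2 -> 1, ...
    return old_label - 1


old_labels: List[int] = [0, 1, 2, 1]
new_labels = [migrate_label(label) for label in old_labels]
```

Under this assumption the old outputs `[0, 1, 2, 1]` become `[-1, 0, 1, 0]`.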
Fixed
- @senwu: Fix mention extraction to return mention classes instead of data model classes.