This repository has been archived by the owner on Mar 22, 2024. It is now read-only.
* Version 2.1.0, March 15, 2019
New in this version:
- CPR rule is no longer located on the scannerjob model.
- Matches now have Windows friendly paths.
- Clicking a path in a filescan report will automatically copy it to the clipboard.
- Ordinary users can create domains again.
- Better handling of LibreOffice processes.
- Dead links found during webscan can be viewed properly in the report again.
- Interface for validating domains works again.
- A number of bug fixes: #27664, #27407 and #27037
* Version 2.0.0, January 17, 2019
New in this version:
- Exchange scan is now supported. The Exchange folders are downloaded first; when the download finishes, a file scan is started on the downloaded data.
- New scanjobs are now started through RabbitMQ.
- Scans now process only the file types the datascanner supports: text, CSV, XML, HTML, PDF, image files (JPG, PNG, TIFF, GIF) and MS Office files.
- XML processor added.
- Better error handling during scans.
- A number of bug fixes: #26044, #26727 (the most important one: sitemap processing was not always finished before the scan started), #24001 and #25362
* Version 1.8.0.6, September 25, 2018
Hotfix:
- Cherry-picked enhancements made during filescan development.
- The store_stats method in run.py now makes sure only one statistics object can be present. Before the fix, the error corrupted the scanjob report and made it return HTTP 500.
- Some HTML pages raised a UnicodeDecodeError during scan. Decoding is now wrapped in a try/except block.
- Encoding detection is now done by a method in utils.py using chardet.
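A minimal sketch of the two decoding fixes above: detect the encoding, then decode inside a try/except so a bad page cannot abort the scan. The helper name decode_body and the fallback order are assumptions; the actual method in utils.py is not shown in this file. chardet is only used when it is installed.

```python
try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def decode_body(raw, default="utf-8"):
    """Decode bytes to text without raising UnicodeDecodeError.

    Hypothetical helper: the name and the fallback order are assumptions.
    """
    encoding = None
    if chardet is not None:
        # chardet.detect returns a dict whose 'encoding' value may be None.
        encoding = chardet.detect(raw).get("encoding")
    try:
        return raw.decode(encoding or default)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown encoding: keep going and replace undecodable bytes.
        return raw.decode(default, errors="replace")
```

Replacing undecodable bytes instead of raising trades exactness for robustness, which seems to match the intent of the hotfix.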
* Version 1.8.0.5, August 20, 2018
Hotfix:
- Fixed change password page so it does not return error code 500.
- Fixed orgs and domains page so it does not return error code 500.
- Removed unnecessary logging from the scanner log shown in the final report.
* Version 1.8.0.4, August 14, 2018
Hotfix:
- Fixed cron job error. Because of inheritance, scanner.domains is defined at the subclass level, so the cron job now collects scanners by subclass type.
* Version 1.8.0.3, August 10, 2018
Hotfix:
- Corrected IndexError during processor logging.
* Version 1.8.0.2, August 8, 2018
Hotfix:
- cron.py was failing due to wrong import path.
- Migration was failing. Corrected error.
- In production the number of processors should be 8.
* Version 1.8.0.1, July 30, 2018
Hotfix:
- Webscanner threw an error when extracting response objects that are files. It now checks whether the response object is an HTMLResponse object.
- Default email address is changed from [email protected] to [email protected] which is the mail group related to webscanner.
* Version 1.8.0, June 20, 2018
New in this version:
- Network share scan feature is added.
- Ad hoc rules can now be combined into a set of rules.
- The models.py file has been refactored from one very large file into many small ones.
- Simple scan statistics have been added to reports.
- Upgraded to Django 1.11.9
* Version 1.7.1, May 28, 2018
Hotfix:
- The digital rights management bit is now ignored when it is set on PDF files.
- Enhanced logging for pdftohtml conversion subprocess call.
- UnicodeEncodeError fixed when doing md5 calculation.
* Version 1.7.0, May 25, 2018
Hotfix:
- Webscanner now uses Scrapy 1.5.0, the latest version. This solves HTTPS scanning problems.
* Version 1.6.1, December 6, 2017
New in this version:
- Rules and Organization are now on the same page divided by tabs.
- Reports and Summaries are now on the same page divided by tabs.
- Front page now contains tiles.
- 'Scannere' is changed to 'Scannerjobs'.
- English words in the interface are changed to Danish.
* Version 1.6.0.1, May 8, 2017
Hotfix:
- Limit broken links to 100 for consistency with critical matches (and
performance).
- Don't allow customer's user admin to set the "is_staff" flag (for
now).
* Version 1.6.0, May 2, 2017
New in this version:
- User administration for site superusers and reset password function
for all users.
- Reports page is paginated to reduce load time.
- Visual cue if report has broken links or critical matches.
- Better indication of how to find broken links on the scanned pages.
- Bug fix: Scan no longer fails if broken link scan is on and an
invalid URL is encountered.
* Version 1.5.2, March 16, 2017
New in this version:
- Change subject of scan mail to include clear warning if critical
matches.
- It's now possible for organizations to only receive mail if scans
have critical matches.
* Version 1.5.1.1, April 1, 2016
Hotfix:
- Committed migration for the CPR whitelist fields.
- Added new system dependency on mailutils.
* Version 1.5.1, April 1, 2016
New in this version:
- Simple, cron-based monitoring of the process_manager program.
* Version 1.5.0, March 30, 2016
New in this version:
- New, more inclusive CPR number rule.
- Local whitelists for CPR numbers.
- Upgraded to Django 1.8
* Version 1.4.1.3, March 23, 2016
Hotfix:
- Handle multiline regular expressions; cut the match off if it exceeds
1024 characters.
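A rough sketch of multiline matching with the 1024-character cutoff described above. The function name find_matches and the choice of re.DOTALL and re.MULTILINE are assumptions; the notes do not say which flags the scanner uses.

```python
import re

MAX_MATCH_LENGTH = 1024  # cutoff stated in the release note


def find_matches(pattern, text, limit=MAX_MATCH_LENGTH):
    """Return the matched strings, each truncated to `limit` characters."""
    # re.DOTALL lets '.' span newlines; re.MULTILINE makes ^ and $ work per line.
    # Both flags are assumptions made for this sketch.
    regex = re.compile(pattern, re.DOTALL | re.MULTILINE)
    return [m.group(0)[:limit] for m in regex.finditer(text)]
```

For example, find_matches(r"BEGIN.*END", text) on a text with several thousand characters between BEGIN and END yields one match of exactly 1024 characters.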
* Version 1.4.1.2, February 23, 2016
Hotfix:
- Don't crash if sitemap URLs have errors, just log it and keep going.
* Version 1.4.1.1, October 23, 2015
New in this version:
- Better handling of PDF files in latin-1.
* Version 1.4.1, October 13, 2015
New in this version:
- Fix bug in LibreOffice conversion
* Version 1.4.0, June 12, 2015
New in this version:
- Hide ad hoc scanners created for spreadsheet scans.
- Introduce job to clean up non-terminating LibreOffice processes.
- Consistent branding throughout (the product's name is 'OS2Webscanner'
and nothing else).
- Include context and page number for CPR matches to make them easier
to find.
- Check Last-Modified in meta http-equiv HTML headers.
- Visibly mark links in reports as visited.
- Performance: Save MD5 sums of binary files to avoid scanning the same
file several times.
- Better diagnostic information when conversion fails.
- Collect cookies (so far only server-side cookies are supported).
- Fix bugs in and improve installation guide.
- Proper OCR of documents that are scanned upside down.
- Improve layout of report.
- More consistent terminology.
- Include information about cookies in summary report.
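The MD5 performance item above can be pictured roughly like this. It is an in-memory sketch only: the class name SeenFiles is hypothetical, and where the scanner actually stores the sums is not stated here.

```python
import hashlib


class SeenFiles:
    """Remember MD5 sums of binary content that has already been scanned."""

    def __init__(self):
        self._seen = set()

    def should_scan(self, data):
        """Return True the first time this content is seen, False afterwards."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Two files with different names but identical bytes give the same digest, so only the first one is scanned.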
* Version 1.3.3, March 13, 2015
New in this version:
- Make it possible to limit scanning of spreadsheets to certain columns.
* Version 1.3.2, March 5, 2015
New in this version:
- Delivery to Slagelse approved and stabilized.
- Fixed encoding issue by adding a UTF-8 BOM to the CSV output file.
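A small sketch of writing a CSV file with a UTF-8 BOM, as in the fix above. The function name write_report is an assumption. The semicolon delimiter follows the 1.3.0 notes further down in this file.

```python
import csv


def write_report(path, rows):
    """Write rows to a CSV file that starts with a UTF-8 BOM."""
    # 'utf-8-sig' makes Python emit the BOM (EF BB BF) at the start of the file.
    # newline="" is what the csv module expects for files it writes to.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerows(rows)
```

The BOM is what lets spreadsheet programs that would otherwise guess a legacy encoding read Danish characters correctly.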
* Version 1.3.1.3, February 23, 2015
Hotfix:
- Allow domains to be excluded, i.e. allow subdomain to be excluded from a
*-domain.
* Version 1.3.1.2, February 18, 2015
Hotfix:
- Add init script for automatic start of process manager.
* Version 1.3.1.1, February 18, 2015
Hotfix:
- Fix UnicodeEncodeError when users uploaded a spreadsheet with Danish
characters in file name.
* Version 1.3.1, February 5, 2015
New in this version:
- Address scanning adjusted to give a match whenever a valid street name is
found and a critical match if a street name is found with a house number.
- Name scanning will eliminate leading and trailing capitalized words to
avoid false negatives.
* Version 1.3.0.1, January 28, 2015
Hotfix:
- Format of returned CSV fixed.
* Version 1.3.0, January 28, 2015
New in this version:
- Support for spreadsheets with the option to modify (hide) data found by
the address, name or CPR rule.
- Name scanning enhanced to support more liberal formatting of names,
including hyphen-concatenation and abbreviations.
- Address scanning is implemented. In order to yield a match, at least an
existing Danish street name must be found.
- A special web client - basically, an upload form - allows users to upload
spreadsheets for scanning. A new "upload only" user profile is for users who can
*only* access that upload form and nothing else.
- For each organisation, a global whitelist and blacklist has been added for
name and address scanning.
- Whitelists and blacklists are available under the menu item "Oplæring".
- Reports from scanning of spreadsheets are visible in the reports list - as
opposed to the reports from API calls in previous versions.
- CSV output files always use semicolon as separator (not comma, and not
defaults).
- Default values for replacement text are fixed as "NAVN" for name rule,
"ADRESSE" for address rule and "xxxxxx-xxxx" for CPR rule.
- Bug fix: Name scanning will now match names with more than two middle names.
- Bug fix: Don't attempt to link to file URLs.
- Bug fix: Domains are not listed on report when scanning files or scanning
URL list through API.
- Bug fix: Blacklisting of street names and not just individual addresses now
works.
NOTE: All of the functionality in this release has been requested and funded by
Slagelse Kommune.
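To illustrate the replacement defaults listed above, here is a sketch that hides CPR-like numbers in a spreadsheet cell. The regular expression is a simplified, hypothetical stand-in for the real CPR rule, and hide_cpr is not a function from the project.

```python
import re

# Replacement defaults quoted in the release notes.
REPLACEMENTS = {"name": "NAVN", "address": "ADRESSE", "cpr": "xxxxxx-xxxx"}

# Simplified, hypothetical pattern: six digits, optional hyphen, four digits.
CPR_PATTERN = re.compile(r"\b\d{6}-?\d{4}\b")


def hide_cpr(cell):
    """Replace anything that looks like a CPR number in a cell value."""
    return CPR_PATTERN.sub(REPLACEMENTS["cpr"], cell)
```

With this sketch, hide_cpr("CPR: 010203-1234 ok") returns "CPR: xxxxxx-xxxx ok".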
* Version 1.2.0, November 14, 2014
New in this version:
- Summary report - summaries of certain scanners' results which may be
emailed to users.
- Automatic retrieval of sitemap.xml, i.e. no explicit upload required.
- You can now specify a number of individual recipients of scan reports.
The organization contact will no longer receive these emails by default.
- Details of specific scan are always copied to the scan report; a technical
occurrence log has been added to the report to help diagnose scan failures
and conversion errors, etc.
- Minimize false positives: Many more irrelevant CPR matches are ignored.
- Group concept to enforce access limitations within one organization is
mostly implemented; due to some issues, its use is discouraged until next
sprint.
- Bug fix: CSV file with report was broken in some cases.
- Overview page listing domains per organisation is now available for
superusers.
- Pooling of PostgreSQL connections is now possible (but STRONGLY
DISCOURAGED!)
- Scanner listing is now sorted by date.
- The Name rule has been completely removed from the GUI.
- Small images are ignored, i.e. not OCR'ed.
- Organization is included in all lists where relevant (superuser only).
- Bug fix: Scheduled scans didn't run if set for specific dates.
- Admin-related menu items have been consolidated in an admin menu.
- Installation instructions have been updated and now work.
- Installation instructions include how to set up crontab for scheduled
scanning.
- Links in scan reports now open in a new window.
- Minimize false positives: CPR matches are ignored in inline style tags.
- New widget to maintain many to many links, including a search field to
support very long lists.
- Improved GUI design.
- Spreadsheets in formats XLSX/XLS/ODS are now scanned correctly.
- Improvements and bug-fixes to the XML-RPC interface.
- Sample RPC client has been improved significantly and now supports passing
parameters for the scan, and saving the report to a CSV file.
- Schedule information now shown as Yes/No in scanner list.
- Documents sent and scanned via XML-RPC now retain file names in the URL.
* Version 1.1.0.5, November 4, 2014
- Bug fix: Failed conversion items did not get their temporary directories
removed.
- Better error handling in processing of conversion items if an
image's dimensions cannot be determined.
* Version 1.1.0.4, October 29, 2014
- Performance improvement: When the number of OCR items per scan reaches a
certain limit, non-OCR conversions are paused to allow the number of OCR
items to fall to a reasonable level before being resumed again. For large
scans with OCR enabled, this is necessary because so many OCR items are
extracted from PDFs or Office documents that it exhausts the number of
available inodes on the filesystem.
- Connection pooling is removed due to database problems.
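The pause-and-resume behaviour in the performance item above can be sketched as a two-threshold switch. The class name OcrThrottle and both threshold values are hypothetical; the notes give no numbers.

```python
class OcrThrottle:
    """Pause non-OCR conversions while too many OCR items are pending."""

    def __init__(self, pause_at=1000, resume_at=200):
        # Both thresholds are hypothetical; the release note gives no numbers.
        self.pause_at = pause_at
        self.resume_at = resume_at
        self.paused = False

    def update(self, pending_ocr_items):
        """Record the current OCR backlog; return True if non-OCR work may run."""
        if not self.paused and pending_ocr_items >= self.pause_at:
            self.paused = True
        elif self.paused and pending_ocr_items <= self.resume_at:
            self.paused = False
        return not self.paused
```

Using a lower resume threshold than pause threshold keeps the conversions from flapping on and off around a single limit.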
* Version 1.1.0.3, October 10, 2014
New in this version:
- Bug fix: Catch InternalError exception in process_manager.py.
- Conversion queue items of a given type are now picked randomly from among
the active scans to avoid one scan hogging the queue.
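A sketch of the random selection described above: pick the scan at random first, then take its oldest item. The queue layout (a dict mapping scan id to a list of pending items) and the name pick_next_item are assumptions for illustration.

```python
import random


def pick_next_item(queue, rng=random):
    """Pick the next conversion item, choosing the scan at random first.

    `queue` maps a scan id to its list of pending items (hypothetical layout).
    """
    active = [scan_id for scan_id, items in queue.items() if items]
    if not active:
        return None
    scan_id = rng.choice(active)
    return scan_id, queue[scan_id].pop(0)
```

Because every active scan is equally likely to be chosen, a scan with thousands of queued items cannot starve a scan with only a few.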
* Version 1.1.0.2, October 10, 2014
New in this version:
- Bug fix: The CPR check used the wrong digit to validate the birth date. The
CPR rule was using the 8th digit instead of the 7th digit of the CPR number
to validate the birth date of CPR numbers.
- Periodically delete conversion queue items from finished scans that may
not have been properly cleaned up when they finished.
- Ignore and delete images extracted from converted files if their
dimensions do not meet the minimum: both width and height must be
>= 7 pixels and at least one dimension must be >= 64 pixels.
- Scans are logged to separate files in var/logs/scans rather than
polluting the web server's error log file.
- Pool PostgreSQL connections to avoid too heavy load on the database server.
- System status page for superusers.
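Two of the rules above can be written down as short functions. The century table in cpr_birth_date is the commonly published mapping from the 7th CPR digit to a century; whether the project's rule uses exactly this table is an assumption. Both function names are hypothetical.

```python
from datetime import date


def cpr_birth_date(cpr):
    """Return the birth date encoded in a 10-digit CPR string, or None."""
    digits = cpr.replace("-", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return None
    day, month, year = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    seventh = int(digits[6])  # the 7th digit selects the century
    if seventh <= 3:
        century = 1900
    elif seventh in (4, 9):
        century = 2000 if year <= 36 else 1900
    else:  # 5 to 8
        century = 2000 if year <= 57 else 1800
    try:
        return date(century + year, month, day)
    except ValueError:
        return None  # not a real calendar date, e.g. 31 February


def keep_image(width, height):
    """Both sides must be >= 7 pixels and at least one side >= 64 pixels."""
    return min(width, height) >= 7 and max(width, height) >= 64
```

A candidate number whose first six digits and 7th digit do not give a real date can then be discarded before any further checks.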
* Version 1.1.0, September 23, 2014
New in this version:
- Better parsing of names to avoid false positives.
- Linkchecker functionality included.
- Modulus-11 tweaking.
- A huge number of minor GUI fixes.
- Web service for scanning URLs or documents from other programs etc.
- Set proper permissions for scan data.
- Use Last-Modified check to disregard previous scans.
- Allow scanning of subdomains.
- Disk usage/disk performance improvements.
- 10 seconds timeout on PDF files in the link checker.
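For reference, the standard modulus-11 check for CPR numbers looks roughly like this. The weights are the commonly published ones; what the "tweaking" in this release changed is not described here, so this sketch shows only the basic check. The name passes_modulus_11 is hypothetical.

```python
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def passes_modulus_11(cpr):
    """Return True if the weighted digit sum is divisible by 11."""
    digits = cpr.replace("-", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, CPR_WEIGHTS))
    return total % 11 == 0
```

The made-up number 010101-0007 has a weighted sum of 22 and passes, while 010101-0008 (sum 23) does not.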