fixes #65, getting cyrillic going
dbashford committed Nov 23, 2015
1 parent 8dcd784 commit 936b890
Showing 5 changed files with 12 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -155,6 +155,7 @@ textract.fromUrl(url, config, function( error, text ) {})

### 1.2.0
* [#66](https://github.com/dbashford/textract/issues/66). textract will no longer put the info text to stdout about the extractors not being available or installed correctly. Instead, if you attempt to use a supported extractor that did not initialize correctly, you will get an updated error message indicating that the type is supported by textract but that external dependencies were not located. As part of this update, error messages were updated a bit to list both the type and the file.
+* [#65](https://github.com/dbashford/textract/issues/65). Fixed an issue where `.odt` and `.docx` files containing non-Latin characters (e.g. Cyrillic) were being stripped entirely of their content.

### 1.1.2
* [#63](https://github.com/dbashford/textract/pull/63). PR added support for CSV.
1 change: 0 additions & 1 deletion lib/util.js
@@ -41,7 +41,6 @@ var replaceTextChars = function(text) {
   return text.trim()
     .replace( SINGLE_QUOTES, "'" )
     .replace( DOUBLE_QUOTES, '"' )
-    .replace( NON_ASCII_CHARS, '' )
     .replace( MULTI_SPACES, ' ' );
 };

Binary file added test/files/cyrillic.docx
Binary file not shown.
10 changes: 10 additions & 0 deletions test/general_test.js
@@ -63,6 +63,16 @@ describe('textract', function() {
     });
   });

+  it('can handle cyrillic', function(done) {
+    var filePath = path.join(__dirname, 'files', 'cyrillic.docx');
+    fromFileWithPath(filePath, function( error, text ) {
+      expect(error).to.be.null;
+      expect(text).to.be.a('string');
+      expect(text.substring(0,100)).to.eql( "Актуальность диссертационного исследования определяется необходимостью развития методологического об" );
+      done();
+    });
+  });
+
   describe("with multi line files", function() {
     it('strips line breaks', function(done) {
       var filePath = path.join(__dirname, 'files', 'multi-line.txt');
2 changes: 1 addition & 1 deletion test/util_test.js
@@ -5,7 +5,7 @@ describe('textract util', function() {
   it('should normalize text output', function() {
     var text = " “” ‘’ ą \n\n some text";
     var result = util.replaceTextChars(text);
-    expect(result).to.equal("\"\" '' \n\n some text");
+    expect(result).to.equal("\"\" '' ą \n\n some text");
   });

 });
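
The effect of dropping the `.replace( NON_ASCII_CHARS, '' )` step in `lib/util.js` can be illustrated with a small standalone sketch. The regular expressions below are hypothetical stand-ins, since the actual constant definitions are not shown in this diff. In particular, `NON_ASCII_CHARS` is assumed to match every character outside the 7-bit ASCII range, and the function names `replaceTextCharsBefore` and `replaceTextCharsAfter` are illustrative only.

```javascript
// Hypothetical stand-ins for the constants in lib/util.js; the real
// definitions are not part of this diff.
const SINGLE_QUOTES = /[\u2018\u2019]/g;   // curly single quotes
const DOUBLE_QUOTES = /[\u201C\u201D]/g;   // curly double quotes
const NON_ASCII_CHARS = /[^\x00-\x7F]/g;   // assumed: anything outside ASCII
const MULTI_SPACES = /[ ]{2,}/g;           // assumed: runs of two or more spaces

// Behavior before the commit: the non-ASCII strip is still in the chain.
function replaceTextCharsBefore(text) {
  return text.trim()
    .replace(SINGLE_QUOTES, "'")
    .replace(DOUBLE_QUOTES, '"')
    .replace(NON_ASCII_CHARS, '')
    .replace(MULTI_SPACES, ' ');
}

// Behavior after the commit: the non-ASCII strip is removed.
function replaceTextCharsAfter(text) {
  return text.trim()
    .replace(SINGLE_QUOTES, "'")
    .replace(DOUBLE_QUOTES, '"')
    .replace(MULTI_SPACES, ' ');
}

const sample = 'Актуальность  “исследования”';

// Every Cyrillic letter is outside ASCII, so only the quotes survive.
console.log(JSON.stringify(replaceTextCharsBefore(sample)));
// prints " \"\""

// Quotes and spacing are still normalized, and the text is preserved.
console.log(JSON.stringify(replaceTextCharsAfter(sample)));
// prints "Актуальность \"исследования\""
```

Under these assumptions, the quote normalization runs before the strip, which is why curly quotes survived the old chain while all Cyrillic text did not.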
