I am searching for a JavaScript library that can read .doc and .docx files. The focus is only on the text content. I am not interested in pic
You can use docxtemplater for this. It is normally used for templating, but it can also just return the text of the document:
// `content` is the binary content of the .docx file
var zip = new JSZip(content);
var doc = new Docxtemplater().loadZip(zip);
var text = doc.getFullText();
console.log(text);
See the docs for installation information (I'm the maintainer of this project).
However, it only handles .docx, not .doc.
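Since only .docx is handled here, it can help to check which kind of file you actually have before parsing it. Below is a minimal sketch, not part of docxtemplater: `detectWordFormat` is a hypothetical helper name. It relies on the fact that a .docx file is a ZIP archive (starting with the bytes `PK\x03\x04`), while a legacy .doc file is an OLE compound file (starting with `D0 CF 11 E0 A1 B1 1A E1`). A ZIP signature does not prove the file is a Word document, so treat the result as a hint only.

```javascript
// Leading bytes ("magic numbers") of the two container formats.
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]); // "PK\x03\x04", used by .docx
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]); // legacy .doc

// Hypothetical helper: guess the Word format from the first bytes of a file.
// Returns 'docx' for any ZIP archive, 'doc' for any OLE compound file,
// and 'unknown' otherwise.
function detectWordFormat(buffer) {
  if (buffer.subarray(0, ZIP_MAGIC.length).equals(ZIP_MAGIC)) return 'docx';
  if (buffer.subarray(0, OLE_MAGIC.length).equals(OLE_MAGIC)) return 'doc';
  return 'unknown';
}
```

You could call it as `detectWordFormat(require('fs').readFileSync(somePath))` and only hand the content to docxtemplater when it returns 'docx'.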
Now you can extract the text content from .doc/.docx files without installing external dependencies.
You can use the Node library called any-text.
Currently, it supports a number of file extensions, such as PDF, XLSX, XLS and CSV.
Usage is very simple:
npm i -D any-text
Use the getText method to read the text content:
var reader = require('any-text');
reader.getText(`path-to-file`).then(function (data) {
  console.log(data);
});
You can also use the async/await notation (await has to run inside an async function here):
var reader = require('any-text');
(async function () {
  const text = await reader.getText(`path-to-file`);
  console.log(text);
})();
Sample test file (Mocha with Chai):
var reader = require('any-text');
const chai = require('chai');
const expect = chai.expect;
describe('file reader checks', () => {
  it('check doc file content', async () => {
    expect(
      await reader.getText(`${process.cwd()}/test/files/dummy.doc`)
    ).to.contains('Lorem ipsum');
  });
});
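If you need to read several files at once, the promise returned by getText combines well with Promise.allSettled. The following is only a sketch under assumptions: `readAllTexts` is a hypothetical helper, not part of any-text, and the reader function is passed in as a parameter so the helper works with any function that takes a path and returns a promise of text.

```javascript
// Hypothetical helper: read many files with a promise-returning reader
// (for example, a wrapper around any-text's getText).
// One failing file does not abort the others.
async function readAllTexts(getText, filePaths) {
  // The async arrow turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(
    filePaths.map(async (filePath) => getText(filePath))
  );
  return results.map((result, i) => ({
    file: filePaths[i],
    ok: result.status === 'fulfilled',
    text: result.status === 'fulfilled' ? result.value : null,
    error: result.status === 'rejected' ? String(result.reason) : null,
  }));
}
```

With any-text this could be called as `readAllTexts((p) => reader.getText(p), ['a.doc', 'b.docx'])`, where the file names are placeholders.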
I hope it will help!