The client is an accounting researcher. She analyzed publicly available corporate documents in order to test a hypothesis.
Every publicly traded corporation is required to publish a document that describes its activities. This document, Form 10-K, contains a comprehensive summary of the company’s financial performance. Part I, Item 1 is of particular interest: it describes the company’s business, including what the company does and what markets it operates in.
In addition to publishing Form 10-K, companies must also declare which industry classes they belong to. Industry classes follow the Standard Industrial Classification (SIC), in which each industry has its own four-digit code.
The hypothesis states that, following certain legal changes in 1998, companies began to underreport the industry classes they belong to.
We developed a classifier loosely based on the Paragraph Vector model to help test the hypothesis. The classifier took a company’s Form 10-K as input and output the industry classes the company should be assigned to. The list of predicted industry classes was then compared with the list of reported industry classes to check whether the data supported the hypothesis.
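The comparison step can be illustrated with a minimal sketch. It is not the project's actual code: the names `ClassComparison`, `compare_industry_classes` and `underreporting_rate` are hypothetical, the SIC codes in the usage note are placeholders, and the sketch assumes that a company counts as underreporting when the classifier predicts at least one industry class the company did not report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassComparison:
    """Result of comparing predicted and reported classes (hypothetical type)."""
    predicted_only: frozenset  # predicted by the classifier, not reported
    reported_only: frozenset   # reported by the company, not predicted
    agreed: frozenset          # present in both lists


def compare_industry_classes(predicted, reported):
    """Compare two lists of four-digit SIC codes for a single company.

    Codes are kept as strings so that leading zeros are preserved.
    """
    predicted_set = frozenset(predicted)
    reported_set = frozenset(reported)
    return ClassComparison(
        predicted_only=predicted_set - reported_set,
        reported_only=reported_set - predicted_set,
        agreed=predicted_set & reported_set,
    )


def underreporting_rate(comparisons):
    """Share of companies with at least one predicted-but-unreported class.

    Assumption: this is one possible summary measure, not necessarily
    the one used in the study.
    """
    comparisons = list(comparisons)
    if not comparisons:
        return 0.0
    flagged = sum(1 for c in comparisons if c.predicted_only)
    return flagged / len(comparisons)
```

For example, `compare_industry_classes(["7372", "3571"], ["7372"])` puts `"3571"` in `predicted_only`, which would flag the company as a possible case of underreporting. Computing such a rate separately for filings before and after 1998 is one way the two periods could be compared.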