Nice tool. I was really hoping, though, that it could help me tabulate messy text data copied from PDFs.
For example, from the Philips 2018 annual report [0], I copied the income statement, and I get the text below when I paste it. I found it impossible to get this into Excel or any other table format without writing a Python program for it. Your tool still put everything into one column. If you can find a way to automatically detect the 3 numeric columns below, you could have a large audience of finance folk analyzing PDF documents.
Sales. 17,422 17,780 18,121
Cost of sales (9,484) (9,600) (9,568)
Gross margin 7,939 8,181 8,554
Selling expenses (4,142) (4,398) (4,500)
General and administrative expenses (658) (577) (631)
Research and development expenses (1,669) (1,764) (1,759)
6 Other business income. 17 152 88
6 Other business expenses. (23) (76) (33)
6 Income from operations. 1,464 1,517 1,719
7 Financial income. 65 126 51
7 Financial expenses. (507) (263) (264)
Investments in associates, net of income taxes 11 (4) (2)
Income before taxes 1,034 1,377 1,503
8 Income tax expense. (203) (349) (193)
Income from continuing operations 831 1,028 1,310
3 Discontinued operations, net of income taxes. 660 843 (213)
Net income 1,491 1,870 1,097
Attribution of net income
Net income attributable to Koninklijke Philips N.V. shareholders 1,448 1,657 1,090
Net income attributable to non-controlling interests 43 214 7
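For reference, the kind of Python program mentioned above can be fairly short. This is only a minimal sketch, not how the tool works: it assumes every data row ends in exactly three whitespace-separated numeric tokens, that parentheses mean negative values, and that a leading digit in front of a label is a note reference that can be dropped. The names `parse_line` and `to_tsv` are made up for illustration.

```python
import csv
import io
import re

# A numeric token: digits with optional thousands commas, optionally in
# parentheses (accounting notation for a negative value).
NUMERIC = re.compile(r"^\(?\d[\d,]*\)?$")
# Assumption: a leading number followed by a word is a note reference.
NOTE_PREFIX = re.compile(r"^\d+\s+(?=[A-Za-z])")


def to_number(token):
    """Convert '1,464' to 1464 and '(203)' to -203."""
    negative = token.startswith("(") and token.endswith(")")
    value = int(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_line(line, n_cols=3):
    """Split a line into [label, n1, ..., nN], or return None for non-data lines."""
    parts = line.strip().rsplit(None, n_cols)  # peel tokens off the right
    if len(parts) != n_cols + 1 or not all(NUMERIC.match(t) for t in parts[1:]):
        return None
    # Drop the note reference and any trailing dot left over from the PDF.
    label = NOTE_PREFIX.sub("", parts[0]).rstrip(". ")
    return [label] + [to_number(t) for t in parts[1:]]


def to_tsv(text, n_cols=3):
    """Turn pasted text into tab-separated rows that paste into a spreadsheet."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue
        row = parse_line(line, n_cols)
        # Lines without three trailing numbers (section headings) stay as one cell.
        writer.writerow(row if row else [line.strip()])
    return out.getvalue()


if __name__ == "__main__":
    sample = "Sales. 17,422 17,780 18,121\nCost of sales (9,484) (9,600) (9,568)\n"
    print(to_tsv(sample))
```

Splitting from the right is what makes this work without a delimiter: the labels contain spaces and commas, but the last three tokens of each data row are always the numbers. It would break on rows with a blank cell, so it is a starting point rather than a general solution.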
Tabula is a helpful tool for extracting tables from PDFs, although it's more for large tables of data, often spanning many pages, rather than the odd copy-and-paste.
As for your specific example, you can download tables from EDGAR in other formats, like HTML and iXBRL. The HTML table will usually paste into Excel well.
The unfortunate part is that the tool parses the data based on the characters it finds in the text being processed. I'm guessing the data is positioned in the document using X/Y coordinates, so the column structure is lost when you copy it from your PDF reader, which is why it can't be formatted correctly.
I will definitely look at the document to see whether my assumptions are incorrect. If a different delimiter is being used, it may be something I can work with.
[0] https://www.philips.com/c-dam/corporate/about-philips/sustai...