Text selection in pdf with poppler

danigm

Poppler is a PDF rendering library used in evince and okular.

At Yaco we are working on Evince accessibility, and since Evince uses Poppler, we are working on Poppler too.

Poppler does not select tables in "order". It uses the distance between pieces of text to decide what is a column, so it detects tables as columns and selects them in column order, when the logical way is row by row.

The same heuristic causes another selection problem when a PDF has columns that are close together, or text with wide spaces between words.
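To make the problem concrete, here is a toy sketch of a distance-based column heuristic. It is not Poppler's actual code: the function name `column_order`, the `(text, x, y)` tuples and the `gap` threshold are all assumptions made for illustration, and y is assumed to grow downwards.

```python
def column_order(words, gap=30.0):
    """Toy heuristic: split words into columns wherever the horizontal
    distance between neighbours exceeds `gap`, then read each column
    from top to bottom. `words` is a list of (text, x, y) tuples."""
    if not words:
        return []
    by_x = sorted(words, key=lambda w: w[1])
    columns = [[by_x[0]]]
    for w in by_x[1:]:
        # A large horizontal jump starts a new column.
        if w[1] - columns[-1][-1][1] > gap:
            columns.append([w])
        else:
            columns[-1].append(w)
    # Read every column top to bottom, one column after another.
    return [w[0] for col in columns for w in sorted(col, key=lambda w: w[2])]


# A 2x2 table whose logical reading order is row by row.
table = [("Name", 0, 0), ("Age", 100, 0), ("Ana", 0, 20), ("31", 100, 20)]
print(column_order(table))
```

With this kind of rule the two table columns are far enough apart to be treated as separate text columns, so the cells come out column by column instead of row by row.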

We looked at acroread to see how it selects columns and tables, and we realized that it selects text in "order", that is, in the order in which the text was put into the PDF file. To check that, we created a text PDF file with Inkscape.

So the selection logic is simple: we select the word nearest to the first selection point and the word nearest to the last selection point, and every word between those two words (in text order, no matter where the words are on screen) is selected too.
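The logic described above can be sketched in a few lines. This is only an illustration under simple assumptions, not the Poppler patch itself: the `Word` class, `nearest_index` and `select` are hypothetical names, each word is reduced to a single centre point, and the list `words` is assumed to already be in the order in which the text appears in the PDF file.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    x: float  # centre of the word on the page (assumed representation)
    y: float


def nearest_index(words, px, py):
    """Index of the word whose centre is closest to the point (px, py)."""
    return min(range(len(words)),
               key=lambda i: hypot(words[i].x - px, words[i].y - py))


def select(words, start, end):
    """Select every word between the words nearest to `start` and `end`,
    following text order and ignoring where the words sit on screen."""
    a = nearest_index(words, *start)
    b = nearest_index(words, *end)
    lo, hi = sorted((a, b))
    return [w.text for w in words[lo:hi + 1]]


# A 2x2 table that was written into the file row by row.
words = [Word("a1", 0, 0), Word("b1", 100, 0),
         Word("a2", 0, 20), Word("b2", 100, 20)]
print(select(words, (1, 1), (99, 19)))
```

Dragging from the top-left cell to the bottom-right cell returns the cells in file order, and dragging in the opposite direction gives the same range because the two indices are sorted before slicing.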

I have implemented that logic and it seems to work better than the current one. I made a video to show the new logic in action:

The first Evince instance uses the old Poppler selection, and the second one uses the new Poppler selection implementation. The PDF file used to make the video is here. That PDF was created with Inkscape and has two text columns; in each column there are a lot of spaces between words to confuse the old Poppler selection. There are two tables too: the first one was written in row order and the second one in column order.

You can see in the video how the new selection algorithm selects tables in "order", the first one in rows and the second one in columns. That is the right order because it is the order in which we inserted the text.

The new selection algorithm makes Evince accessibility work better, because Evince uses Poppler selection to get the PDF text that it exposes through the ATK interface.

The new algorithm is simpler than the current one. The hardest part was right-to-left selection: words are ordered in the PDF from left to right, so we need to treat text differently when it is RTL.

Comments

shakaran

Nice! This feature is amazing. I am waiting for this in all my university pdf documents.

It's really difficult to copy & paste the old way. I hope that your implementation will be included as soon as possible.

Keep up the good work!

xiangxw

"we select the nearest word to the first selection point and the nearest word to the last selection point, and every word between that two words (in text order, no matter where the words are at screen) is selected too"

I use this logic too. But sometimes it won't work, because the words Poppler returns are not in the right order. I think it's a bug in Poppler.

You can download a PDF file that shows this problem: https://bugs.freedesktop.org/attachment.cgi?id=50442 Page 2 of the PDF file is not in the right order.