xPDFSearch 1.11 - Content plugin to search text in PDF files

Lefteous · Post by *Lefteous » 2019-04-16, 22:50 UTC

As announced, xPDFSearch is now a GitHub project. The idea is to improve source code management and collaboration. If you want to contribute, commit to your own remote feature branch and open a pull request.
https://github.com/lefteous-tc/xPDFSearch

hhk · Post by *hhk » 2020-01-21, 16:50 UTC

Dear Lefteous,

Today I installed your plugin for a very tricky task:
There are tens of thousands of scanned PDFs; many of them contain text, some of them don't. This depends on the various scanners that were used over the years.
I now have to filter out the non-text PDFs in order to OCR them. Can I do this with your plugin?

ys

HHK

Usher · Post by *Usher » 2020-01-21, 19:40 UTC

2hhk
You should test another plugin: http://totalcmd.net/plugring/pdfOCR.html

nsp · Post by *nsp » 2020-01-22, 07:09 UTC

What you can do with xPDFSearch is find the files that have almost no text (fewer than 10 characters in the following sample) and build a list to send to your OCR software.
In the Search dialog, search for PDF files and, on the Plugins tab, add:

Code:

xpdfsearch text !regexp  .{10,}
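As I read it, the rule above keeps a PDF only when its extracted text does not match `.{10,}`, i.e. when it contains no run of at least 10 characters. The same test can be sketched in Python to see which texts would be flagged. This is only an illustration: the function name `needs_ocr` is made up, and Python's `.` does not match line breaks, which may or may not be how Total Commander's regex engine treats the plugin's text field.

```python
import re

# Same pattern as in the search rule: a run of at least 10 characters.
HAS_TEXT = re.compile(r".{10,}")


def needs_ocr(extracted_text: str) -> bool:
    """Return True when no run of 10+ characters exists (mirrors "!regexp")."""
    return HAS_TEXT.search(extracted_text) is None


print(needs_ocr(""))                          # True: no text at all
print(needs_ocr("Page 1"))                    # True: only 6 characters
print(needs_ocr("Invoice 2019-04-16 total"))  # False: plenty of text
```

Raising the threshold (for example `.{50,}`) would also catch scans that only carry a short stamp or page number as text.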

If you know which producer or application created the image-only PDFs, you can also search for it using the dedicated properties (PDF Producer / Application).

Once you have found the files to process with OCR, you can feed them to the listbox. From the listbox, you can also save the list to a dedicated folder of virtual-panel or to a file. Once that is done, you can process the files one by one using a button/user command that calls your OCR engine, or all at once using TCBL.

If your OCR process takes time and/or needs manual validation, processing the files one by one is the best choice for you. virtual-panel can help you keep track of the files that have not been processed yet.
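The one-by-one workflow could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the saved list is plain text with one full path per line, and `my-ocr-tool` with its two arguments is a made-up placeholder for whatever OCR engine is actually used.

```python
from pathlib import PureWindowsPath


def pending_files(list_text, already_done):
    """Return paths from a saved file list that have not been processed yet."""
    paths = [line.strip() for line in list_text.splitlines() if line.strip()]
    return [p for p in paths if p not in already_done]


def ocr_command(pdf_path, ocr_exe="my-ocr-tool"):
    """Build the command line for one file; writes next to the source PDF."""
    # "my-ocr-tool" and its argument order are placeholders, not a real CLI.
    src = PureWindowsPath(pdf_path)
    dst = src.with_name(src.stem + "_ocr.pdf")
    return [ocr_exe, str(src), str(dst)]


# Assumed list format: one full path per line.
saved_list = "C:\\scans\\a.pdf\nC:\\scans\\b.pdf\n"
done = {"C:\\scans\\a.pdf"}

for path in pending_files(saved_list, done):
    # Pass the returned list to subprocess.run(...) to actually run the tool.
    print(ocr_command(path))
```

Keeping the set of finished files separate from the list makes it easy to stop and resume a long OCR run.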

The PdfOCR wcx is fine for extracting text only from images, but it will not help to rebuild a quality PDF with layout, text indentation, formatting and so on. I personally use it to extract specific information from PDFs which do not support cut/paste.

Usher · Post by *Usher » 2020-01-22, 09:34 UTC

nsp wrote: 2020-01-22, 07:09 UTC The PdfOCR wcx is fine to extract text only from image but will not help to rebuild a quality pdf with schema text indentation, format

That's a complete misunderstanding. I mean the WDX, a content plugin. Please read the linked webpage:

pdfOCR 0.9 wdx wrote: • Purpose:
pdfOCR is a wdx plugin that discovers how many pages of a PDF file in the current directory need character recognition (OCR), i.e. how many pages in the PDF file have no searchable text in their layout.
(...)
• Possible usage:
- discover pdf documents which need to be OCR-ed for the first time
- discover PDF documents which are password protected and consequently not available for OCR processing
- discover PDF documents that were not properly OCR processed because of low resolution or similar causes
- discover PDF documents not properly formatted.

See also the linked image: http://wincmd.ru/files/9924358/prezentacija_mala.jpg

burstx · Post by *burstx » 2020-02-06, 09:51 UTC

I've just installed the latest TC 9.50 (x64) and tried to install xPDFSearch plugin downloaded from the official TC plugins page (the actual link to the plugin file).

Big issue
This is a sample PDF containing the text "PXContext", which is not found with the xPDFSearch plugin in use.

Small issue (perhaps this is the reason for the "big issue" described above)
If I open the plugin's .zip file inside TC (i.e. in the file panel), TC offers to install the plugin and the plugin is installed.
If I register the plugin via TC's "Configuration => Options..." menu, in the "Plugins => Content Plugins (.WDX)" section, this error is shown:
Image: https://i.imgur.com/OcqUExd.png, although the plugin is claimed to be x32+x64-compatible.

Could you please check whether there is a problem with the plugin or whether I configured/used it incorrectly?

burstx · Post by *burstx » 2020-03-05, 08:49 UTC

Sorry, the big issue was my fault. I didn't RTFM. But the small issue remains.

Post by *ghisler(Author) » 2020-03-05, 10:05 UTC

Which file did you pick in the "Content Plugins (.WDX)" section?

buckauction · Post by *buckauction » 2020-04-08, 14:47 UTC

I can find no information on how to install this plugin. I have unzipped it to a secondary folder, but there is still no info. There is a batch file and an exe file, with no info either. Please explain the install process and how to work the program. Thanks, it looks great.

buckauction · Post by *buckauction » 2020-04-08, 14:48 UTC

Also, what is "WDX"?

Post by *petermad » 2020-04-08, 15:50 UTC

buckauction wrote: 2020-04-08, 14:48 UTC What also is "WDX"?

WDX is the file extension of content plugins for TC.

TC supports four types of plugins:
Packer plugins (WCX)
File System plugins (WFX)
Lister plugins (WLX)
Content plugins (WDX)

Help wrote: Configuration - Plugins

Change settings for all supported plugin types.

Download new plugins from ghisler.com
Connects to the page where you can download plugins which were tested by us.

Packer plugins Allows you to configure packer plugins. Usage: Files - Pack.

File system plugins Allows you to configure file system plugins. They allow to access file systems or similar devices or systems, e.g. a PocketPC, a Linux partition, or a remote server. File system plugins are used via the Network Neighborhood.

Lister plugins Allows you to configure Lister plugins. Usage: F3 on a supported file.

Content plugins Allows you to configure content plugins. Usage: Show - custom columns, multi-rename tool, search function.

FS-Plugins Allows you the installation of file system plugins. You can find them on www.ghisler.com in the addons section.

amalia · Post by *amalia » 2020-05-01, 17:01 UTC

I cannot use xPDFSearch to find Greek words in PDF documents. Is there a solution?

pmicchow · Post by *pmicchow » 2021-03-08, 07:21 UTC

Dear Sir/Madam

I am researching a book which I have downloaded in both Word and PDF formats. My research requires me to extract passages from the book that contain two groups (strings) of words.
The following are two examples:

Example 1
First group (string): not only
Second group (string): but also
Relevant text 1:
He is not only intelligent but also funny.
Relevant text 2:
Mr X is not only an actor but also a philanthropist.

Example 2
First group: scarcely
Second group: when
Relevant text 1:
I had scarcely walked in the door when I got an urgent call and had to run right back out again.
Relevant text 2:
Scarcely had the teacher seen the student when he started studying.

My question is: how can I extract the relevant text containing the desired strings of words, which normally consist of two groups as demonstrated in the two examples above? Preferably, I would like to receive instructions on how to do so for both a Word document and a PDF document.
I would like to thank you in advance.

Regards
Preston Chow
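One possible approach outside of xPDFSearch, once the book's text has been exported to plain text, is a small script that keeps every sentence in which the first group appears before the second. The sketch below is only an illustration: the function name `sentences_with_both` is made up, and the sentence splitting is deliberately naive (it breaks at `.`, `!` or `?` followed by whitespace, so abbreviations such as "Mr." would be split incorrectly).

```python
import re


def sentences_with_both(text, first, second):
    """Return sentences where `first` occurs before `second` (case-insensitive)."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(
        r"\b" + re.escape(first) + r"\b.*\b" + re.escape(second) + r"\b",
        re.IGNORECASE | re.DOTALL,
    )
    return [s for s in sentences if pattern.search(s)]


sample = (
    "He is not only intelligent but also funny. "
    "The weather was fine. "
    "Mr X is not only an actor but also a philanthropist."
)
for sentence in sentences_with_both(sample, "not only", "but also"):
    print(sentence)
```

With the sample above, the two "not only ... but also" sentences are printed and the weather sentence is skipped. Because matching is case-insensitive, "Scarcely had the teacher ... when ..." is found with the groups "scarcely" and "when" as well.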

Post by *white » 2023-01-10, 23:40 UTC

Moderator message from: white » 2023-01-10, 23:38 UTC

Lefteous wrote: 2023-01-10, 22:45 UTC I would propose to split this thread (by moderators) at the point where you forked the plugin.

Done.

The thread about zeeko's fork is here: xPDFSearch 1.38 - Content plugin to search text in PDF files

Lefteous · Post by *Lefteous » 2023-01-16, 23:43 UTC

2white
Thank you!
