  #1   Report Post  
Old May 22nd 05, 02:40 AM
Caveat Lector
 
Question on web spiders

Maybe not the right place, but it seems there are several web experts here.

Can web spiders read and harvest e-mail addresses from a PDF file?

Many users and sites like QRZ.com are using JPEGs, not ASCII text, for listing
e-mails -- this seems to work.

So for PDF files, without going to a JPEG --- are ASCII text addresses
harvestable?

Thanks

--
CL -- I doubt, therefore I might be !







  #2   Report Post  
Old May 22nd 05, 03:06 AM
Dave Platt
 

Maybe not the right place but seems there are several web experts here.

Can web spiders read and harvest e-mail addresses from a pdf file ?


It's certainly possible.

I don't know to what extent it's actually done. In principle, though,
any file which stores information as text can be searched (Google
apparently searches PDFs and indexes their contents, just as if they
were HTML), and a spammer/harvester could do the same thing.

I have my doubts as to whether spammers really care to go to that much
work, though.

--
Dave Platt AE6EO
Hosting the Jade Warrior home page: http://www.radagast.org/jade-warrior
I do _not_ wish to receive unsolicited commercial email, and I will
boycott any company which has the gall to send me such ads!
  #3   Report Post  
Old May 22nd 05, 04:34 AM
Bob Bob
 

At my old place of work I used to parse PDF and PS files for text quite
regularly. Not so easy on encrypted files or those PDFs from scans!

Would a spammer think it's worthwhile? I mean they could also OCR the
image files on a website. They would only need to look for an "@" sign
and all would be revealed. I am sure they'll start doing this when the
current harvest starts going downhill!
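
To illustrate the "look for an @ sign" point: once text has been pulled out of a PDF or an OCR'd image, finding addresses in it is a one-line pattern match. This is only a rough sketch. The function name `harvestAddresses` and the sample text are made up for illustration, and the pattern is a loose approximation of an address, not a full validator.

```javascript
// Hypothetical helper: pull email-like strings out of already-extracted text.
// The pattern is deliberately simple; it is not a complete address parser.
function harvestAddresses(text) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;
  const found = text.match(pattern) || [];

  // Deduplicate case-insensitively while keeping first-seen order.
  const seen = new Set();
  return found.filter((addr) => {
    const key = addr.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Made-up sample text standing in for the output of a PDF text extractor.
const sample =
  "Contact: op1@example.org or OP1@example.org; see also club@example.net.";
console.log(harvestAddresses(sample));
```

The hard part for a harvester is getting the text out in the first place (scans and encrypted files, as noted above), not recognising an address once the text is available.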

Cheers Bob VK2YQA

Caveat Lector wrote:
Maybe not the right place but seems there are several web experts here.

Can web spiders read and harvest e-mail addresses from a pdf file ?

  #4   Report Post  
Old May 22nd 05, 05:14 AM
John Smith
 

Security through obscurity is a poor answer, and even poorer security....

I wouldn't wait for the spammers to "get stupid", "run out of ideas", "end
up in prison", etc...

You are best off actively taking part in providing your own protection.
There are many freeware/shareware/commercial programs which will virtually
guarantee you are spam free (one in every few thousand may leak through);
you just have to learn how to use them effectively. They let the spammers
have your email address if they want it, but you will never see the spam if
you just take the time to become computer savvy. One of the best is K9: it
is free (well, donation-ware) and guarantees no malware. If you know Perl
regular expressions it is DEADLY on spam (if not, get a Perl person to write
'em for ya; ask in a Perl newsgroup)...
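
For anyone wondering what regex-based filtering looks like in practice, here is a generic sketch of the idea. It is written in JavaScript to match the other code in this thread, it is not K9's configuration format, and the rule list and the name `looksLikeSpam` are invented for illustration. Simple patterns like these read much the same in Perl.

```javascript
// Hypothetical rule list; these patterns are examples only,
// not rules shipped with K9 or any other filter.
const spamRules = [
  /\bv[i1]agra\b/i,     // common misspelling trick: "v1agra"
  /\bfree\s+money\b/i,  // tolerate any amount of whitespace between words
  /\bact\s+now\b/i,
];

// A subject line is flagged if any single rule matches it.
function looksLikeSpam(subject) {
  return spamRules.some((rule) => rule.test(subject));
}

console.log(looksLikeSpam("FREE   money inside"));    // true
console.log(looksLikeSpam("Antenna tuner question")); // false
```

Hand-written rules like this are only one ingredient. They catch the patterns you have thought of and nothing else, which is why they need tuning over time.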

Warmest regards,
John

"Caveat Lector" wrote in message
news:HxQje.1450$Xh.611@fed1read07...
Maybe not the right place but seems there are several web experts here.

Can web spiders read and harvest e-mail addresses from a pdf file ?

Many users and folks like QRZ.com are using jpegs not ascii for listing
e-mails -- this seems to work.

So for pdf files without going to a jpeg --- is ascii text addresses
harvestable ?

Thanks

--
CL -- I doubt, therefore I might be !









  #5   Report Post  
Old May 22nd 05, 06:41 AM
Richard Clark
 

On Sat, 21 May 2005 17:40:07 -0700, "Caveat Lector"
wrote:
Can web spiders read and harvest e-mail addresses from a pdf file ?


Hi OM,

Yup, my Robots sure could at a couple of MB/min.

PDF is simply a proprietary document format (as is Word).

73's
Richard Clark, KB7QHC


  #6   Report Post  
Old May 22nd 05, 06:54 PM
Roger Leone
 

I have been using a JavaScript tool that obfuscates my email address but
still allows visitors to my website to email me. The tool I am using came
from the Hivelogic website but seems to have been removed. A possible
substitute (which I haven't tried myself) is at: http://leon.mvps.org/Encoder/

If you do a Google search using words like "email encryption" you will
probably find links to a number of different applications.

73,

Roger K6XQ


  #7   Report Post  
Old May 24th 05, 01:26 AM
Mr. Man with the Master Plan
 

You can protect text in PDF format. I do it all the time so people can't copy
my work. Now they have to retype it.


"Caveat Lector" wrote in message
news:HxQje.1450$Xh.611@fed1read07...
Maybe not the right place but seems there are several web experts here.

Can web spiders read and harvest e-mail addresses from a pdf file ?

Many users and folks like QRZ.com are using jpegs not ascii for listing
e-mails -- this seems to work.

So for pdf files without going to a jpeg --- is ascii text addresses
harvestable ?

Thanks

--
CL -- I doubt, therefore I might be !









  #8   Report Post  
Old May 24th 05, 01:44 AM
John Smith
 

You are gravely mistaken if you think .pdf protects text at all... name a
small (10 pages or less) .pdf doc and I will make a Word doc of it in
minutes... .pdf is NOT a security format...

Warmest regards,
John

"Mr. Man with the Master Plan" wrote in message
...
you can protect text in PDF format. I do it all the time so people cant
copy my work. Now they have to retype it


  #9   Report Post  
Old May 24th 05, 09:07 PM
Roger Leone
 

I found the tool that used to be on the Hivelogic website:
http://automaticlabs.com/products/enkoder

You can either download it and run it on your PC or you can run it online.
Either way, the JavaScript code it produces looks like this:

<script type="text/javascript">
//<![CDATA[
function hiveware_enkoder(){var i,j,x,y,x=
"x=\"783d2232517d783635363d5c2236343230343666363637363337323666353664366432"
+
"36353630373365373437303632653731366437373232653536393763323234363533653432"
+
"38323336633233633639363631323336623036383230363732363836353536363732363364"
+
"35353230633232373436366436663230313639373336366337353665343666363432336137"
+
"30366432363136353236633630363135366636653236653630363535343036643637333631"
+
"36393236333663323637363133636336663266363632363133653136633232323265363933"
+
"62653635333033373435625c223b633232793d27323037273b663436396f7228373436693d"
+

This is actually only about half of the code. I doubt a spider would find
my address in there.

You then paste this code into your HTML.
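
For the curious, here is a much simplified sketch of the general idea behind this kind of obfuscation. It is not the actual Enkoder algorithm, which I have not examined. The function names `toHex` and `fromHex` and the sample address are made up. The only thing taken from the snippet above is that its payload looks like hexadecimal character codes: it starts with 783d22, which is the hex for x=".

```javascript
// Simplified illustration only: this is NOT the Enkoder algorithm.
// Assumes plain ASCII input (every character code fits in two hex digits).

// Turn each character into its two-digit hexadecimal character code.
function toHex(text) {
  return Array.from(text, (ch) =>
    ch.charCodeAt(0).toString(16).padStart(2, "0")
  ).join("");
}

// Reverse the process: read two hex digits at a time and rebuild the text.
function fromHex(hex) {
  let out = "";
  for (let i = 0; i < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16));
  }
  return out;
}

// Made-up address for the example.
const link = '<a href="mailto:someone@example.org">Email me</a>';
const encoded = toHex(link);

console.log(encoded.includes("@"));     // false: no "@" in the page source
console.log(fromHex(encoded) === link); // true: the browser can rebuild the link
// In a web page, the decoded string would then be written into the document.
```

A spider that only scans the raw HTML for "@" will miss an address hidden this way. One that actually runs the page's script would see the decoded link like any visitor does, so this raises the cost of harvesting and does not make the address secret.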

73,

Roger K6XQ


  #10   Report Post  
Old May 25th 05, 06:16 AM
Mr. Man with the Master Plan
 

If you know how to use Adobe, yes, you can block it from printing out to
Word as saved text.

My company produces competitive intelligence and we construct PDFs in a way
that you can NOT export the text in any way, except if you want to type it
out by hand.

If you want to take up the challenge, let me know.

We can meet on AIM or YIM. I can send you a PDF sample via file transfer and
see how long it takes for you to crack the PDF. If it is a software trick,
the size of the PDF won't matter.

I can type 10 pages of text in about 12 minutes too, if that's what you were
thinking.

Otherwise, get a webcam or fly to New York and let me see how you do it
snappy.


"John Smith" wrote in message
news
You are gravely mistaken if you think .pdf protects text at all... name a
small (10 pages or less) .pdf doc and I will make a Word doc of it in
minutes... .pdf is NOT a security format...

Warmest regards,
John
