From abellmt@spsp.net Wed May 14 22:56:43 2003
From: abellmt@spsp.net (Martin Abell)
Date: Wed, 14 May 2003 17:56:43 -0400
Subject: [SearchEngine] Indexing short circuited.
Message-ID:

Doug,

Wondered if there was some sorting of directories that might cause some
files to be missed.

Was trying to index a small site and some files seemed to be skipped.
Searching on site verifies they are not found. Terminal screen is attached.
(Listing is truncated on left to avoid some wrapping.)

Any thoughts?

Martin Abell
SpeedSpan

==========================
ls -asl
lookman 7145 May 14 16:09 LMR_10.shtml
lookman 12688 May 14 16:10 LMR_12.shtml
lookman 14908 May 14 16:07 LMR_13_final.shtml
lookman 18008 May 14 16:07 LMR_13.shtml
lookman 12865 May 14 16:07 LMR_14_rev1.shtml
lookman 13910 May 14 16:08 LMR_15_rev1.shtml
lookman 12848 May 14 16:08 LMR_16_rev1.shtml
lookman 20827 Mar 18 16:15 LMR_21_Super_Bowl_2003.shtml
lookman 10783 May 14 11:31 LMR_Week_17_rev1.shtml
lookman 8122 May 14 11:32 LMR_Week_18_rev1.shtml
lookman 17508 May 14 16:34 LMR_Wk_20.rev1_SB_03_prvw.shtml
[root@sp11 articles]# cd /home/lookman/searchengine/data/
[root@sp11 data]# ../bin/Search index.xml -i http://www.lookmanreport.com/ -d www.lookmanreport.com
Indexing http://www.lookmanreport.com/...done.
Indexing http://www.lookmanreport.com/articles/LMR_21_Super_Bowl_2003.shtml...done.
Indexing http://www.lookmanreport.com/articles/LMR_Week_17_rev1.shtml...done.
Indexing http://www.lookmanreport.com/articles/LMR_Week_18_rev1.shtml...done.
Indexing http://www.lookmanreport.com/articles/LMR_Wk_20_rev1_SB_03_prvw.shtml...done

From abellmt@spsp.net Thu May 15 01:35:34 2003
From: abellmt@spsp.net (Martin Abell)
Date: Wed, 14 May 2003 20:35:34 -0400
Subject: [SearchEngine] Regarding my earlier: Indexing short circuited.
In-Reply-To:
Message-ID:

Hi again,

Probably should have mentioned in the earlier email that there is only an
index.html page and the (slightly edited) listing I pasted in is of the
articles/ directory.
There are 11 "articles" there, but only 4 seem to get indexed.

One other thought occurred to me: maybe the Search program can't read the
files because the line endings are messed. So I ran the dos2unix utility on
one of the files and ran the index again, but that file still didn't get
picked up. Which is not to say there still might be something messed with
them. BTW, all the permissions (which I lopped off the listing) are the
same for all the articles.

Still thinking on it,

Martin
SpeedSpan

------ Forwarded Message
> From: Martin Abell
> Date: Wed, 14 May 2003 17:56:43 -0400
> To:
> Subject: Indexing short circuited.
>
> Doug,
>
> Wondered if there was some sorting of directories that might cause some
> files to be missed.
>
> Was trying to index a small site and some files seemed to be skipped.
> Searching on site verifies they are not found. Terminal screen is attached.
> (Listing is truncated on left to avoid some wrapping.)
>
> Any thoughts?
>
> Martin Abell
> SpeedSpan
>
> ==========================
> ls -asl
> lookman 7145 May 14 16:09 LMR_10.shtml
> lookman 12688 May 14 16:10 LMR_12.shtml
> lookman 14908 May 14 16:07 LMR_13_final.shtml
> lookman 18008 May 14 16:07 LMR_13.shtml
> lookman 12865 May 14 16:07 LMR_14_rev1.shtml
> lookman 13910 May 14 16:08 LMR_15_rev1.shtml
> lookman 12848 May 14 16:08 LMR_16_rev1.shtml
> lookman 20827 Mar 18 16:15 LMR_21_Super_Bowl_2003.shtml
> lookman 10783 May 14 11:31 LMR_Week_17_rev1.shtml
> lookman 8122 May 14 11:32 LMR_Week_18_rev1.shtml
> lookman 17508 May 14 16:34 LMR_Wk_20.rev1_SB_03_prvw.shtml
> [root@sp11 articles]# cd /home/lookman/searchengine/data/
> [root@sp11 data]# ../bin/Search index.xml -i http://www.lookmanreport.com/
> -d www.lookmanreport.com
> Indexing http://www.lookmanreport.com/...done.
> Indexing
> http://www.lookmanreport.com/articles/LMR_21_Super_Bowl_2003.shtml...done.
> Indexing
> http://www.lookmanreport.com/articles/LMR_Week_17_rev1.shtml...done.
> Indexing
> http://www.lookmanreport.com/articles/LMR_Week_18_rev1.shtml...done.
> Indexing
> http://www.lookmanreport.com/articles/LMR_Wk_20_rev1_SB_03_prvw.shtml...done
> .
> Indexing http://www.lookmanreport.com/index.html...done.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: searchengine-unsubscribe@douglasthrift.net
> For additional commands, e-mail: searchengine-help@douglasthrift.net
>
------ End of Forwarded Message

From douglaswth@earthlink.net Thu May 15 06:57:00 2003
From: douglaswth@earthlink.net (Douglas William Thrift)
Date: Wed, 14 May 2003 22:57:00 -0700
Subject: [SearchEngine] Re: Regarding my earlier: Indexing short circuited.
References:
Message-ID: <001901c31aa6$f65e62e0$0100a8c0@mshome.net>

Hello Martin,

Usually when something funny happens I turn on the debug mode ("-D"), which
prints some more information to stderr. The indexer can handle different
line endings okay with HTTP.

I ran the indexer here with your options and debug on, and there is nothing
unusual except that it doesn't seem to be picking up any links to the other
pages. Then I looked at the home page in my browser, and it looks like the
links for Weeks 10 through 16 are all pointing to the week 17 page. :)

_______________________________________________________________________
Douglas William Thrift

----- Original Message -----
From: "Martin Abell"
To:
Sent: Wednesday, May 14, 2003 5:35 PM
Subject: Regarding my earlier: Indexing short circuited.

> Hi again,
>
> Probably should have mentioned in the earlier email that there is only an
> index.html page and the (slightly edited) listing I pasted in is of the
> articles/ directory. There are 11 "articles" there, but only 4 seem to get
> indexed.
>
> One other thought occurred to me: maybe the Search program can't read the
> files because the line endings are messed.
> So I ran the dos2unix utility on
> one of the files and ran the index again, but that file still didn't get
> picked up. Which is not to say there still might be something messed with
> them. BTW, all the permissions (which I lopped off the listing) are the
> same for all the articles.
>
> Still thinking on it,
>
> Martin
> SpeedSpan
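
The diagnosis in the reply above (several home-page links resolving to the
same article, so the other files are never reached) can be checked without
opening a browser. The following is only a rough sketch in Python, assuming
the indexer discovers pages by following links from the start URL. The
names LinkCollector, shared_targets and SAMPLE are made up for this
illustration and are not part of the Search program, and the sample markup
is hypothetical, not the real home page.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (resolved URL, link text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # finished (url, text) pairs
        self._current = None   # [url, text] of the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                self._current = [urljoin(self.base_url, href), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            url, text = self._current
            self.links.append((url, " ".join(text.split())))
            self._current = None


def shared_targets(html, base_url):
    """Return {url: [link texts]} for URLs that more than one link points to."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    by_url = defaultdict(list)
    for url, text in parser.links:
        by_url[url].append(text)
    return {url: texts for url, texts in by_url.items() if len(texts) > 1}


# Hypothetical markup, invented for illustration only.
SAMPLE = """
<a href="articles/LMR_Week_17_rev1.shtml">Week 15</a>
<a href="articles/LMR_Week_17_rev1.shtml">Week 16</a>
<a href="articles/LMR_Week_17_rev1.shtml">Week 17</a>
<a href="articles/LMR_Week_18_rev1.shtml">Week 18</a>
"""

if __name__ == "__main__":
    found = shared_targets(SAMPLE, "http://www.lookmanreport.com/")
    for url, texts in found.items():
        print(url, "<-", texts)
```

Any URL reported with several different link texts is a candidate for a
copy-and-paste mistake in the page. Files that exist in articles/ but are
never the target of any link would not be visited by a crawler that only
follows links.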