Approximate location of current lecture

 

Course Syllabus and Home Page

Text Mining, Processing, and the Internet 

(Cpr E 587)

 

Spring 2005: MWF 12:10-1:00 in Coover 3126

 



Instructor:

Daniel Berleant

Coover Hall 3215

294-3959

berleant@iastate.edu

home page http://class.ee.iastate.edu/berleant/home

Course Description: the flyer and email; official description page 1 and page 2; registration data

Do an interesting project:

o    Build a basic Web browser or search engine, etc.

o    Use/learn Java or another language of your choice

o    Projects can be done individually or in teams as preferred by the student  

o    Students without programming experience may do a non-programming term project

Read/discuss the latest scientific literature on

o    Interacting with text

o    Text mining over the Web

o    Modern topics in information retrieval

Here are some answers to common questions about the course.

 

Course Objectives

At a minimum:

Prerequisites

Please see me if you have any doubts about your interests or background. Many students will enjoy the opportunity to do an interesting Web-related project and some will want to use this as an opportunity to learn a new language for this purpose. However, you should also be willing to take a serious interest in cutting edge knowledge and research results on text processing and text mining.

 

Textbook  

R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press and Addison-Wesley. Paperback - 513 pages 1 edition (May 1999) Addison-Wesley Pub Co; ISBN: 020139829X

 

Scholarly papers will also be assigned from such forums as

Related courses: http://www.ececs.uc.edu/~annexste/Courses/cs690/

http://blondie.cs.byu.edu/CS652

www.isi.edu/info-agents/courses/csci599

 

 

Grading  (click for current grades)

·    Minimum grades will be 50% (e.g. if you don't hand it in), and maximum will be 100%. Lateness will incur a penalty of 10% for 1 day or less, and more for longer periods. Also, once a quiz or HW question has been discussed in class, it may no longer be possible to get credit for it.

·    No grades will normally be dropped, but it is possible to not count a missed quiz or small assignment on rare occasions such as illness.

·    There will be regular reading assignments, so that we can discuss things in class. When the instructor was a grad student he occasionally neglected to read papers before class, and therefore did not learn much on those days. We do not want this misfortune to befall you! Therefore there will often be short quizzes at the beginning of class on the readings, usually worth up to 100 pts. each. (This can also help provide motivation to get to class on time.)

·    Homework exercises may be assigned as needed to understand the material. Usually they will be worth up to 100 pts. each.

·    There will be a term project which will count for half of the points in the course.

·    Letter grades will be assigned as follows:

A (95-100%), A- (90-95%), B+ (86.67-90%), B (83.33-86.67%), B- (80-83.33%), C+ (76.67-80%), C (73.33-76.67%), C- (70-73.33%), D+ (66.67-70%), D (63.33-66.67%), D- (60-63.33%), F (50-60%)  

 

If your grade is ambiguous (e.g. exactly 95% is listed as both A and A- above) you will get the higher one. No rounding will occur.
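
As a rough illustration of how the scale and the tie rule fit together, here is a minimal Python sketch. The function name letter_grade and the table BANDS are made up for this example, and the cutoffs 86.67, 83.33, etc. are used exactly as printed above rather than as exact thirds.

```python
# Lower bound of each letter band, highest first (copied from the scale above).
BANDS = [
    (95.0, "A"), (90.0, "A-"),
    (86.67, "B+"), (83.33, "B"), (80.0, "B-"),
    (76.67, "C+"), (73.33, "C"), (70.0, "C-"),
    (66.67, "D+"), (63.33, "D"), (60.0, "D-"),
]


def letter_grade(percent):
    """Map a course percentage to a letter grade.

    A score sitting exactly on a boundary gets the higher letter,
    and the raw percentage is compared without rounding.
    """
    for lower_bound, letter in BANDS:
        if percent >= lower_bound:
            return letter
    return "F"
```

For example, letter_grade(95.0) gives "A" while letter_grade(94.99) gives "A-", since no rounding occurs.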

 

Budgeting your time 

Plan on spending, at a minimum, roughly 3 hours/week in class, 3 hours/week working on your project, and 3 hours/week studying for class.

 

Term project: 

You will have to give a 5-7 minute presentation describing your project. This will substitute for the HW due the day you give the presentation.

 

You will also need to give me a demonstration before the end of exam week.

 

Some possible topics:

o    A prototype reader interaction system demonstrating a combination of models that has not been done before (see http://class.ee.iastate.edu/berleant/home/me/cv/papers/cikmBerleant2000.htm). A 2-page short paper for submission to the ACM Hypertext conference would be part of the project; see http://class.ee.iastate.edu/berleant/home/me/cv/papers/ht04.pdf for an example paper about a system built mainly by senior design students.

o    A Web site usability analysis. This would not necessarily involve programming (check with instructor for details).

o    A Web server log navigation aid or analyzer.

o    An information extraction based Web browser.

o    Some other kind of Web browser. It should have something novel about it.

o    A general-purpose search engine result extractor: given the returned results from "any" search engine, extract a list of the URLs it returned, removing everything else (comments, advertisements, etc.).

o    A special purpose search engine.
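
To give a feel for the search engine result extractor topic above, here is a minimal Python sketch built on the standard html.parser module. It is only a starting point under stated assumptions: the names LinkCollector and extract_result_urls are invented for this example, and the rule "a link back to the engine's own host is not a result" is an assumed heuristic that real result pages may well defeat.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_result_urls(html, engine_host):
    """Return absolute http(s) links that do not point back at the engine."""
    collector = LinkCollector()
    collector.feed(html)
    seen, results = set(), []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme not in ("http", "https"):
            continue  # skip relative links, mailto:, javascript:, etc.
        if parts.netloc.endswith(engine_host):
            continue  # assumed heuristic: the engine's own pages are not results
        if link not in seen:  # keep first occurrence only, preserving order
            seen.add(link)
            results.append(link)
    return results
```

A project along these lines would still need to deal with redirect-style links, advertisements hosted elsewhere, and the differences between engines.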

 

Topics that may be covered (notes are updated for this semester from the previous semester by class time)

Note: if you bring a topic, paper, book section, or URL(s) to my attention and it’s relevant we can probably read and discuss it in class. This is encouraged!

 

M 1/10/05: Review of course;  HW due next time (ser. #1)

W 1/12/05: Discuss the head-tail display paper and the outliner and pager systems (ser. #2)

                   Head-tail display talk

                   HW for next time: read the paper "Models for reader interaction systems"

                          Historical notes: More questions about the course; Language form; Quiz in 2004

F 1/14/05: Models for Reader Interaction Systems. (ser. #23)

                 HW: for next time 

W 1/19/05: Information Extraction and Extracting Browsers (ser. #3)

                  Fill out student information sheet

                  HW for next time

F 1/21/05:  Discuss Byrd (ser. #4)

                  HW for next time

M 1/24/05  Discuss TileBars: compare with head-tail display (ser. #5)

                   Searching the Web with Speckled TileBars

                   HW for next time: Check out http://www.pnl.gov/infoviz/ and provide written questions and comments.

                   Continue work on term project and provide summary.

W 1/26/05: Survey of text visualization systems (ser. #6)

                   HW for next time

F 1/28/05: Baeza-Yates and Ribeiro-Neto chapter 10 (by Hearst) (ser. #7)

               HW for next time

Survey on course work load

M 1/31/05: Web browsers (ser. #8)  

                  HW for next time

W 2/2/05: A simple Web browser, and the code, and a sample HW found on the Web (don't do it)(ser. #9)

                 HW due Monday (not next time); Don't do this HW from last time

F 2/4/05: Notes on Search engines(ser. #10)

                   For next time: read textbook on indexing and searching, sections 8.1, 8.2, & 8.3; don't do this HW  from last year

M 2/7/05: Inverted Files/Indexes/Indices/Lists(ser. #11)

                   HW for next time: see top of  lecture notes
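
For orientation on the inverted file idea, here is a minimal Python sketch: each word maps to the list of documents that contain it, and a multi-word query is answered by intersecting those lists. The function names and the simple tokenizing rule (lowercase runs of letters and digits) are assumptions for this example; the textbook's section 8.2 covers real index structures in far more detail.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every word of the query (Boolean AND)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings))
```

This version records only which documents contain a word; storing word positions as well is what makes phrase and proximity queries possible.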

W 2/9/05: Finish Inverted Files/Indexes/Indices/Lists (ser. #12)

                HW due Friday 2/11/05:  review section 8.3. Describe what you have done

                on the project since last report, and what you plan to do next.

F 2/11/05 : Suffix trees for searching and indexing(ser. #13)

                   HW For next time (see course notes, handout, and textbook section 8.3)

                      (1) Report on your project: what have you done since Monday 2/9/04 and what is planned next?

                      (2) Draw a suffix tree for the string "The sky is blue, blue is the blue sky"
                            but ignore case and punctuation

                      (3) What takes less space for the same data, a patricia tree or a trie?
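
As background for this exercise, here is a minimal Python sketch of a word-level suffix trie, using a different example string than the one assigned. Note that this is the uncompacted trie, not a suffix tree proper: a suffix tree merges each chain of single-child nodes into one edge. The function names and the use of "$" as an end marker are assumptions made for this example, so the sketch would misbehave on text that itself contains the word "$".

```python
def build_suffix_trie(words):
    """Insert every suffix of the word list into a nested-dict trie.

    Each node is a dict from word to child node. The key "$" stores the
    starting position of the suffix that ends at that node.
    """
    root = {}
    for start in range(len(words)):
        node = root
        for word in words[start:]:
            node = node.setdefault(word, {})
        node["$"] = start
    return root


def find_occurrences(trie, phrase):
    """Return the sorted start positions at which the phrase occurs."""
    node = trie
    for word in phrase:
        if word not in node:
            return []
        node = node[word]
    # Every suffix stored below this node begins with the phrase.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)
```

For the words of "to be or not to be", looking up the phrase "to be" returns positions 0 and 4.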

M 2/14/05: Query languages(ser. #14)

                  HW

                  Instructor notes

W 2/16/05: Retrieval models instructor notes (ser. #15)

                  HW for next time: about the project, state: (What you did since Monday,

                  giving details, code if available, etc.) OR (details of plans for what you

                  plan to do next).

F 2/18/05: The vector model; see 2nd half of previous lecture (starting with "section 2.5.3"), and these notes (ser. #16)

                 HW: prepare a 5-minute presentation per student on your project for next Friday.

                 Use transparencies and print on them with a printer (not by hand). Cover

                 what the project goal is, what you have done so far, and some technical

                detail but not too much since you only have 5 minutes. You will hand in the

                 transparencies later.

M 2/21/05: The vector model's cosine function and other comparison functions (ser. #16b)

                  HW for next Monday
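
To make the cosine function concrete, here is a minimal Python sketch that compares two texts as term vectors. It assumes raw term counts as the weights and whitespace tokenizing, which is simpler than the tf-idf weighting the textbook uses; the function name is made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two texts with no words in common score 0, and a text compared with itself scores 1 (up to floating-point error), regardless of length.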

W 2/23/05: Searching and Serving Biological Texts: Generalizing the Lessons (ser. #85)

F 2/25/05: Student presentations on projects, 5 min. per student (2-person projects: 10 min. total)

M 2/28/05: Discuss Web search engines, chapter 13, sections 13.1-13.4;  (ser. #18) background - you may wish to see for example the original Google papers (on citeseer) by Brin and Page.
                  HW for next time: read sections 13.1 through 13.4. Provide written comments on each section. Total should be about 1 page (11 pt. font).

W 3/2/05: Search engines (ranking and crawling algorithms, Baeza-Yates sections 13.4.4-13.4.5)  (ser. #18a)

                 Instructor note (offline)

                 HW for next time: read sections 13.4.4-13.4.5, providing written comments. Give progress report on your project.

F 3/4/05: Google and its PageRank algorithm: "The anatomy of a large-scale hypertextual Web search engine," e.g. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm  (ser. #19)

                 HW for Monday: handed out in class. Old HW from previous years - surf the Web to find out about co-clustering; hand in 10 bullet-style interesting facts that you have found (or a 1-page coherent discussion); be ready to state a fact or two to the class. I will also search the Web, but will collect more than 10 bullet points.
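
As a companion to the PageRank reading, here is a minimal Python sketch of the iterative computation. It is an illustration under stated assumptions, not the paper's implementation: it uses the normalized variant in which the ranks sum to 1, it spreads the rank of pages with no out-links evenly over all pages, and it runs a fixed number of iterations. The damping value 0.85 is the one the paper suggests; the function and parameter names are made up here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute ranks. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling of a page with no out-links:
                # spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In a three-page example where a links to b and c, b links to c, and c links to a, page c ends up ranked above page b, because it collects rank from both of the other pages.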

M 3/7/05: Co-clustering (ser. #19a); PathBinder, an example of Text processing in bioinformatics (ser. #19g)  and hardcopy of demo (ser. #19g1)

W 3/9/05: Recall, Precision, Effectiveness, and Averages  (ser. #19f) Measures of information retrieval performance; (ser. #30)  (2002's quiz)

HW for next time
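
For reference, here is a minimal Python sketch of recall and precision for a single query, plus the harmonic mean as one common way to average the two. The function names are made up for this example, and treating an empty retrieved or relevant set as scoring 0 is an assumption, not something taken from the course notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def harmonic_mean(precision, recall):
    """Combine precision and recall; high only when both are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

If a system retrieves documents 1, 2, 3, 4 and the relevant ones are 2, 4, 6, then precision is 2/4 and recall is 2/3.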

F 3/11/05: Ontologies and the Semantic Web; based on Hendler, Agents and the Semantic Web. On-line resources include www.w3.org/2001/sw, www.daml.org, www.cs.umd.edu/projects/plus/SHOE, and www.semanticweb.org.  (ser. #19b)  * if there is more interest on agents, here would be a good time to schedule another class on it

                Last year's HW: Read up on RDF and/or OWL. Make a list of 7 facts and 2 questions to hand in.

M 3/21/05: RDF and OWL (ser. #19c)

                HW for next time: find out what wrappers are, and describe them. Find out what wrapper induction is, and describe it. You can use the Web. Some sources you might consider using are: Adelberg and Denny, "Building robust wrappers for text sources"; Hammer, Garcia-Molina, Aranha, and Crespo, "Extracting semistructured information from the web"; Kushmerick, Weld, and Doorenbos, "Wrapper induction for information extraction"; etc.

W 3/23/05: Wrappers and wrapper induction  (ser. #19d) - see Kushmerick, Weld, and Doorenbos, "Wrapper induction for information extraction"; see also Adelberg and Denny, "Building robust wrappers for text sources"; Hammer, Garcia-Molina, Aranha, and Crespo, "Extracting semistructured information from the web"; etc. * Here would be a good place for a class on intelligent Web mining (e.g. for biological interactions?)

                    HW: read paper handed out for next time. Provide written comments.

F 3/25/05: ShopperBots (ser. #19e)

               HW: Report on what you have done on your term project since last report. Give details. Describe what you plan to do next.

 

New Unit: Information Customization and Personalization
 

M 3/28/05: Preliminaries - Interaction Instruments (ser. #19h)

                         HW: read the paper, handed out in class, and hand in written comments on it for next time
W 3/30/05: Continuation of last time's discussion.  (ser. #19i)

             HW: for two different text editors, pick some functionality and rate how good the user interface

              is based on the Shneiderman principles discussed in the course notes. Then suggest some improvement.

             Also, for your project, describe status and what is next.

F 4/1/05: Advanced interfaces, Web search, and MultiBrowser;  (ser. #19j) Naren suggests the following related sources:
             * Dialogue, e.g. AMITIES, CLARITY, CONVERSE
             * my survey on personalizing interactions w/ information systems http://people.cs.vt.edu/~ramakris/papers/piis.pdf (section 4 is relevant)
             * personalization and mixed initiative interaction http://people.cs.vt.edu/~ramakris/papers/itpro.pdf
             * SALT website http://www.saltforum.org/default.asp
             * you might also be interested in the following paper that came in IUI: http://www.cs.washington.edu/homes/pedrod/papers/iui01.pdf
             HW for next time: Scan sections 1 and 4 of http://people.cs.vt.edu/~ramakris/papers/piis.pdf

             for useful information. Then either (1) describe a possible way to improve some existing

             Web or other tool based on the paper, or (2) describe a way to change the specs of your term

             project based on the paper (but you don't have to implement it - this is just a HW assignment),

            or (3) list 4 things in the paper that you found interesting, and why.
M 4/4/05: Information Personalization (ser. #19k)

            HW for next time: read up on http://www.loebner.net/Prizef/loebner-prize.html.

            Try out one of the systems and hand in a transcript and your personal observations/thoughts.

W 4/6/05:  Human-Computer Dialogue, the Turing Test, and the Loebner Prize; (ser. #19L) N.K. writes: I found a free alicebot java implementation - you can even configure this to log on to AOL's instant messenger, and talk with it over that! http://www.alicebot.org/downloads/ The java program is in "Program D" - you'll also need to download the Standard (or Anna if you want) AIML to give it "brains". D.K. writes: interesting URL about an experiment on Alicebot http://www.nik.com.au/alice/.
           HW for next time: prepare a 5 minute presentation on either the Turing test, the AI singularity,

           or describing how system(s) could be designed to meet one or more "Information Seeking

           Strategies" (see table 1 in http://people.cs.vt.edu/~ramakris/papers/piis.pdf). Hand in any

           presentation notes, transparencies, or other items you used in your presentation to Hu Lan.

F 4/8/05: Student presentations, proctored by Hu Lan, who will time the presentations and collect the materials.

 

New section of course: text parsing and analysis

 

M 4/11/05: String matching and the Boyer-Moore algorithm;  (ser. #73) see http://www-igm.univ-mlv.fr/~lecroq/string/ for more information.

          HW for next time: Give a trace of the Boyer-Moore algorithm

          for the pattern "ions" and the text "ionizations"
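
To help with tracing, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore. Be aware of what it is and is not: it compares right to left and shifts using only a bad-character table keyed on the text character under the pattern's last position, while the full Boyer-Moore algorithm also uses a good-suffix rule, so a trace of the full algorithm can shift differently. The function name is made up for this example.

```python
def horspool_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of the pattern except the last one, the shift is the
    # distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1  # compare right to left
        if j < 0:
            matches.append(pos)
        # Characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

Searching for "ana" in "bananas" with this sketch finds matches at positions 1 and 3 after trying only three alignments.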

W 4/13/05: Parsing and regular expressions (ser. #25)
          HW for next time: write a regular expression for integers (base 10). Also, report on what you did on your

          project since the last report, and what you plan to do next. Also, schedule with me a time to demo your project

          for up to 10 minutes by the end of APRIL.

F 4/15/05: Parsing and regular expressions II;  (ser. #26)

          HW for next time: draw a machine that recognizes an integer. Also, do something related to your project and report what it was.

M 4/18/05: Parsing and regular expressions III;  (ser. #27)

          HW is this

W 4/20/05: Parsing IV: beyond regular expressions.  (ser. #28)

          HW for next time is to: write a context-free grammar for arithmetic expressions.
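
To show what a context-free grammar buys you over a regular expression, here is a minimal Python sketch of a recursive descent parser that evaluates arithmetic expressions. The grammar in the comments is one common choice and is an assumption of this example, as are all the names; other equally valid grammars exist, and the one discussed in class may differ.

```python
import re

# Assumed grammar (one of several reasonable choices):
#   expr   -> term   (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')' | '-' factor
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(source):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """One method per nonterminal of the grammar above."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        token = self.take()
        if token == "(":
            value = self.expr()  # recursion: this is what a regex cannot do
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token == "-":
            return -self.factor()
        if token is not None and token.isdigit():
            return int(token)
        raise ValueError(f"unexpected token {token!r}")


def evaluate(source):
    """Parse and evaluate an arithmetic expression given as a string."""
    parser = Parser(tokenize(source))
    value = parser.expr()
    if parser.peek() is not None:
        raise ValueError("trailing input")
    return value
```

Splitting the grammar into expr, term, and factor is what gives multiplication higher precedence than addition, so evaluate("2 + 3 * (4 - 1)") yields 11.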

F 4/22/05: Parsing V.  (ser. #29)

 

Other Topics

 

M 4/25/05: Observe 485 demos in the 485 lab.

         HW for next time: The Web was exactly what Ted Nelson was trying to *prevent*. By surfing the Web,

         find out 5 interesting things about Project Xanadu and/or Ted Nelson.

         (Old HW: read about Xanadu - http://www.xanadu.net, everything it links to, and skim http://www.sfc.keio.ac.jp/~ted/XUsurvey/xuDation.html)

 

========The current lecture is either the next one, or if not, is nearby ==========

 

W 4/27/05: The Xanadu vision; a paper (ser. #36)

F 4/29/05: Project presentations, 5 minutes per person

 

Here is what I need regarding your project:

- hard copy of the source code

- email containing the code

- list of features (about 1 page or less)

- instructions for running it

- *brief* list of things you learned from the project

- comments about your teammate's contributions, if they

did significantly more than their share or significantly less

- copies of your transparencies (I already have this from some of you)

- a demo of the program. Email me to schedule when. Anytime

 will do if I am available at that time.

THE END!

 

 

Supplementary Notes:

(37) M 4/26/04: Xanadu technical details (ser. #37)

(34) HW for today: read the passage handed out from "World Brain," by H.G. Wells. Hand in some observation, question, or comment on the content. Keep working on your projects.

Introduction to digital libraries: H. G. Wells and Vannevar Bush; the paper "As We May Think"  

(84) W 4/21/04: Presentation by Guy Howard; example: versions of ISU's home page; versioning and archiving the Web

(31) Information retrieval performance - interactive systems

(30a) : Presentations by N. Korba, D. Davidson; guest lecture by Jinghao Miao on the color bar user study and its relevance to MultiBrowser

(16a) Advanced IR Models, sections 2.7-2.7.3

(17) Advanced IR Models II

           HW to prepare for (20)

(20) W 3/6/02: Web caching. 2002 notes on one paper. (last year's notes on another paper)

            HW: give progress report on project: what did you do since last Friday? What will you do next?

            Also read the paper we discussed more carefully (you can download it from ResearchIndex).

            We will discuss it more next time.

(21) F 3/8/02:  Discuss more about last time's paper. 

            HW for Monday: read  Document authentication and permanence service

(22) M 3/11/02: Document authentication and permanence service (aka Web version archiving)  

             HW  for Wednesday and Friday and the (offline) solution 

(24) F 3/15/02: Napster, gnutella, file sharing, and the future; Web Tracking if time

 

(32) W 4/10/02: J. Ding, D. Berleant, D. Nettleton, and E. Wurtele, Mining MEDLINE: Abstracts, Sentences, or Phrases?, Pacific Symposium on Biocomputing (PSB 2002), Kaua'i, Hawaii, Jan. 3-7, pp. 326-337. 

(33) F 4/12/02: The role of "verbs" in mining MEDLINE

        HW: go out onto the Web and find one interesting thing about digital libraries. Print it out on a sheet of paper, with the URL, in big font. Slide it under my door by 10:45 Monday. I will copy onto transparency. You will help me present your fact in class.

(35) W 4/17/02: Digital Libraries, chapter 15 (some links messed up by Explorer - they're not messed up in the html source. weird)

        HW for Friday: work on project, read about Xanadu - http://www.xanadu.net, everything it links to, and skim http://www.sfc.keio.ac.jp/~ted/XUsurvey/xuDation.html

(37) M 4/22/02:  Xanadu II (see notes 36)

(38) W 4/24/02: Project presentation by Haider Q., Ali M. (Multisearch engine browser)

                          XML (as time allows)

 

Sentences vs. phrases for biomolecular interaction extraction http://class.ee.iastate.edu/berleant/home/me/cv/papers/Presentation.htm 

(44) Email me a "transparency" (about 10 1-line large font facts) discussing a document-related acronym, term, or concept of your choice (see e.g. http://www.w3.org/) and be prepared to speak from it in class. Document-related acronyms, terms, and concepts and a pdf intro

(45) Relevance feedback, etc.

(46) Notes on Theng et al.

(47) Notes on Hong et al.; HW

(48) More on Web search engines - will the Web get too big for search engines to handle? User study, spring 2001

(49) Word frequencies and word frequencies on the Web

(50) Electronic paper

(51) Content permanence

 

(52) String frequency analyses. Discuss term frequency issues including Zipf (small font version here); HW for next time

(52) Direct Multidisplay (WISER and MultiBrowser)

(52) Html and home pages, 12/11/95

(52) Future of the Web, 12/4/95

(52) Parliamentary procedure for email meetings lab, 12/1/95

(53) Parliamentary procedure for email meetings, 11/29/95

(54) Importance of macroprocessors, automated abstract generation, machine translation and rules of order for email meetings, 11/27/95

(55) Concordances and collocations: work at U. of A., 11/15/95

(56) Implementation of a concordance generator, 11/10/95

(57) Concordances and collocations, 11/8/95

(58) More on spell checking and correcting, 11/6/95

(59) Spell checking and correcting, 10/30/95

(60) A new approach to spell checking and correcting, 10/27/95

(61) Cyberspace and security, 10/25/95, plus HW due Friday

(62) Trapdoor based encryption, 10/23/95

(63) Future technology in Japan, 10/20/95

(64) Data Encryption Standard (DES), 10/18/95

(65) Text Processing class HW due 10/18/95

(66) Encryption and decryption, 10/16/95

(67) Plans, assignments, and quiz schedule (starting 10/6/95)

(68) Round table discussion, 10/4/95

(69) Text Processing HW for 10/4/95

(70) Lempel-Ziv compression, 10/2/95

(71) Walkthroughs, 9/29/95

(72) Compression, 9/25/95

(74) Walkthroughs, 9/20/95

(75) Text Processing class, HW handed out 9/18/95

(76) Editor design and implementation, 9/18/95

(77) Advanced types of editors, 9/15/95

(78) Taxonomy of text editors, 9/13/95

(79) Text Processing class, HW3

(80) Information retrieval II, 9/11/95

(81) Storage and indexing, 9/8/95

(82) Results of research and course intro, 9/6/95

(83) An in-class mini-research project, 9/1/95

Additional potential topics

Searching the Web (chapter 13) esp. Search Engines (13.4)

Multimedia IR: Models & Languages (chapter 11)

User Interfaces and Visualization (chapter 10)

Text operations (chapter 7) esp. Text Compression (7.4) esp. Statistical Methods (7.4.3) Also lexical analysis (7.2.1)

JavaScript

Perl

XML

Multimedia IR: indexing and searching (chapter 12)

Parallel and Distributed IR (chapter 9) esp. not for searches

Statistical text analysis

…and many more

Text mining and parallel programming

Artificial life and complex adaptive systems - relation to text mining

Differences/similarities between text mining and data mining

Document similarity measurement

Evaluating quality of search engines

Future technologies

HTTP protocol (Get, Post, etc.)

IR as it pertains to the Web in particular

Synonym and other approximate word matching

 

Answers to some common questions about the course

 

Background knowledge

 

Q: Is knowledge of Java or any other particular programming language required?

A: No.

 

Q: What kind of statistics background is required?

A: None. But we may discuss statistics a bit.

 

Projects

 

Q: How big can project teams be?

A: 1, 2, or 3 definitely. For 4 (or more) see me first.

 

Q: What is expected of a team project?

A: The bigger the team, the bigger the project should be.

 

Q: What deliverables will be required for the project?

A: Progress reports, specifications, design, and a working system with hard and soft copy of the code and instructions to run it.

 

Quizzes and Tests

 

Q: Will quizzes be announced?

A: Sometimes yes, sometimes (often) no. Unannounced quizzes will be easy if you've done the assigned reading.

 

Q: Any midterm or final exams?

A: No.

 

Grading

 

Q: Isn't 95% for an A too high?

A: I'll reduce it if needed to ensure that at least 1/3 of the class gets an A.

 

Other aspects of the course

 

Q: Does course work deal with internet programming?

A: One hands-on programming exercise, plus your term project.

 

Q: Can I sit in without registering?

A: See me.

 

Q: Will any programming languages be discussed?

A: We may do a week of Perl, and maybe a little Javascript. No Java (take my other course for that!).

 

Q: How will breadth and depth of coverage be balanced?

A: We will cover technical details of some important topics, and survey some others.

 

Answers to more questions about the course

 Q: What are the grading criteria for the term project?

A: Mostly, like in any other course, except as described in the syllabus, and

- for group projects, team  members will be offered the opportunity to assess the contributions of the other members, and this will be taken into account

Q: Will grading be on a curve?

A: It's mapped directly to your percentage (see syllabus). I reserve the right to curve if it *increases* grades, however. I hope and anticipate everyone who makes a reasonable effort on the assignments would get B or better. See syllabus for more details.

Q: Do we have to document our project?

A: No large paper is needed, but instructions for running the program must be included, and I will ask for specs and designs in the middle of the semester.

Q: When is the due date for project proposal?

A: There will be frequent due dates for incremental progress. Example: today.

Q: HWs are 100 pts. total?

A: No, each. Many students like lots of points.

Q: Quizzes on reading - what if you do the reading, but didn't understand it?

A: Quizzes will be shallow. If they were after discussion they would be deeper.

Q: What kind of project is suitable for someone who knows little C/C++?

A: Specifications should include a fairly basic foundation, and plenty of "if time allows" additional parts.

A: You can change your mind about what to do any time.

Q: What about text mining in a broadband context where the material is actually multimedia?

A: As long as text is part of the material, you can text mine it.

Q: What dist. of points will there be between HW and quizzes?

A: Very roughly 50-50. Purpose of HWs is to develop understanding of concepts, purpose of quizzes is to prepare you to understand and discuss in class.

Q: Non-textbook reference availability?

A: Passed out in class or available on Web. (Regular library materials too.)

 

Other Useful Links

 

some natural language links

A company that sells a text analysis IDE: http://www.textanalysis.com/ 

Parsing of natural language: 

       http://www.cs.kun.nl/agfl/ this link will take you to the AGFL home page.

The Spelling Checker and Corrector, 11/1/95

Web site of an information retrieval course.

Web site of a course on data and information management