This is a code repository for IR Homework by yinita (April-June 2023).
As you can see in there, there are 7 experiment example codes(include the final experiment). In the last few experiments, there are some sensitive data I would not give, so you may need to change it to your own accounts.
- Master the establishment process of Inverted index;
- Master the merging algorithm of inverted record tables;
- Understand the simple invocation of Lucene's open source information retrieval library.
- Realize the establishment of Inverted index
- Implementing Boolean retrieval algorithms
- Master skip pointers technology, proximity search technology, Edit distance calculation method, pronunciation based correction technology, etc;
- Understand the simple invocation of PorterStemmer open-source programs.
- Given a portion of the location index, please determine which documents match which queries
- Implement an inverted record table merging algorithm based on skip table pointers, and verify the correctness of the algorithm on the example in Experiment 1.
- Implement a merging algorithm for two inverted record tables in neighboring search, and make appropriate improvements according to requirements.
- Implement Dynamic programming algorithm to calculate the Edit distance of two strings
To master the statistical analysis methods of text datasets, as well as the construction and compression techniques of indexes.
- For the 593 documents in the attachment "HW3. txt" (each line represents a document with document IDs ranging from 1 to 593): Use Chinese word segmentation tools (such as Jieba Chinese word segmentation) to segment text; Count the total number of tokens and terms in 593 documents; Build a Inverted index and output the queried document ID
- Implement VB encoding and decoding
- Implement encoding and decoding of Gamma codes
master document scoring and probability based retrieval techniques such as Vector space model, BM25 and Language Model.
-
Calculate the similarity values between 105 English documents in the attachment "HW4_1. txt" and return the 5 documents with the highest similarity for each document in documents 1-10 (if the similarity is the same, the document with the smaller document number will be returned first). It is required to use cosine similarity and TF-IDF to calculate the similarity between documents (keep two digits after the Decimal separator). English words are converted to lowercase without filtering or other conversion.
-
Calculate the similarity values between 593 Chinese documents in the attachment "HW4_2. txt" and return the 5 documents with the highest similarity for each document in documents 1-10 (if the similarity is the same, the document with the smaller document number will be returned first). It is required to use cosine similarity and TF-IDF to calculate the similarity between documents (two digits after the Decimal separator are reserved).
To master the basic classification and clustering algorithms.
-
Implementation of Feature selection method: (1) Implement Feature selection method based on Mutual information; (2) Realize the Feature selection method based on X ^ 2; (3) Use Chinese word segmentation tools to segment documents; (4) Use the 2022 news documents obtained from the school official document (a total of 200); (5) Using five categories as labels, select the 10 most relevant features for each category;
-
Implement naive Bayesian classification algorithm: (1) Implement a simple document classification system based on naive Bayesian classification algorithm; (2) Compare the classification effect of Feature selection and non Feature selection; (3) Train and test using the documents in question (1);
To master classic link analysis algorithms such as PageRank and HITS.
- Implementation of PageRank algorithm
- Calculate PageRank value manually
- HITS algorithm implementation
-
Use open-source Chinese word segmentation programs (such as stuttering word segmentation) to segment Chinese sentences, and provide detailed instructions and configurations.
-
Use the open source English Stemming algorithm (such as the Porter algorithm) to process English documents and queries, and explain how to call related functions.
-
Develop an information retrieval system that supports user interaction, including input boxes and result displays (including hyperlinks), with a simple and aesthetically pleasing interface.
-
Use Inverted index to store data and analyze the Time complexity of Inverted index construction.
-
Use TF-IDF as the weight calculation method, provide Formula, explain the symbols and code implementation in the formula, and analyze Time complexity.
-
Use cosine similarity to calculate the correlation score of<query, document>, provide Formula, explain the symbols and code implementation in the formula, and analyze Time complexity.
-
Sort the matched documents based on the correlation score, return the top 10 results, and display the query time.
-
Real time update of system data, supporting large-scale data download or crawling.
-
Optimize the storage Space complexity of dictionary or inverted record table.
-
Support indexing news attachments.
-
Support displaying non textual content, such as images.
-
Support search function by domain, including news classification for different domains.
-
Provide friendly user interaction, such as font coloring in red, to help users understand the results.
-
Consider other factors besides cosine similarity to sort the results, such as timeliness.
-
Support query suggestions and provide candidate queries.
-
Support query and error correction, identify error terms and provide candidate queries.
-
Other functions that are considered distinctive should be clearly explained.