Full-Text Search: Lucene_02 (Analyzers: IKAnalyzer for Chinese; Index Maintenance: Adding, Deleting, and Updating Documents; Index Queries: TermQuery, RangeQuery, QueryParser)

Author: ruki_in_HDU
Published: November 22, 2019
Categories: Tech

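6. Analyzers

By default, Lucene uses the standard analyzer, StandardAnalyzer.

6.1 Inspecting an Analyzer's Output

Call the tokenStream method of an Analyzer object to get a TokenStream object, which yields the final tokens.

6.1.1 Steps

1. Create an Analyzer object; here, an instance of the StandardAnalyzer subclass.
2. Call the analyzer's tokenStream method to get a TokenStream object.
3. Add a CharTermAttribute to the TokenStream. It acts as a reference (much like a pointer) to the text of the current token.
4. Call the TokenStream's reset method. (Skipping this call throws an exception.)
5. Iterate over the TokenStream with a while loop.
6. Close the TokenStream.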

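    // Imports assumed: org.junit.Test, org.apache.lucene.analysis.Analyzer,
    // org.apache.lucene.analysis.TokenStream,
    // org.apache.lucene.analysis.standard.StandardAnalyzer,
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    @Test
    public void testTokenStream() throws Exception {
        // 1. Create an Analyzer; here, an instance of the StandardAnalyzer subclass
        Analyzer analyzer = new StandardAnalyzer();
        // 2. Get a TokenStream from the analyzer (the field name is left empty for this demo)
        TokenStream tokenStream = analyzer.tokenStream("", "The Spring Framework provides a comprehensive programming and configuration model.");
        // 3. Add a CharTermAttribute; it always refers to the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. Reset the TokenStream (skipping this call throws an exception)
        tokenStream.reset();
        // 5. Iterate over the TokenStream and print each token
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Signal the end of the stream, then close the TokenStream
        tokenStream.end();
        tokenStream.close();
    }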

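6.2 Using IKAnalyzer (a Chinese Analyzer)

6.2.1 Steps

1. Add the IKAnalyzer jar to the project.
2. Add the configuration file and the dictionaries to the project's classpath.
- Note: never edit the extension dictionary with Windows Notepad, and make sure the dictionary file is encoded as UTF-8.
- Extension dictionary: adds new words that the analyzer should treat as single tokens.
- Stop-word dictionary: lists words to drop, such as meaningless words or sensitive terms.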
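
One likely reason for the Notepad warning (this is an assumption; the note above does not say why) is that older versions of Windows Notepad save UTF-8 files with a leading byte-order mark, the bytes EF BB BF, which can then end up attached to the first word of the dictionary. The plain-Java sketch below checks a file for that marker. The class name `DictionaryBomCheck` and the default file name `ext.dic` are hypothetical and not part of IKAnalyzer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of IKAnalyzer: detects a UTF-8 byte-order mark.
class DictionaryBomCheck {

    // Returns true if the given bytes begin with the UTF-8 BOM (EF BB BF).
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Reads up to the first three bytes of the file and checks them for a BOM.
    static boolean fileStartsWithUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of file reached
                }
                total += n;
            }
            return startsWithUtf8Bom(Arrays.copyOf(head, total));
        }
    }

    public static void main(String[] args) throws IOException {
        // "ext.dic" is an assumed file name; pass the real path as the first argument.
        Path dictionary = Paths.get(args.length > 0 ? args[0] : "ext.dic");
        if (!Files.exists(dictionary)) {
            System.out.println("File not found: " + dictionary);
        } else if (fileStartsWithUtf8Bom(dictionary)) {
            System.out.println("BOM found; re-save the file as UTF-8 without BOM.");
        } else {
            System.out.println("No BOM found.");
        }
    }
}
```

Running the check once after editing a dictionary is usually enough; an editor that can save "UTF-8 without BOM" avoids the problem altogether.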

    @Test
    public void testTokenStream() throws Exception {
        // 1. Create an Analyzer; this time the IK Chinese analyzer instead of StandardAnalyzer
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        // 2. Get a TokenStream from the analyzer (the sample text is Chinese on purpose)
        TokenStream tokenStream = analyzer.tokenStream("", "全文检索是将整本书java、整篇文章中的任意内容信息查找出来的检索，java。它可以根据需要获得全文中有关章、节、段、句、词等信息，计算机程序通过扫描文章中的每一个词，对每一个词建立一个索引，指明该词在文章中出现的次数和位置，当用户查询时根据建立的索引查找，类似于通过字典的检索字表查字的过程。");
        // 3. Add a CharTermAttribute; it always refers to the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. Reset the TokenStream (skipping this call throws an exception)
        tokenStream.reset();
        // 5. Iterate over the TokenStream and print each token
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Close the TokenStream
        tokenStream.close();
    }

To use a custom analyzer when building the index, pass it to IndexWriterConfig:

    // 2. Create an IndexWriter on top of the Directory, using IKAnalyzer
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
    IndexWriter indexWriter = new IndexWriter(directory, config);

    @Test
    public void createIndex() throws Exception {
        // 1. Create a Directory that specifies where the index is kept.
        // To keep the index in memory:
        //Directory directory = new RAMDirectory();
        // To keep the index on disk:
        Directory directory = FSDirectory.open(new File("F:\\lucenetest\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory, using IKAnalyzer
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, config);
//        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
        // 3. Read the files on disk and create one Document per file
        File dir = new File("F:\\Java\\12-lucene\\02.参考资料\\searchsource");
        File[] files = dir.listFiles();
        for (File file : files) {
            // File name
            String fileName = file.getName();
            // File path
            String filePath = file.getPath();
            // File content (FileUtils comes from Apache Commons IO)
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // File size in bytes
            long fileSize = FileUtils.sizeOf(file);
            // Create the fields.
            // Argument 1: field name; argument 2: field value; argument 3: whether to store the value
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
            // Create the Document
            Document document = new Document();

            // 4. Add the fields to the Document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);
            // 5. Write the Document to the index
            indexWriter.addDocument(document);
        }

        // 6. Close the IndexWriter
        indexWriter.close();
    }

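7. Index Maintenance

7.1 Field Attributes

- Analyzed: whether the field's content is tokenized. This only makes sense for fields whose content will be searched.
- Indexed: whether the analyzed terms, or the whole field value, are added to the index. Only indexed fields can be searched.
- Stored: whether the field's value is stored in the document. Only stored fields can be read back from a Document.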

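Field class	Data type	Analyzed	Indexed	Stored	Description
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string field that is not analyzed; the whole string is indexed as a single term (for example an order number or a person's name). Store.YES or Store.NO decides whether the value is also stored in the document.
LongPoint(String name, long... point)	long	N	Y	N	LongPoint, IntPoint, and the other point types index numeric values so that they can be searched, for example by range. The value is indexed as a numeric point rather than run through the analyzer, and it is not stored; to store it as well, add a StoredField.
StoredField(FieldName, FieldValue)	Overloaded; supports several types	N	N	Y	Builds a field of any supported type that is neither analyzed nor indexed, only stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y or N	When a Reader is passed, the content is assumed to be large, so the field is indexed but not stored.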

Optimized field definitions

    Field fieldName = new TextField("name", fileName, Field.Store.YES);
    // The path needs neither analysis nor indexing; it only has to be stored
    Field fieldPath = new StoredField("path", filePath);
    Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
    // The file size will be used in range queries later, so a string is a poor fit;
    // index it as a number, and add a separate StoredField if the value must also be stored
    Field fieldSizeValue = new LongPoint("size", fileSize);
    Field fieldSizeStore = new StoredField("size", fileSize);

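7.2 Maintaining the Index

7.2.1 Adding Documents

1. Create an IndexWriter object, using IKAnalyzer as the analyzer.
2. Create a Document object.
3. Add fields to the Document.
4. Write the Document to the index.
5. Close the index.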

    @Test
    public void addDocument() throws Exception {
        // 1. Create an IndexWriter, using IKAnalyzer as the analyzer
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(new File("F:\\lucenetest\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer())
        );
        // 2. Create a Document
        Document document = new Document();
        // 3. Add fields to the Document (the sample values are Chinese on purpose)
        document.add(new TextField("name", "新添加的文件名", Field.Store.YES));
        document.add(new TextField("content", "新添加的文件内容", Field.Store.NO));
        document.add(new StoredField("path", "F:\\Java\\12-lucene\\02.参考资料\\searchsource"));
        // 4. Write the Document to the index
        indexWriter.addDocument(document);
        // 5. Close the index
        indexWriter.close();
    }

7.2.2 Deleting Documents

Delete all documents

1. Delete every document in the index.
2. Close the index.

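    @Test
    public void deleteAllDocument() throws Exception {
        // indexWriter is a field of the test class, assumed to be created beforehand
        // (for example in a @Before method, in the same way as in addDocument).
        // Delete every document
        indexWriter.deleteAll();
        // Close the index
        indexWriter.close();
    }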

Delete documents that match a term

    @Test
    public void deleteDocumentsByTerm() throws Exception {
        // Delete every document whose name field contains the term "apache"
        indexWriter.deleteDocuments(new Term("name", "apache"));
        indexWriter.close();
    }

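7.2.3 Updating Documents

How it works: Lucene first deletes the documents that match the given term, then adds the new document.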

    @Test
    public void updateDocument() throws Exception {
        // 1. Create a new Document
        Document document = new Document();
        // 2. Add fields to the Document (the sample values are Chinese on purpose)
        document.add(new TextField("name", "更新之后的文档", Field.Store.YES));
        document.add(new TextField("content", "更新之后的文档内容", Field.Store.YES));
        // Update: every document whose name field contains the term "spring"
        // is removed, and the new document is added in their place
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // Close the index
        indexWriter.close();
    }
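
The delete-then-add behavior of updateDocument can be pictured with a toy model in plain Java. This is only a sketch under simplifying assumptions: `ToyIndex` and its methods are hypothetical, documents are plain maps, and terms are whitespace-separated lower-cased words. It is not how Lucene stores or matches documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory "index"; it only mimics the delete-then-add behaviour
// of an update and is not Lucene's implementation.
class ToyIndex {
    // Each document maps a field name to its text.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Removes every document whose field contains the term as a
    // whitespace-separated, lower-cased word; returns how many were removed.
    int deleteDocuments(String field, String term) {
        int before = docs.size();
        docs.removeIf(doc -> {
            String text = doc.get(field);
            return text != null
                    && Arrays.asList(text.toLowerCase().split("\\s+")).contains(term);
        });
        return before - docs.size();
    }

    // Update = delete all matching documents, then add the one replacement.
    void updateDocument(String field, String term, Map<String, String> newDoc) {
        deleteDocuments(field, term);
        addDocument(newDoc);
    }

    int size() {
        return docs.size();
    }

    // Convenience factory for a document with only a "name" field.
    static Map<String, String> doc(String name) {
        Map<String, String> d = new HashMap<>();
        d.put("name", name);
        return d;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(doc("spring mvc.txt"));
        index.addDocument(doc("spring README.txt"));
        index.addDocument(doc("apache lucene.txt"));
        // Two documents match "spring"; both are removed and one is added.
        index.updateDocument("name", "spring", doc("updated document"));
        System.out.println(index.size()); // prints 2
    }
}
```

The model shows why the term should identify exactly the documents to be replaced: if several documents match, they all disappear and only one new document comes back. A common choice is a unique, non-analyzed ID field.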

8. Querying the Index

8.1 Using Query Subclasses

8.1.1 TermQuery

Searches by a single term.
- You must specify both the field to search and the term to look for.
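
A minimal TermQuery sketch follows, assuming Lucene 7.4 (the version implied by the jar name used later in this post) on the classpath. To stay self-contained it builds a tiny in-memory index with RAMDirectory and StandardAnalyzer instead of the on-disk index and IKAnalyzer used elsewhere; the class name `TermQueryDemo`, the helper `countMatches`, and the sample documents are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Illustrative only: indexes two sample documents in memory and runs a TermQuery.
class TermQueryDemo {

    // Returns how many sample documents contain the exact term in the given field.
    static int countMatches(String field, String term) {
        try (Directory directory = new RAMDirectory()) {
            // Index two documents whose "name" field is analyzed by StandardAnalyzer.
            try (IndexWriter writer = new IndexWriter(
                    directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String name : new String[] {"spring framework", "apache lucene"}) {
                    Document doc = new Document();
                    doc.add(new TextField("name", name, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            // A TermQuery needs the field and the term; the term is not analyzed.
            try (IndexReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new TermQuery(new Term(field, term));
                return searcher.count(query);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches("name", "spring"));
        System.out.println(countMatches("name", "Spring"));
    }
}
```

Because a TermQuery does not analyze its term, the term has to match what the analyzer wrote to the index. StandardAnalyzer lower-cases tokens, so searching for "Spring" with a capital S is expected to find nothing here, while "spring" matches.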

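8.1.2 RangeQuery

Searches by numeric range. For fields indexed with LongPoint, the query is created with LongPoint.newRangeQuery; both bounds are inclusive.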

    @Test
    public void testRangeQuery() throws Exception {
        // Create a Query: size between 0 and 100 bytes, both bounds included
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        // Run the query and keep the top 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the document id
            int docId = scoreDoc.doc;
            // Load the Document by its id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println("-------------------------------");
        }
        indexReader.close();
    }
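
LongPoint.newRangeQuery includes both bounds. When an exclusive bound is needed, the Lucene documentation suggests shifting it by one with Math.addExact. The small plain-Java helper below sketches that conversion; `RangeBounds` and its method names are hypothetical, and the code does not depend on Lucene.

```java
// Hypothetical helper (no Lucene dependency): turns exclusive bounds into the
// inclusive bounds that an inclusive range query expects.
class RangeBounds {

    // Smallest value strictly greater than the exclusive lower bound.
    // Throws ArithmeticException if the bound is already Long.MAX_VALUE.
    static long inclusiveLower(long exclusiveLower) {
        return Math.addExact(exclusiveLower, 1);
    }

    // Largest value strictly less than the exclusive upper bound.
    // Throws ArithmeticException if the bound is already Long.MIN_VALUE.
    static long inclusiveUpper(long exclusiveUpper) {
        return Math.addExact(exclusiveUpper, -1);
    }

    public static void main(String[] args) {
        // "0 < size < 100" becomes the inclusive range [1, 99].
        System.out.println(inclusiveLower(0L) + ".." + inclusiveUpper(100L)); // prints 1..99
    }
}
```

With these helpers, a query for sizes strictly between 0 and 100 would be written as LongPoint.newRangeQuery("size", RangeBounds.inclusiveLower(0L), RangeBounds.inclusiveUpper(100L)).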

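8.2 QueryParser

QueryParser analyzes the query text first and then searches with the resulting terms.
- Required jar: lucene-queryparser-7.4.0.jar

Steps
1. Create a QueryParser object. It takes two arguments:
  1. Argument 1: the default field to search
  2. Argument 2: the analyzer
2. Use the QueryParser object to create a Query object.
3. Run the query.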

    @Test
    public void testQueryParser() throws Exception {
        // 1. Create a QueryParser with two arguments:
        //    argument 1: the default field to search
        //    argument 2: the analyzer
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // 2. Use the QueryParser to create a Query (the query text is Chinese on purpose)
        Query query = queryParser.parse("lucene是一个java开发的全文检索工具包");
        // 3. Run the query
        printResult(query);
    }

    public void printResult(Query query) throws Exception {
        // Run the query and keep the top 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the document id
            int docId = scoreDoc.doc;
            // Load the Document by its id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println("-------------------------------");
        }
        // Closing the reader here means printResult can only be called once per test
        indexReader.close();
    }

Last modified: November 22nd, 2019 at 03:18 pm

