Docusaurus: Adding Search by Running the Crawler in Docker
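Prerequisites
- The site is built with Docusaurus
- An Algolia account has been registered (this article signs in with GitHub)
- A data source has already been created
Create an index in the Algolia data source
Create the index inside the data source prepared above.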
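Configure Algolia
Configure docusaurus.config.js
themeConfig: {
  // ...
  algolia: {
    apiKey: "Search-Only API Key",
    indexName: "name of the index you just created, not the name of the data source",
    appId: "Application ID",
  },
}
All three values can be looked up in the Algolia dashboard.
Note: each key must carry the right permissions. The apiKey in docusaurus.config.js is shipped to every visitor's browser, so use the Search-Only API Key here. The Admin API Key belongs only in the crawler's .env file described later, because the crawler needs write access to the index.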
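Since docusaurus.config.js is an ordinary Node.js module evaluated at build time, the three values do not have to be hard-coded. The following is a minimal sketch, assuming made-up environment variable names (SEARCH_APP_ID, SEARCH_API_KEY, SEARCH_INDEX_NAME) and a hypothetical helper called algoliaFromEnv; neither is part of Docusaurus itself.

```javascript
// Hypothetical helper (not a Docusaurus API): build the `algolia` block
// from environment variables. The variable names are assumptions.
const ENV_NAMES = {
  appId: "SEARCH_APP_ID",
  apiKey: "SEARCH_API_KEY",
  indexName: "SEARCH_INDEX_NAME",
};

function algoliaFromEnv(env) {
  const config = {};
  const missing = [];
  for (const [field, variable] of Object.entries(ENV_NAMES)) {
    if (env[variable]) {
      config[field] = env[variable];
    } else {
      missing.push(variable);
    }
  }
  if (missing.length > 0) {
    // Fail the build early instead of shipping a broken search box.
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}

// Usage inside docusaurus.config.js:
//   themeConfig: { algolia: algoliaFromEnv(process.env) }
```

The values still end up in the generated site, so this only keeps them out of the repository; it does not make a privileged key safe to use in the browser.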
At this point the familiar search box appears in the top-right corner of the blog.
Next, let's make the search actually work.
Use Docker to crawl the site content and push it to Algolia
Install jq
Install jq on the server; it is used to parse the JSON file.
# OS: CentOS 7
yum install -y epel-release && yum install -y jq
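The crawler receives its whole configuration through a single CONFIG environment variable, so the JSON file has to be flattened into one compact line first; in the commands in this article that is the job of `jq -r tostring`. Purely as an illustration of that transformation (the function name compactJson is made up, and this snippet is not a replacement for jq in the setup), the Node.js equivalent looks roughly like this:

```javascript
// Illustrative equivalent of `jq -r tostring` (hypothetical helper):
// parse the JSON text and re-serialize it without whitespace or line breaks.
function compactJson(text) {
  return JSON.stringify(JSON.parse(text));
}

const pretty = `{
  "index_name": "my-index",
  "start_urls": ["https://example.com/"]
}`;

// Prints: {"index_name":"my-index","start_urls":["https://example.com/"]}
console.log(compactJson(pretty));
```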
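Prepare the configuration files
Create .env and docsearch.json in the project root.
.env (stores the environment variables; the scraper image reads APPLICATION_ID and API_KEY, so use exactly these names)
APPLICATION_ID=xxx
API_KEY=xxx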
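docsearch.json
Change the first three values to match your site: index_name is the indexName from docusaurus.config.js (the index you created), start_urls is your own site address, and sitemap_urls points at your sitemap.xml (Docusaurus has an official way to generate it). stop_urls lists the routes that should not be crawled. Keep the file strict JSON without comments, otherwise jq cannot parse it.
{
"index_name": "your-index-name",
"start_urls": ["https://xxx.xxx/"],
"sitemap_urls": ["https://xxx.xxx/sitemap.xml"],
"stop_urls": ["/search"],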
"selectors": {
"lvl0": {
"selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
"type": "xpath",
"global": true,
"default_value": "Documentation"
},
"lvl1": "header h1, article h1",
"lvl2": "article h2",
"lvl3": "article h3",
"lvl4": "article h4",
"lvl5": "article h5, article td:first-child",
"lvl6": "article h6",
"text": "article p, article li, article td:last-child"
},
"custom_settings": {
"attributesForFaceting": [
"type",
"lang",
"language",
"version",
"docusaurus_tag"
],
"attributesToRetrieve": [
"hierarchy",
"content",
"anchor",
"url",
"url_without_anchor",
"type"
],
"attributesToHighlight": ["hierarchy", "content"],
"attributesToSnippet": ["content:10"],
"camelCaseAttributes": ["hierarchy", "content"],
"searchableAttributes": [
"unordered(hierarchy.lvl0)",
"unordered(hierarchy.lvl1)",
"unordered(hierarchy.lvl2)",
"unordered(hierarchy.lvl3)",
"unordered(hierarchy.lvl4)",
"unordered(hierarchy.lvl5)",
"unordered(hierarchy.lvl6)",
"content"
],
"distinct": true,
"attributeForDistinct": "url",
"customRanking": [
"desc(weight.pageRank)",
"desc(weight.level)",
"asc(weight.position)"
],
"ranking": [
"words",
"filters",
"typo",
"attribute",
"proximity",
"exact",
"custom"
],
"highlightPreTag": "<span class='algolia-docsearch-suggestion--highlight'>",
"highlightPostTag": "</span>",
"minWordSizefor1Typo": 3,
"minWordSizefor2Typos": 7,
"allowTyposOnNumericTokens": false,
"minProximity": 1,
"ignorePlurals": true,
"advancedSyntax": true,
"attributeCriteriaComputedByMinProximity": true,
"removeWordsIfNoResults": "allOptional",
"separatorsToIndex": "_",
"synonyms": [
["js", "javascript"],
["ts", "typescript"]
]
}
}
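Before handing the file to the crawler it can be worth a quick sanity check, since a stray comment or an empty index_name only shows up later as a failed run. The sketch below is a hypothetical pre-flight check; the function name checkDocsearchConfig and the rules it applies are assumptions about a sensible config, not the scraper's own validation.

```javascript
// Hypothetical pre-flight check for docsearch.json. Returns a list of
// problems; an empty list means nothing obviously wrong was found.
function checkDocsearchConfig(text) {
  let config;
  try {
    // JSON.parse is strict: `//` comments or trailing commas make it throw.
    config = JSON.parse(text);
  } catch (error) {
    return [`not valid JSON: ${error.message}`];
  }
  if (config === null || typeof config !== "object" || Array.isArray(config)) {
    return ["top-level value must be a JSON object"];
  }

  const problems = [];
  if (typeof config.index_name !== "string" || config.index_name.trim() === "") {
    problems.push("index_name must be a non-empty string");
  }

  const urls = Array.isArray(config.start_urls) ? config.start_urls : [];
  if (urls.length === 0) {
    problems.push("start_urls must be a non-empty array");
  }
  for (const entry of urls) {
    // Entries may be plain strings or objects carrying a `url` field.
    const url = typeof entry === "string" ? entry : entry && entry.url;
    if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
      problems.push(`start_urls entry is not an http(s) URL: ${JSON.stringify(entry)}`);
    }
  }

  if (config.selectors === null || typeof config.selectors !== "object") {
    problems.push("selectors must be an object");
  }
  return problems;
}

const sample = '{"index_name": "my-index", "start_urls": ["https://example.com/"], "selectors": {}}';
console.log(checkDocsearchConfig(sample));
```

To check the real file, pass it the text of docsearch.json, for example the result of `require("node:fs").readFileSync("docsearch.json", "utf8")`.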
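Server setup
Create a fixed directory on the server to hold the .env and docsearch.json files.
Change into that directory and run:
docker run -it --network your-network-name --env-file=.env -e "CONFIG=$(cat docsearch.json | jq -r tostring)" algolia/docsearch-scraper
When the server prints > DocSearch: https://…… the crawled content is being pushed to Algolia.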
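If the run fails on credentials, the variable names in .env are a likely culprit: the GitHub workflow in this article hands the container APPLICATION_ID and API_KEY, and variables under other names are simply ignored. The following sketch checks a .env file for those two names; parseEnvFile and missingScraperVariables are hypothetical helpers, and the parser only understands plain KEY=value lines.

```javascript
// Hypothetical helper: parse the text of a .env file into key/value pairs.
// Handles only plain KEY=value lines, blank lines, and # comments.
function parseEnvFile(text) {
  const values = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const separator = line.indexOf("=");
    if (separator <= 0) continue; // not a KEY=value line
    values[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  }
  return values;
}

// Names passed to the scraper container in this article's GitHub workflow.
const EXPECTED = ["APPLICATION_ID", "API_KEY"];

// Returns the expected names that are absent or empty in the .env text.
function missingScraperVariables(text) {
  const values = parseEnvFile(text);
  return EXPECTED.filter((name) => !values[name]);
}

console.log(missingScraperVariables("APPLICATION_ID=xxx\nAPI_KEY=xxx\n"));
console.log(missingScraperVariables("ALGOLIA_APP_ID=xxx\n"));
```

An empty result means both names are present; anything else lists what the container would be missing.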
Automate the crawl with GitHub Actions
In the project root, open .github/workflows/docsearch.yml (create it if it does not exist):
name: docsearch
on:
  push:
    branches:
      - main
jobs:
  algolia:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Get the content of docsearch.json as config
        id: algolia_config
        run: echo "config=$(cat docsearch.json | jq -r tostring)" >> "$GITHUB_OUTPUT"
      - name: Run algolia/docsearch-scraper image
        env:
          ALGOLIA_APP_ID: ${{ secrets.ALGOLIA_APP_ID }}
          ALGOLIA_API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
          CONFIG: ${{ steps.algolia_config.outputs.config }}
        run: |
          docker run \
            --env APPLICATION_ID=${ALGOLIA_APP_ID} \
            --env API_KEY=${ALGOLIA_API_KEY} \
            --env "CONFIG=${CONFIG}" \
            algolia/docsearch-scraper
The secrets.ALGOLIA_APP_ID and secrets.ALGOLIA_API_KEY referenced here must be configured as secrets in the GitHub repository; for detailed steps see: Configuring secrets
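The trigger of the GitHub Action can be changed:
- Trigger on push to the main branch:
on:
  push:
    branches:
      - main
- Trigger on a deployment (the event fires when a deployment is created):
on: deployment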