So-net無料ブログ作成

Wikipediaのデータをダウンロードして使ってやる

いきなり本番データベースサーバーにウィキペディアデータを突っ込むのもなんなんで、ローカルでいらない部分を削ったり、煮たり焼いたりする。

ウィキペディアデータをどういった用途で利用するかにもよるが、はっきり言って、ページタイトルが入ったpage_title、本文が入ったold_text、page_idの3種類のデータがあればいいんで、これらをひとつのテーブルにまとめる。

必要なもの
データベース
ウィキペディアデータ
mysql-connector-java-5.1.39-bin.jar
mwdumper.jar

流れとしては、まずローカルサーバーにデータベース(wikidb)を作成し、page、revision、textの3つのテーブルを定義しウィキペディアデータをインポート。

INNER JOINで内部結合処理を行い、page_id、page_title、old_textの3カラムデータを集結させたひとつのテーブルにまとめる。

内部結合時にどうでもいいアップロード情報や削除情報を削る。

最終的にはワードプレス側からpage_titleカラムに検索をかけ、ヒットに応じて同レコード上にあるテキストデータを抽出して表示させるといった内容。

これに伴いpage_titleをインデックス化。

ここではubuntuで仮想サーバーを構築しmysql 5.7の環境を前提に解説を進める。

CREATE DATABASE wikidb CHARACTER SET utf8;

mysqlコマンドライン上でwikidbというデータベースを作成、文字コードの指定も同時に行う。

show databases;

show databasesで作成したwikidbデータベースを確認

use wikidb;

wikidbデータベースを選択。

テーブル定義文(内容は下記)
-- -- Core of the wiki: each page has an entry here which identifies -- it by title and contains some essential metadata. -- CREATE TABLE /*_*/page ( -- Unique identifier number. The page_id will be preserved across -- edits and rename operations, but not deletions and recreations. page_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, -- A page name is broken into a namespace and a title. -- The namespace keys are UI-language-independent constants, -- defined in includes/Defines.php page_namespace int NOT NULL, -- The rest of the title, as text. -- Spaces are transformed into underscores in title storage. page_title varchar(255) binary NOT NULL, -- Comma-separated set of permission keys indicating who -- can move or edit the page. page_restrictions tinyblob NOT NULL, -- Number of times this page has been viewed. page_counter bigint unsigned NOT NULL default 0, -- 1 indicates the article is a redirect. page_is_redirect tinyint unsigned NOT NULL default 0, -- 1 indicates this is a new entry, with only one edit. -- Not all pages with one edit are new pages. page_is_new tinyint unsigned NOT NULL default 0, -- Random value between 0 and 1, used for Special:Randompage page_random real unsigned NOT NULL, -- This timestamp is updated whenever the page changes in -- a way requiring it to be re-rendered, invalidating caches. -- Aside from editing this includes permission changes, -- creation or deletion of linked pages, and alteration -- of contained templates. page_touched binary(14) NOT NULL default '', -- This timestamp is updated whenever a page is re-parsed and -- it has all the link tracking tables updated for it. This is -- useful for de-duplicating expensive backlink update jobs. page_links_updated varbinary(14) NULL default NULL, -- Handy key to revision.rev_id of the current revision. -- This may be 0 during page creation, but that shouldn't -- happen outside of a transaction... hopefully. page_latest int unsigned NOT NULL, -- Uncompressed length in bytes of the page's current source text. page_len int unsigned NOT NULL, -- content model, see CONTENT_MODEL_XXX constants page_content_model varbinary(32) DEFAULT NULL ) /*$wgDBTableOptions*/; CREATE UNIQUE INDEX /*i*/name_title ON /*_*/page (page_namespace,page_title); CREATE INDEX /*i*/page_random ON /*_*/page (page_random); CREATE INDEX /*i*/page_len ON /*_*/page (page_len); CREATE INDEX /*i*/page_redirect_namespace_len ON /*_*/page (page_is_redirect, page_namespace, page_len); -- -- Every edit of a page creates also a revision row. -- This stores metadata about the revision, and a reference -- to the text storage backend. -- CREATE TABLE /*_*/revision ( -- Unique ID to identify each revision rev_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, -- Key to page_id. This should _never_ be invalid. rev_page int unsigned NOT NULL, -- Key to text.old_id, where the actual bulk text is stored. -- It's possible for multiple revisions to use the same text, -- for instance revisions where only metadata is altered -- or a rollback to a previous version. rev_text_id int unsigned NOT NULL, -- Text comment summarizing the change. -- This text is shown in the history and other changes lists, -- rendered in a subset of wiki markup by Linker::formatComment() rev_comment tinyblob NOT NULL, -- Key to user.user_id of the user who made this edit. -- Stores 0 for anonymous edits and for some mass imports. rev_user int unsigned NOT NULL default 0, -- Text username or IP address of the editor. rev_user_text varchar(255) binary NOT NULL default '', -- Timestamp of when revision was created rev_timestamp binary(14) NOT NULL default '', -- Records whether the user marked the 'minor edit' checkbox. -- Many automated edits are marked as minor. rev_minor_edit tinyint unsigned NOT NULL default 0, -- Restrictions on who can access this revision rev_deleted tinyint unsigned NOT NULL default 0, -- Length of this revision in bytes rev_len int unsigned, -- Key to revision.rev_id -- This field is used to add support for a tree structure (The Adjacency List Model) rev_parent_id int unsigned default NULL, -- SHA-1 text content hash in base-36 rev_sha1 varbinary(32) NOT NULL default '', -- content model, see CONTENT_MODEL_XXX constants rev_content_model varbinary(32) DEFAULT NULL, -- content format, see CONTENT_FORMAT_XXX constants rev_content_format varbinary(64) DEFAULT NULL ) /*$wgDBTableOptions*/ MAX_ROWS=10000000 AVG_ROW_LENGTH=1024; -- In case tables are created as MyISAM, use row hints for MySQL <5.0 to avoid 4GB limit CREATE UNIQUE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id); CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp); CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp); CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp); CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp); CREATE INDEX /*i*/page_user_timestamp ON /*_*/revision (rev_page,rev_user,rev_timestamp); -- -- Holds text of individual page revisions. -- -- Field names are a holdover from the 'old' revisions table in -- MediaWiki 1.4 and earlier: an upgrade will transform that -- table into the 'text' table to minimize unnecessary churning -- and downtime. If upgrading, the other fields will be left unused. -- CREATE TABLE /*_*/text ( -- Unique text storage key number. -- Note that the 'oldid' parameter used in URLs does *not* -- refer to this number anymore, but to rev_id. -- -- revision.rev_text_id is a key to this column old_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, -- Depending on the contents of the old_flags field, the text -- may be convenient plain text, or it may be funkily encoded. old_text mediumblob NOT NULL, -- Comma-separated list of flags: -- gzip: text is compressed with PHP's gzdeflate() function. -- utf8: text was stored as UTF-8. -- If $wgLegacyEncoding option is on, rows *without* this flag -- will be converted to UTF-8 transparently at load time. -- object: text field contained a serialized PHP object. -- The object either contains multiple versions compressed -- together to achieve a better compression ratio, or it refers -- to another row where the text can be found. old_flags tinyblob NOT NULL ) /*$wgDBTableOptions*/ MAX_ROWS=10000000 AVG_ROW_LENGTH=10240; -- In case tables are created as MyISAM, use row hints for MySQL <5.0 to avoid 4GB limit

上記の内容をprt_tables.sqlで保存してホームにおく。

source prt_tables.sql

wikidbデータベースにテーブルを定義。

show tables from wikidb;

wikidbデータベース内に作成されたテーブルを確認。

wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2

ウィキペディアデータをダウンロード。

java -jar mwdumper.jar --format=sql:1.5 jawiki-latest-pages-articles.xml.bz2 | mysql -u root -p wikidb

ダウンロードが終了したらmwdumperを使いウィキペディアデータをデータベースに突っ込む。

インポートし終わるまではかなりの時間を要する。

続く・・・。


nice!(0) 

nice! 0