From: Lucian Mogosanu
Date: Mon, 14 Oct 2019 15:53:55 +0000 (+0300)
Subject: Add draft for mp-wp exploration guide
X-Git-Url: https://git.mogosanu.ro/?a=commitdiff_plain;ds=inline;p=thetarpit.git

Add draft for mp-wp exploration guide
---

diff --git a/drafts/000-mp-wp-exploration-guide.markdown b/drafts/000-mp-wp-exploration-guide.markdown
new file mode 100644
index 0000000..07aa1fc
--- /dev/null
+++ b/drafts/000-mp-wp-exploration-guide.markdown
@@ -0,0 +1,94 @@
+---
+postid: 000
+title: A guide to systematically exploring the entrails of MP-WP, illustrated using some weird found in the post editor
+date: October 14, 2019
+author: Lucian Mogoșanu
+tags: tech
+---
+
+This seems to be [a recurring theme][lobbes], so I thought it'd be worth looking into it for a bit. My own investigation followed from some weird that I'd discovered while [setting up my blog][welcome], which makes the purpose of this post twofold: on one hand I'm recounting some of my own adventures in MP-WP-land; on the other I'm explaining how to approach the thing in a sane manner, assuming there's a problem to solve there.
+
+In particular, the problem I stumbled upon sounded like [this][welcome]:
+
+> So, to explain the scenario in more detail: I'm inputting "ampersand-lt-semicolon b ampersand-gt-semicolon this is a quoted bold text ampersand-lt-semicolon slash b ampersand-gt-semicolon", saving, previewing, which gives same as your description, i.e. "left quote b right quote this is a quoted bold text left quote slash b right quote".
+>
+> However, upon saving, the text in the box is also converted to "left quote b right quote this is a quoted bold text left quote slash b right quote", so if I save and preview again, this is going to yield "this is a quoted bold text" in bold.
+
+No really, this one's a pretty weird fucker, isn't it? Also, it hasn't got anything to do with previewing things; the whole thing can be reproduced on my setup using the following steps:
+
+1. Create a new draft post
+2. Input (minimally) the text "derpitude &lt;" (sans quotes) in the content field
+3. Press "save draft"
+
+After the third step, the observed behaviour is that "derpitude &lt;" is magically transformed into "derpitude <", while the expected behaviour is that no transformation should occur. There are a few observations that we can proceed from; let's take them one by one.
+
+First off, the scenario has two distinct parts, as per the workings of web browsers and the HTTP protocol. In the first part, the user adds some content and then presses a form submit button -- behind the scenes this translates into an HTTP request, which may be examined using e.g. the tools provided in your browser of choice, or by examining the page elements and trying to reproduce the same request using curl. The latter approach is more labour-intensive, so e.g. Firefox's "developer tools" thingie proves to be very useful here. Anyway, once the (in our case POST) request is sent to the server, the server processes it and sends some response.
+
+Which brings us to the second part of the scenario, in which the server has finished processing, so it cooks up a response, sends it back to the client, which client processes the response and e.g. renders some shit on the screen. Things may get more complicated if you have JavaScript enabled in your browser, which raises the question of whether the issue can be reproduced on some setup where JS is disabled. It can: I've done it. Anyway, keep this in mind.
+
+Now the question becomes: where in this "type text; send POST; process; make response; send response; render" pipeline does the transformation occur? We can easily discard the first two steps: for the first, we have JS disabled, so we're confident that no text manipulation occurs while we type stuff; while for the second, we can look at the request content in the browser as we press the button, and... well, as far as I've seen, it looked exactly as expected, i.e. "derpitude &lt;".
+
+The second observation that we can proceed from is that the post editor being part of WordPress' admin interface, we expect that the processing and response cooking code can be found somewhere in the wp-admin directory of our MP-WP installation. I wouldn't know, to be honest: in fact, I'll readily confess I have very little experience with MP-WP, so I'm stuck digging. Digging where, though? We need to start somewhere, don't we?
+
+Looking at our POST request, we notice that the destination is `wp-admin/post.php`. The code here is a sausage of conditional switches, so it'll probably take a while to figure out what goes on in there. One intuition coming out of nowhere: the HTTP POST request sets an "action" variable to the value "editpost", and there's also a variable named `$action` in this code, so the "editpost" case of the switch is probably the one handling our request -- annoying, but what can I do. The easiest way to test that this code path is exercised is to simply add a
+
+~~~~
+echo 'is this printed?'; exit();
+~~~~
+
+statement to our code and rerun the scenario. Noticing that the code behaves as expected, we delete our print and we start reading the thing line by line. It's pretty short, so let's quote it here:
+
+~~~~
+	$post_ID = (int) $_POST['post_ID'];
+	check_admin_referer('update-post_' . $post_ID);
+
+	$post_ID = edit_post();
+
+	redirect_post($post_ID);
+
+	exit();
+	break;
+~~~~
+
+So the post ID is taken from the client input; then the referer is checked; then the `edit_post` function is called. This is interesting, but this "edit post" is nowhere in sight, so we're going to go to the MP-WP installation directory and run[^1]:
+
+~~~~
+$ grep -rw edit_post | grep function
+~~~~
+
+which outputs exactly one line, which looks exactly like the definition of our function, in `wp-admin/includes/post.php`. Anyway, looking at the definition of this `edit_post`, we notice that it does some processing, most of which is a bunch of unknown, then it calls `wp_update_post`, which is a wrapper over `wp_insert_post`, which does some sanitization and updates the database with the new content. Speaking of which, let's take a look at how the database looks:
+
+~~~~
+mysql> select post_content from wp_posts where ID = 246;
++-----------------------------------------------------+
+| post_content                                        |
++-----------------------------------------------------+
+| derpitude <                                         |
++-----------------------------------------------------+
+1 row in set (0.00 sec)
+~~~~
+
+This means that the content we've inputted reaches the database as we gave it, so the "post content" side of our scenario works the way we want. This is the good part; the bad part is that we've wasted about half an hour digging into a bunch of code that doesn't reveal anything about our problem. Finally, we've only dug into about half of the code and we don't yet know a thing about how the display side of this loop works.
+
+There isn't a single way to look at this either. We could start from one end, i.e. finding out where the post editor printing code lies and what it does, or from the other, that is, looking at the generated page sources and trying to reverse from there. Unfortunately there is no such thing as a "best" approach here; you'll just have to pick one and work through it as long as the approach looks reasonable, letting past experience guide you.
+
+As far as we're concerned, we'll start where we left off, in our "editpost" snippet above. We observe that `edit_post` returns a post ID, which is passed as parameter to `redirect_post`, which sends a redirect to the client, causing the latter to do (this time) a GET with the "post" parameter set to the post ID and the "action" parameter set to "edit". So now we're back to that switch-case sausage, more precisely at the "edit" case; which does a lot of (on the surface) incomprehensible shit, then does the following:
+
+~~~~
+	include('edit-form-advanced.php');
+~~~~
+
+i.e. it calls the script that literally prints the page content. So, are you going to read six hundred lines of *that*?
+
+Of course, we're not interested in all the shit, but we do want to see how our input field is displayed. So we show the page source in the browser, do a search for "derpitude" (why else do you think I put that there?) and find a div id="postdiv" etc. So we... actually, wait, before this, do you notice how our "derpitude" input is displayed there? It's precisely with an "&lt;", only the browser interprets that as a "<" and converts it accordingly when sending the POST, which confirms that the bug is on the display side: that should have really been "&amp;lt;" for the browser to show what we intended it to.
+
+Anyway, we go back to `edit-form-advanced.php`, search for "postdiv" and we find that it's exactly before a call to a function called `the_editor`, which after some grepping can be found in `wp-includes/general-template.php`. There's not much to see here except some content being displayed, and if we debug-print the content here (using the "echo" method above), we'll see that it looks precisely the same as in the page sources.
+
+So this is where we stop and think... maybe the displayer code should contain a call to PHP's `htmlentities` somewhere? At least that's what I'm thinking, that the database should contain the format as given by the user and the displayer should do all the escaping. But really, is there a contract between this `the_editor` and its callers regarding the expected input? The function is called from no fewer than three places (page, comment and post editor) and there's nothing in the description to give us a hint of what goes where.
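To make the round trip concrete, here's a toy model of it in shell. It's only a sketch under loud assumptions: `escape_html` and `browser_decode` are hypothetical helpers made up for this illustration (the first stands in for what a call such as PHP's `htmlentities` would do on the display side, the second for what the browser does to the textarea content before submitting it), they only know about `&`, `<` and `>`, and none of this is MP-WP code.

```shell
# Hypothetical helpers, not MP-WP code: a toy model of the editor round trip.

# Escape '&' first, so that the ampersands introduced by the other two
# rules are not escaped a second time.
escape_html() {
    printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Model of the browser reading the textarea: decode entities, '&amp;' last.
browser_decode() {
    printf '%s\n' "$1" | sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

stored='derpitude &lt;'   # the text as the user typed it

# Observed behaviour: the stored text is printed into the page as-is,
# so the browser hands back a bare '<' on the next save.
browser_decode "$stored"

# With escaping on the display side, the text survives the round trip.
browser_decode "$(escape_html "$stored")"
```

The first call prints "derpitude <", which is the bug; the second prints "derpitude &lt;", which is what we wanted. Whether that escaping belongs in `the_editor` or in its callers is exactly the contract question above.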
+
+More generally: it's a well-known fact that the WordPress code has been written by idiots, since it lacks a proper spec of how to use those functions; but at the same time, and weirdly enough, it's *very* well structured and hacking it is mostly a matter of grepping for function F that does thing X, laying some prints in there and observing its behaviour under user input. And the process is rather fast too: there's usually no kernel rebuilding involved, so yes, you're encouraged to fuck with things and fuck them up, assuming you don't let the fuckups reach production stage.
+
+[^1]: The `-r` in grep stands for "recursive". The `-w` stands for "word", i.e. this instructs our grep to match "edit\_post" but not "reedit\_post". There are probably other ways to skin this cat, e.g. installing an indexer for C-like languages, but I'll be damned if I'm going to bother with this.
+
+[lobbes]: http://www.krankendenken.com/2019/10/mp-wp-bot-workplan-now-until-november-3rd/?b=I%20need&e=.#select
+[welcome]: http://thetarpit.org/2019/welcome-to-the-tar-pit#comment-130
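The effect of `-w` described in the footnote can be checked on a throwaway tree. This is a sketch, assuming a GNU or BSD grep; the directory layout, the file contents and the `reedit_post` decoy are made up for the illustration and are not part of MP-WP. An explicit `.` is passed as the search path, since not every grep defaults to the current directory under `-r`.

```shell
# Build a scratch tree with one real-looking definition and one decoy.
tmp=$(mktemp -d)
mkdir -p "$tmp/wp-admin/includes"
printf '%s\n' '<?php' 'function edit_post() { }' > "$tmp/wp-admin/includes/post.php"
printf '%s\n' '<?php' 'function reedit_post() { }' > "$tmp/wp-admin/decoy.php"   # hypothetical decoy

# -r recurses; -w matches edit_post only as a whole word, so the decoy
# (where edit_post is preceded by the word characters "re") is skipped.
matches=$(cd "$tmp" && grep -rw edit_post . | grep function)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Only the line from `wp-admin/includes/post.php` should come out; drop the `-w` and the decoy shows up as well.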