Text formatting and item removal help needed.

Discussion in 'PHP' started by exodus, Jul 11, 2011.

  1. #1
    Not sure how to do this with PHP.

    I have a text file that I need to fix. I would like to replace every single <br> in the file with nothing (""), but keep any runs of two or more back-to-back ones, for example <br><br> or <br><br><br>. How can I do this?

    For example:

    
    <br>A girl went to the market <br>to fetch a loaf of bread.<br><br> She was happy that <br>she had enough money to get the bread.<br><br> When walking home from the <br>store she started to sing a song to herself<br><br><br> After getting home she seen that she already<br> had a loaf of bread in her bread box<br><br> She felt a little silly when she found out<br><br>
    
    Code (markup):
    I would do it by hand, but I have 900,000 lines of this stuff. That is the reason why I would like to automate it.

    I was thinking I could do a simple search and replace in a text editor: find every <br><br><br> and replace it with a high-ASCII character, then do the same for <br><br> with a different high-ASCII character. After that, search/replace all the leftover single <br> with nothing, and finally replace each high-ASCII character with the <br><br><br> or <br><br> it stands for.

    How would you go about fixing this problem?
     
    exodus, Jul 11, 2011
  2. SiJz

    #2
    The way I've done this before may be the long way round, but it worked well. It is pretty much what you describe at the end of your post.

    Use str_replace like this, where $data_line contains the line of text you want to sort out:

    $data_line = str_replace("<br><br><br><br><br>", "#5BR#", $data_line);
    $data_line = str_replace("<br><br><br><br>", "#4BR#", $data_line);
    $data_line = str_replace("<br><br><br>", "#3BR#", $data_line);
    $data_line = str_replace("<br><br>", "#2BR#", $data_line);

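    The #2BR# etc. is just a placeholder token; it needs to be something that will not appear in your text.
    Make sure you work from the highest repeat count down to 2. If you replace <br><br> first, a run of three becomes #2BR#<br>, and the leftover <br> is then stripped as if it were a single one.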

    then:
    $data_line = str_replace("<br>", "", $data_line);

    to lose the single <br>

    then

    $data_line = str_replace("#5BR#", "<br><br><br><br><br>", $data_line);
    $data_line = str_replace("#4BR#", "<br><br><br><br>", $data_line);
    $data_line = str_replace("#3BR#", "<br><br><br>", $data_line);
    $data_line = str_replace("#2BR#", "<br><br>", $data_line);

    to restore the multiple BRs
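The three steps above can be sketched as one function. This is only an illustration: the function name and the $maxRun parameter are made up here, not part of the original suggestion. Note the limit of the token approach: a run longer than $maxRun can lose a tag (six in a row with $maxRun = 5 comes back as five), so pick a $maxRun at least as large as the longest run in the data.

```php
<?php
// Hypothetical helper (name and $maxRun are assumptions for illustration).
// Applies the token-swap steps for runs of 2..$maxRun consecutive <br> tags.
function strip_single_br_with_tokens(string $line, int $maxRun = 5): string
{
    // 1. Protect runs, longest first, so a long run is never split
    //    into a shorter token plus a stray single <br>.
    for ($n = $maxRun; $n >= 2; $n--) {
        $line = str_replace(str_repeat('<br>', $n), "#{$n}BR#", $line);
    }

    // 2. Any <br> still present is a single one: drop it.
    $line = str_replace('<br>', '', $line);

    // 3. Put the protected runs back.
    for ($n = $maxRun; $n >= 2; $n--) {
        $line = str_replace("#{$n}BR#", str_repeat('<br>', $n), $line);
    }

    return $line;
}
```

Called as strip_single_br_with_tokens($data_line), it should behave like the hand-written sequence of str_replace calls, as long as the #nBR# tokens never occur in the text itself.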

    Hope that helps,

    Si
     
    SiJz, Jul 11, 2011
  3. yusuyi

    #3
    $str = "<br>A girl went to the market <br>to fetch a loaf of bread.<br><br> She was happy that <br>she had enough money to get the bread.<br><br> When walking home from the <br>store she started to sing a song to herself<br><br><br> After getting home she seen that she already<br> had a loaf of bread in her bread box<br><br> She felt a little silly when she found out<br><br>";


    $patterns = "/[^<br>]<br>[^<br>]/i";

    echo preg_replace($patterns, "", $str);

    shows:
    <br>A girl went to the marketo fetch a loaf of bread.<br><br> She was happy thathe had enough money to get the bread.<br><br> When walking home from thetore she started to sing a song to herself<br><br><br> After getting home she seen that she alreadhad a loaf of bread in her bread box<br><br> She felt a little silly when she found out<br><br>

    The leading <br> is not matched because no character precedes it; you could delete that one by hand.
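One thing to watch in the pattern above: [^<br>] is a character class, so it matches any one character other than <, b, r or >, and that character is consumed along with the tag. That is why the output shows "marketo" and "thathe". A possible alternative, sketched here with a made-up function name, is to use lookarounds, which check the neighbouring text without consuming it:

```php
<?php
// Hypothetical helper, not from the posts above.
function strip_single_br(string $text): string
{
    // (?<!<br>)  the tag is not preceded by another <br>
    // (?!<br>)   the tag is not followed by another <br>
    // Lookarounds only assert, so the surrounding letters are left alone.
    // preg_replace returns null on error; fall back to the input then.
    return preg_replace('/(?<!<br>)<br>(?!<br>)/i', '', $text) ?? $text;
}

echo strip_single_br('<br>market <br>to fetch bread.<br><br> She was happy');
// prints: market to fetch bread.<br><br> She was happy
```

This should also remove a single <br> at the very start or end of the string and leave runs of any length untouched. For a file with 900,000 lines, the function could be applied one line at a time (for example with fgets) rather than loading everything at once.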
     
    yusuyi, Jul 11, 2011